Incrementality Testing in Triple Whale
What is incrementality?
Incrementality measures the true causal effect of your marketing. It answers a core question: how many sales are directly attributable to your ads, and how many would have occurred naturally without any advertising?
For example: imagine you’re browsing online, intending to buy a certain scarf. On your way to the site, you’re served an ad for that exact scarf. After you make the purchase, should the ad get credit? Traditional attribution metrics like ROAS (Return on Ad Spend) would say yes, because the purchase happened after you saw the ad. But iROAS (Incremental Return on Ad Spend) asks the more important question: would you have bought the scarf even without the ad?
That difference, correlation versus causation, is the foundation of incrementality. It ensures you’re not just tracking activity, but uncovering the true impact of your marketing.
Why measure incrementality?
Incrementality establishes causality. Tools like Multi-Touch Attribution (MTA) and Marketing Mix Modeling (MMM) are powerful in their own right, but are both rooted in correlation, not causation. They can show how conversions increase when campaigns are active, but they can’t confirm that ads were the true drivers of those conversions.
MTA distributes credit across multiple touchpoints, but it can’t prove which ones truly drove the purchase.
MMM analyzes historical spend and outcomes, but a correlation between the two doesn’t necessarily mean one caused the other.
Incrementality experiments fill this gap by testing what happens when ads are shown versus withheld.
MTA and MMM remain powerful measurement tools, and they become even more effective when combined with incrementality. Ideally, these three methods should be used in tandem: by validating results and isolating true cause and effect, incrementality provides the causal ground truth you can use to calibrate insights from your MTA and MMM.
MTA: High-frequency insights into individual touchpoints.
MMM: A cross-channel, big-picture perspective on spend and outcomes.
Incrementality: The causal framework that validates and grounds your findings in reality.
Together, this trio gives marketers the most complete, reliable measurement strategy possible.
Popular types of incrementality tests
Conversion Lift Studies
Conversion Lift studies are controlled experiments, often run on platforms like Meta or Google, that measure the causal impact of ads by comparing two groups:
Test group – exposed to ads
Holdout group – not exposed to ads
The goal is to isolate the incremental conversions generated by advertising, rather than relying on modeled or last-click attribution. Because users are randomly assigned to each group, these studies can provide a clearer view of whether ad spend is truly driving business outcomes.
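For intuition, here is a minimal sketch of how incremental conversions and lift are derived from the two groups. The group sizes and conversion counts are hypothetical, purely for illustration:

```python
# Minimal sketch: deriving incremental conversions from a conversion lift study.
# Group sizes and conversion counts below are hypothetical.

test_users = 100_000        # randomly assigned, eligible to see ads
holdout_users = 100_000     # randomly assigned, ads withheld
test_conversions = 2_300
holdout_conversions = 2_000

test_rate = test_conversions / test_users
holdout_rate = holdout_conversions / holdout_users

# Conversions the test group delivered beyond the baseline implied by the holdout
incremental_conversions = (test_rate - holdout_rate) * test_users
lift_pct = (test_rate - holdout_rate) / holdout_rate * 100

print(f"Incremental conversions: {incremental_conversions:.0f}")  # 300
print(f"Lift: {lift_pct:.1f}%")                                   # 15.0%
```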
However, Conversion Lift studies have several downsides:
Attribution reliance – Results are tied to walled-garden platforms, meaning they reflect only the platform’s view of conversions and lack transparency.
Not cross-channel – These studies typically measure only within a single platform and don’t account for broader marketing effects or cross-channel impact. Also, not all providers offer this kind of testing.
Privacy constraints – User-level data and randomization introduce privacy concerns and are becoming harder to execute as platforms restrict data availability.
GeoLift
GeoLift shifts measurement from tracking individuals to evaluating regions, removing the need for attribution models altogether. By comparing test and control geographies, advertisers get a direct read on what marketing actually changes in the market.
Independent of attribution: No dependence on platform reporting or last-click models—impact is observed directly, not inferred.
Cross-channel visibility: Works across every channel and medium, digital or offline, without being limited by walled gardens.
Real-world results: Captures the true incremental lift in sales and conversions, not just what’s visible in clickstreams.
Privacy-safe and scalable: Because no personal data is needed, it’s resilient to privacy changes and flexible enough to measure any marketing initiative.
How GeoLift works
GeoLift is a causal experiment design that compares the performance of “treated” geographies with “control” geographies to estimate the incremental effect of a campaign. The key ideas are:
Treatment vs. control geographies. Your marketing budget is changed in selected test regions while similar control regions maintain their usual spend. Matching geographies on historical performance and demographics makes the comparison fairer[8].
Synthetic control. For each test region, GeoLift constructs a weighted combination of control geographies that approximates what would have happened without the spend change[9]. The difference between the treated region and its synthetic control reflects the incremental lift.
Power analysis. GeoLift runs simulations to provide recommended test designs (baseline vs. high confidence) and expected confidence intervals[10]. Selecting a high confidence plan increases the chance of detecting true lift.
Implementing the test. When you launch a test, adjust budgets as recommended and remove the holdout geographies from your campaign targeting on each ad platform so those audiences stop receiving ads. This ensures your control regions are not exposed to the campaign.
After the test period, the incremental lift is calculated by comparing the KPI (revenue, conversions, etc.) in the treated geographies to the synthetic control. The result is presented as iROAS, lift % and incremental revenue in Triple Whale’s dashboard.
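For intuition, here is a minimal sketch of the synthetic-control arithmetic described above, with hypothetical weekly revenue figures and illustrative weights. In practice, GeoLift fits the weights on pre-test data so the weighted controls closely track the treated region’s history:

```python
# Minimal sketch of the synthetic-control idea behind a geo test.
# All revenue figures and weights are hypothetical.
import numpy as np

# Weekly revenue during the test period for the treated region and three control regions
treated = np.array([120_000, 125_000, 131_000, 128_000])
controls = np.array([
    [100_000, 101_000, 103_000, 102_000],  # control region A
    [ 90_000,  92_000,  93_000,  91_000],  # control region B
    [110_000, 111_000, 114_000, 112_000],  # control region C
])

# Weights would normally be fit on pre-test data; illustrative values here
weights = np.array([0.5, 0.2, 0.3])

# The synthetic control approximates what the treated region would have done
# without the spend change
synthetic = weights @ controls

incremental_revenue = (treated - synthetic).sum()
lift_pct = incremental_revenue / synthetic.sum() * 100
print(f"Incremental revenue: ${incremental_revenue:,.0f}")
print(f"Lift: {lift_pct:.1f}%")
```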
GeoLift vs. Conversion Lift
Feature | GeoLift | Meta Conversion Lift |
Unit of randomisation | Geographic regions (DMAs, states, commuting zones). | Individuals within your Provider’s audience. |
Test intervention | Adjusts spend in test regions (increase or decrease). | Shows or hides ads to randomised audience segments. |
Data requirement | Historical KPI data per region; at least 90 days[11]. | Minimum $5,000 in spend and 500 optimised conversions; proper pixel tracking[12]. |
Holdout/control | Synthetic control built from a weighted combination of control regions[13]. | Meta automatically creates a control audience that does not see the ads[14]. |
Use cases | Measure incremental revenue or conversions across all platforms (paid search, social, etc.); calibrate attribution and MMM. | Measure incremental impact of Meta campaigns; calibrate Meta attribution. |
Incrementality in Triple Whale
Triple Whale provides a self‑serve incrementality dashboard where you can create, monitor and analyse experiments without leaving your command centre. Each experiment appears in the Experiments list with its name, type, status and date range (see example below).
Creating an experiment
1. Select the experiment type. From the Select Experiment Type screen, choose between GeoLift for geographic tests and Meta Conversion Lift for Meta’s native lift studies.
Configure your experiment:
2. Primary metric: Choose the primary KPI you want to measure (e.g. New Customer Revenue, Revenue, New Customer Acquisition); Triple Whale uses this to suggest a suitable test setup.
3. Regional granularity: Select the geographic level (DMA, state, etc.) for GeoLift.
4. Campaign selection: Select which campaigns or channels to include. Triple Whale shows the average weekly spend for each channel to guide selection.
5. Review recommended setups (GeoLift only). After selecting campaigns, Triple Whale runs a power analysis and presents multiple options for your experiment. Each option shows the minimum duration, required spend reduction, expected metric impact and the list of holdout regions. The recommended option is marked for convenience, and you can view a map of the regions before proceeding. As input to the power analysis, you are asked to enter your expected iROAS/CPiA (incremental Return on Ad Spend / Cost Per incremental Acquisition) for the test. Triple Whale suggests a value calculated from your historical data to serve as a reference point, but your expert judgment is the most critical factor, so adjust this value based on your unique knowledge of the campaign.
6. Launch your experiment. Confirm the configuration and click Next. For GeoLift, Triple Whale will pause or adjust spend in the holdout regions according to the chosen option. For Meta Conversion Lift, Triple Whale calls Meta’s API to create the lift study on your behalf.
Monitoring results
After the experiment period ends, Triple Whale calculates and displays the results for each metric with confidence ratings. Results are not shown during the test because the control and treatment groups must be observed for the full duration to estimate lift accurately. For example, a GeoLift test might show:
iROAS (Incremental ROAS): the incremental revenue divided by incremental spend. This differs from platform ROAS because it accounts only for incremental conversions[16].
Lift %: the percentage change in the metric caused by the spend change. In the example below, a New Customer Revenue test shows a 25.4% lift.
Incremental Revenue: the additional revenue attributable to the campaign, e.g., $1,100.
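To make the arithmetic concrete, here is a minimal sketch of how these headline metrics relate, reusing the $1,100 incremental revenue and 25.4% lift from the example; the incremental spend and counterfactual (synthetic-control) baseline are hypothetical values chosen for illustration:

```python
# Minimal sketch relating incremental revenue, iROAS and lift %.
incremental_revenue = 1_100   # additional revenue attributed to the campaign
incremental_spend = 500       # hypothetical extra spend in the treated regions
baseline_revenue = 4_331      # hypothetical counterfactual (synthetic control) revenue

iroas = incremental_revenue / incremental_spend          # incremental revenue per incremental dollar
lift_pct = incremental_revenue / baseline_revenue * 100  # % change versus the counterfactual

print(f"iROAS: {iroas:.2f}")     # 2.20
print(f"Lift: {lift_pct:.1f}%")  # 25.4%
```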
Triple Whale also compares the GeoLift results with attribution models such as first click, last click, linear and triple attribution (with and without view‑through). This helps you see how traditional attribution compares to the true incremental impact.
Additional metrics — like Revenue, Acquisitions and New Customer Acquisitions — are presented below the primary metric. Each section indicates whether lift was detected and provides a confidence rating (“High”, “Medium” or “No lift detected”).
Ending your experiment
Let the test run to completion. Ending a test early can reduce statistical power. The results page indicates when sufficient data has been gathered.
Restore normal spend. After the holdout period or when you stop the Meta lift study, return campaigns to their original targeting or budget.
Interpret and act on findings. Use iROAS, lift % and incremental conversions to inform budget allocation and refine your marketing mix. If the test shows little or no lift, consider reallocating spend to more effective channels or creative.
Best practices for GeoLift and lift studies
Choose relevant campaigns. Select campaigns or channels that generate enough conversions to produce meaningful lift estimates. Avoid including very small or recently launched campaigns that may not yield reliable results.
Follow Triple Whale’s recommendations. The platform’s suggested test duration, spend reduction and holdout region are based on power analysis. Sticking to these parameters balances statistical power and business impact.
Avoid confounding variables. During the experiment, don’t make major creative refreshes, promotions or budget shifts outside of the planned intervention; these introduce confounding variables that muddy the result.
Run the test to completion. Let the experiment run for the full recommended duration. Ending early can compromise statistical confidence.
Interpret and iterate. Once results are available, use iROAS, lift % and incremental revenue to inform budget reallocations and future experiments.
For more on how to operate your shop throughout the experiment, see Running a Clean GeoLift Test: Dos and Don’ts below.
Conclusion
Incrementality testing provides a rigorous, privacy‑safe way to measure the true impact of your marketing campaigns. By comparing test and control groups and focusing on lift, you gain actionable insights into which channels and strategies drive real growth. Triple Whale’s incrementality platform integrates both geographic GeoLift tests and Meta’s Conversion Lift studies, allowing you to design experiments, monitor results and act on insights within a single dashboard. Armed with metrics like iROAS and lift %, you can make data‑driven decisions and optimise your marketing spend for maximal business impact.
Understanding pre-test iROAS/CPiA
The goal of a GeoLift experiment is to confidently measure the campaign’s true incremental lift. The power analysis determines the budget you need to make this true lift statistically visible, and your expected iROAS (Incremental Return on Ad Spend) is the key input that unlocks that calculation.
The relationship is a direct trade-off: to see a specific, true lift, you can either spend less with high efficiency (high iROAS), or you must spend more to compensate for low efficiency (low iROAS).
How iROAS Impacts the Required Spend
Let's assume the true, underlying lift of your campaign happens to be 5%. The power analysis tells you what it will take to scientifically prove that 5% lift occurred.
With a High Expected iROAS: Your ad dollars are highly effective. The power analysis will show that a lower spend is sufficient to make a 5% lift statistically visible. Your campaign’s efficiency makes the lift easier to see, requiring less investment to prove it.
With a Low Expected iROAS: Your ad dollars have a weaker impact. The power analysis will conclude that a much higher spend is required to produce and measure that same 5% lift, compensating for the poor efficiency.
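To illustrate the trade-off, here is a minimal back-of-the-envelope sketch with hypothetical numbers. The actual power analysis also accounts for noise in your baseline and the length of the test, which this sketch ignores:

```python
# Minimal sketch: required spend to produce a target lift, given expected iROAS.
# All figures are hypothetical.
baseline_revenue = 1_000_000   # revenue expected in the test regions over the test window
target_lift = 0.05             # the 5% true lift we want to make statistically visible

for expected_iroas in (4.0, 2.0, 1.0):
    # Incremental revenue needed to produce the target lift
    needed_incremental_revenue = target_lift * baseline_revenue
    # Spend required to generate that incremental revenue at this efficiency
    required_spend = needed_incremental_revenue / expected_iroas
    print(f"Expected iROAS {expected_iroas:.1f} -> required spend ~${required_spend:,.0f}")
```

The same logic underlies the recommendation further down: a conservative (lower) iROAS estimate pushes the recommended spend up, which protects the test against being underpowered.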
The Risks of Budget Miscalculation
Setting the right budget is critical because your iROAS estimate directly influences the recommended spend. Misjudging this can lead to two negative outcomes:
1. The Risk of Underspending (The More Damaging Risk)
This occurs when you are overly optimistic and input a higher iROAS than the campaign actually achieves.
The Result: The power analysis recommends a budget that is too low. The campaign runs, but the true lift it generates is too small to be distinguished from the normal fluctuations in your baseline sales.
The Consequence: The experiment is inconclusive. You spend money on both the media and the measurement study but get no clear answer. You might wrongly abandon a good strategy because the test was "underpowered" and failed to detect its real impact. This is a wasted investment in both time and money.
2. The Risk of Overspending
This occurs when you are overly pessimistic and input a lower iROAS than the campaign actually achieves.
The Result: The power analysis recommends a very high budget to ensure the signal can be detected.
The Consequence: The experiment will almost certainly be successful and detect the true lift. However, you have allocated more budget than was necessary, leading to inefficient use of capital. That excess money could have been used for other marketing initiatives. While less damaging than an inconclusive test, it represents a significant opportunity cost.
Recommendation: Be Conservative When Uncertain
If you are not confident in your expected iROAS, it is always safer to be conservative in your estimate.
When performing the power analysis, inputting a lower, more cautious iROAS is the prudent choice.
Why? A conservative iROAS estimate will lead the power analysis to recommend a higher spend. This higher budget acts as an insurance policy. It ensures that even if your campaign's real-world efficiency is only mediocre, the investment will still be substantial enough to generate a measurable, statistically significant result.
An "overpowered" test that yields a clear, confident answer is far more valuable to your business than an "underpowered" test that ends in ambiguity. The risk of inefficiently spending on a successful test is often preferable to the risk of completely wasting your investment on a failed one.
Running a Clean GeoLift Test: Dos and Don’ts
The Golden Rule of Experimentation: Isolate the Variable
To get a trustworthy result from your GeoLift test, it must be treated as a scientific experiment, not just a marketing campaign. The fundamental goal is to determine if one specific change—your planned ad spend in the test regions—causes a specific outcome in sales or conversions.
For this to work, your ad spend must be the only significant difference between your test and control groups during the experiment. Any other major change you introduce is a confounding variable—an external factor that "pollutes" your results and makes it impossible to know what truly caused the outcome.
Think of it like testing a new fertilizer on a plant. You wouldn't test the fertilizer while also giving the plant more sunlight and a different amount of water. If the plant grows, you'd have no idea which of the three changes was responsible. Your ad campaign is the fertilizer; everything else is the sun and water. Keep them constant.
What to Avoid: Actions That Will Invalidate Your Test
These are the most common confounding variables, and they can put your entire investment in the experiment at risk. Avoid them completely.
1. Do Not Introduce Budget and Bidding Instability
Your planned budget and bid strategy are the core of the intervention. Changing them mid-test is like changing the dose of a medicine during a clinical trial and will make your results uninterpretable.
🚫 Example: Two weeks into a four-week test, you notice performance is strong, so you decide to double the daily budget in the test regions. This makes it impossible to separate the effect of the original budget from the effect of the sudden increase.
2. Do Not Change Your Creative, Offer, or Landing Page
Your ads and the user journey must remain consistent. Otherwise, you won't know if a lift came from the ad spend or from a more compelling new ad that wasn't shown for the full duration.
🚫 Example: The variable you are testing is the spend on a specific platform (e.g., Meta). If you change the ads themselves on that platform mid-test, you are introducing a second variable, making it impossible to isolate the impact of the spend alone.
3. Do Not Run Major Overlapping Promotions
A site-wide sale is a massive shock to the system that will almost certainly drown out the subtle signal from your incrementality test.
🚫 Example: You run your annual "Black Friday in July" 30% off site-wide sale. This will cause a massive surge in purchases across all regions (test and control), making it impossible to isolate the comparatively smaller effect of your ad campaign.
What is Safe: Running Your Business as Usual
A business is a living entity and cannot be frozen for a month. The key isn't to stop all activity, but to ensure that your routine operations are applied uniformly across both test and control regions. As long as a change is part of your normal rhythm and affects all customers equally, it becomes part of the stable "baseline" and will not invalidate your test.
1. Business-as-Usual (BAU) Budget Adjustments
Unlike the reactive instability described above, scheduled budget changes that follow a consistent, pre-existing strategy (like seasonal scaling) and are applied broadly are considered part of the baseline. Unplanned changes made in response to early test results are not safe.
2. Website and Product Management
✅ Adding new products to your store, provided it's a soft launch. This is a standard business activity. Since the new items are available to all customers nationwide, the change is applied equally. However, if the product launch is supported by its own major, nationwide marketing campaign (e.g., a large email announcement or a separate ad budget), that campaign introduces a massive confounding variable and must be avoided.
3. Ongoing Marketing Activities
✅ Sending your regular email and SMS campaigns. As long as you send these campaigns to your entire list as usual, this activity is a consistent part of your marketing baseline. The test is designed to measure the additional lift from your paid ads on top of this stable rhythm.
By following these guidelines, you can maintain the scientific integrity of your experiment while still running the day-to-day operations of your business, ensuring the insights you gain are both trustworthy and actionable.
WHAT CAN YOU TEST?
Is a new channel incremental?
Does Branded Search drive new sales, or are those customers likely to purchase anyway?
What’s the right mix of upper and lower funnel ads?
Validate MMM results by testing tactics against each other
Confirm how many of the conversions in your MTA are actually driving value
Determine the optimal spend amount for existing campaigns
TESTING A NEW CHANNEL
Scenario: Imagine a company identifies a channel, let’s say Meta, that appears to be performing well. They occasionally put ad spend into Meta but haven’t seen a strong enough signal to confirm whether investing in this channel will pay off.
To test the hypothesis they run an incrementality test:
Holdout | 50% |
Duration | 4 Weeks |
Primary KPI | Revenue |
Test: For the duration of the test, 50% of the audience is held out and does not see any ads. After the test, results are compared between the population that saw ads and the population that didn’t. The difference in Revenue between the holdout and treatment groups measures the incrementality of this channel.
Takeaway: Suppose the test shows a positive result. The company will likely want to scale up the channel, and afterwards they should test again at the higher spend level to validate how much further to keep scaling.
TESTING IMPACT OF UPPER FUNNEL ON SALES
Scenario: The company is now investing more in its marketing, optimizing Meta campaigns for lower-funnel goals like clicks, revenue, and purchases. But because these tactics target a limited pool of in-market users, the audience becomes oversaturated and returns diminish.
To grow beyond this stage, the company can shift to upper-funnel tactics that create new demand and reach customers earlier in their journey. But these efforts are harder to measure, since they don’t drive immediate sales and aren’t fully captured by standard attribution methods.
Test: They design an experiment focused on upper-funnel conversions, like newsletter sign-ups, to evaluate whether these campaigns expand the pool of potential customers and ultimately contribute to future sales. Half of the country is kept in the holdout group and is exposed to no upper-funnel media, while the other half is served an upper-funnel campaign on top of existing campaigns.
Takeaway: Results show that regions exposed to upper-funnel campaigns significantly outperform the holdout group, confirming that upper-funnel tactics drive stronger downstream sales. Without this controlled test, the impact would not have been visible in standard platform metrics.
VALIDATING MMM AND MTA RESULTS
Scenario: A company uses both Multi-Touch Attribution (MTA) and Marketing Mix Modeling (MMM) to guide its budget decisions. MTA reports that retargeting campaigns on Meta are delivering strong returns, while MMM indicates that paid search is the biggest driver of incremental revenue. The two models provide conflicting stories, and the marketing team isn’t confident in how to allocate budget effectively.
Test: To validate these models, the company runs a GeoLift incrementality experiment. They select several geographic regions to act as holdouts, pausing retargeting and paid search in those areas while keeping spend constant in the treatment regions. After four weeks, they compare performance across test and control groups, measuring incremental lift in both Meta retargeting and paid search. The use of holdouts creates a counterfactual, a clear picture of what would have happened without the ads. This counterfactual baseline makes it possible to isolate the true causal effect of each channel, something neither MTA nor MMM can do on their own.
Takeaway: Results reveal that Meta retargeting, while over-credited in MTA, contributes only modest incremental lift, whereas paid search delivers meaningful incremental revenue, closer to what MMM suggested. By validating each model against a causal framework, the company can refine both MTA and MMM, using incrementality as the ground truth to reconcile discrepancies and guide smarter budget allocation.
VALIDATING MMM EFFICIENCY METRICS
Scenario: An apparel brand uses Marketing Mix Modeling (MMM) to guide its media mix. The model indicates that Pinterest is one of the most efficient channels, showing a strong revenue return for every dollar spent. However, the team knows MMM is correlation-based and wants to validate whether Pinterest truly delivers that level of efficiency.
Test: To do this, they run a GeoLift incrementality experiment. Several regions are randomly assigned to a holdout group where Pinterest campaigns are paused, while other regions continue receiving ads as usual. After four weeks, the company compares sales between the test and holdout groups to calculate Pinterest’s incremental ROAS (iROAS).
Takeaway: By comparing against the holdout, the company discovers that while Pinterest is incremental, its true iROAS is lower than what MMM predicted. This helps the team recalibrate expectations, validate MMM’s outputs, and make more confident budget decisions grounded in causality rather than correlation.
HALO EFFECTS
Scenario: The company sells through its website and also on Amazon. They have a good sense of their marketing’s impact on direct sales but want to understand the halo effect of their ads on Amazon sales, in particular Google’s halo effect.
Test: They design a GeoLift experiment with Amazon orders set as the primary KPI. 60% of the country continues running Google ads (treatment), while the remaining 40% pauses Google (holdout). After the test period, they compare Amazon sales between the two groups to measure the incremental halo effect of Google advertising.
Takeaway: They observe that Google drives a significant amount of incremental Amazon sales. This not only highlights the halo effect of Google advertising but also improves Google’s iROAS when accounting for its omnichannel impact. With this insight, the company can confidently scale Google, knowing it fuels multiple revenue streams. Conversely, they can adjust Amazon investment, recognizing that its efficiency is not solely due to SEO, merchandising, or Prime promotions as previously assumed.
LONGER CONSIDERATION WINDOWS
Scenario: A company sells several high average order value (AOV) products that naturally involve longer consideration periods. For these items, the customer journey often stretches across weeks, with multiple research sessions and touchpoints before purchase. Traditional last-click attribution is especially misleading here since it ignores the broader path to conversion. Platform-reported metrics don’t solve the issue either, since each platform tends to claim full credit for its own touchpoints, leading to inflated results and overlapping attribution.
Test: To better understand true performance, the company runs a Meta incrementality test. They design a 2-week treatment period where a holdout group receives no Meta ads, followed by a 3-week post-treatment observation window. During the treatment period, both groups appear to perform similarly, suggesting little immediate lift. But as the observation window progresses, the treatment group begins to significantly outperform the holdout, reflecting Meta’s role in influencing longer, more complex buying cycles.
Takeaway: The experiment shows that Meta campaigns are driving incremental sales, but the impact emerges over a longer horizon than standard attribution would capture. By measuring results beyond the campaign’s active window, the company uncovers the true value of Meta in shaping demand for high-AOV products, value that would have been missed by last-click metrics or platform reporting alone.