MAB vs. A/B Testing: Choosing the Right Algorithm for Growth

updated on 03 April 2025

TL;DR

In product optimization, multi-armed bandit algorithms (MAB) outperform serial A/B testing (SAB) by requiring significantly less traffic to identify winning variants and delivering a higher overall conversion rate (CVR) uplift. Hence, Levered uses specialized, contextual bandit algorithms to efficiently optimize apps and websites.

In this article, we explore why MAB algorithms outperform traditional A/B testing in identifying variants that drive conversion improvements. We demonstrate this through a simulation of a typical website optimization scenario.

[This is the short version of an article published here.]

Growth optimization vs. product testing

Incremental product growth optimization is inherently different from traditional product work. That’s also why companies such as Meta have dedicated growth orgs that are separate from the core product organization.

In “growth optimization”, testing velocity is key. It differs from classic feature testing in several ways:

  1. Smaller changes: Adjustments are minor and inexpensive, such as changing text or design elements.
  2. More subtle effects: Most changes are low-risk and have a relatively small effect size (positive or negative). Yet, the cumulative effect can be large.
  3. Lower success rate: Few changes are successful. Most don’t bring a statistically significant conversion-rate improvement.

As a result, success in growth optimization hinges on a team’s ability to identify and accumulate many small wins, whereas core product work focuses more on de-risking bigger bets.

Limitations of classic A/B testing

In the context of growth optimization, A/B testing often falls short due to the large sample size required and the rigidity of the statistical approach:

  1. Detecting small uplifts takes too long: For example, imagine a website with 100k monthly visitors and a 1.5% baseline conversion rate (typical for an SMB eCommerce store). To detect a 10% incremental uplift at 95% confidence, an A/B test needs to run for about 8 weeks (roughly 8 months for a 5% uplift); see the sketch after this list.
  2. Multivariate tests aren't feasible: Given the traffic requirements, very few companies can test more than one variable at a time. As a result, dependencies between variables usually remain unknown. For example, identifying the optimal combination of price points and subscription plan features is rarely feasible through A/B testing.
  3. Few ideas can be tested: Teams typically cannot test all their ideas and must make difficult prioritization decisions based on correlational signals. As a result, "success" often depends on the PM’s experience, gut feeling, and luck in selecting the right bets.
  4. Opportunity costs are high: A/B testing hurts overall conversion rates, as a fixed portion of traffic (typically 50%) remains allocated to the underperforming variant for the entire test duration.
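To make point 1 concrete, here is a back-of-the-envelope power calculation for the quoted scenario. It is our own sketch, not code from the article, and it assumes a two-sided test at α = 0.05 with 95% power, which roughly reproduces the 8-week figure:

```python
from math import asin, sqrt
from scipy.stats import norm

# Scenario from point 1: 1.5% baseline CVR, 10% relative uplift, 100k visitors/month
p1 = 0.015
p2 = p1 * 1.10

# Cohen's h effect size for comparing two proportions
h = 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

alpha, power = 0.05, 0.95            # assumed test parameters (not stated in the article)
z_alpha = norm.ppf(1 - alpha / 2)    # two-sided critical value
z_power = norm.ppf(power)

n_per_variant = ((z_alpha + z_power) / h) ** 2
n_total = 2 * n_per_variant

print(f"~{n_per_variant:,.0f} users per variant, ~{n_total:,.0f} in total")
print(f"~{n_total / 100_000 * 4.33:.1f} weeks at 100k visitors/month")
```

Even with a less strict 80% power requirement, the test still needs roughly 110k users, i.e. about a month of the site's entire traffic for a single pairwise comparison.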

Multi-armed bandits

Multi-armed bandits are not new. They have been used successfully in areas such as search or ads optimization for quite some time. However, they are much less commonly used in product optimization compared to A/B testing.

MAB algorithms balance two competing goals: exploration and exploitation. Exploration aims at testing as many ideas as possible, while exploitation aims to maximize the overall conversion rate by showing ideas that worked to as many users as possible.

An effective technique to balance exploitation and exploration is called “Thompson Sampling.” Simply put, it works like this:

  1. Start with a guess: Assume all your options are equally good.
  2. Test one variation: For each option, sample a plausible conversion rate from your current belief and show the user the option whose sample is highest. Early on this is essentially random; later, better options get picked more often.
  3. Update your guess: Adjust your belief about how good the shown option is, based on the observation (did the user convert or not?).
  4. Repeat: With every interaction, traffic gradually shifts toward the options that are most likely the best.
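Concretely, a minimal sketch of this loop for a single variable with three options, assuming binary conversions and Beta(1, 1) priors (the option count and "true" rates below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

true_cvr = np.array([0.020, 0.025, 0.035])   # hypothetical, unknown to the algorithm
successes = np.ones(3)                       # Beta(1, 1) priors: all options start out "equally good"
failures = np.ones(3)

for _ in range(20_000):
    # Sample a plausible CVR for each option from the current belief, show the best sample
    sampled = rng.beta(successes, failures)
    shown = int(np.argmax(sampled))
    converted = rng.random() < true_cvr[shown]

    # Update the belief about the option that was shown
    successes[shown] += converted
    failures[shown] += 1 - converted

impressions = successes + failures - 2
print("traffic share per option:", np.round(impressions / impressions.sum(), 2))
```

After enough interactions, the option with the highest true conversion rate receives the bulk of the traffic, while clearly inferior options stop getting impressions.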

MAB algorithms are particularly data-efficient when paired with so-called “hierarchical Bayesian” models. These models recognize that a product design is a function of multiple variables (so-called “factors”) that may vary in importance.

For example, on a product page, the product image may be more important than the product description and have more influence when it comes to optimizing conversion. The algorithm learns this from user interactions and allocates traffic accordingly—i.e., it prioritizes finding the best image over the best product description. 

The hierarchical approach offers a significant advantage over A/B testing, since it helps avoid wasting traffic on finding the best levels of unimportant factors.
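As an illustration only (this is not Levered's implementation), such a hierarchical structure could be expressed in PyMC roughly as follows. The per-factor scale parameter is what lets the model learn each factor's importance from data; levels of unimportant factors are shrunk toward zero and stop attracting exploration traffic:

```python
import numpy as np
import pymc as pm

n_factors, n_levels = 3, 4

# Toy data: which level of each factor a user saw, and whether they converted
rng = np.random.default_rng(0)
levels_seen = rng.integers(0, n_levels, size=(2_000, n_factors))
converted = rng.binomial(1, 0.03, size=2_000)

with pm.Model() as model:
    # Baseline around a ~3% conversion rate on the logit scale
    intercept = pm.Normal("intercept", mu=-3.5, sigma=1.0)

    # One scale per factor: a large scale means the factor strongly moves conversion
    factor_scale = pm.HalfNormal("factor_scale", sigma=0.5, shape=n_factors)

    # Level effects are shrunk toward zero by their factor's scale (the "hierarchy")
    level_effect = pm.Normal("level_effect", mu=0.0,
                             sigma=factor_scale[:, None],
                             shape=(n_factors, n_levels))

    logit_cvr = intercept + sum(level_effect[f, levels_seen[:, f]]
                                for f in range(n_factors))
    pm.Bernoulli("obs", logit_p=logit_cvr, observed=converted)

    idata = pm.sample(draws=1_000, tune=1_000)
```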

Benchmarking algorithms

At Levered, we use custom hierarchical MAB algorithms for automated product growth optimization. But how do we know this works better than running a series of A/B tests? And how can we quantify “better”?

The best way to directly compare the two approaches is by running a simulation. In such a simulation, we define a typical product optimization scenario and then observe how effective each algorithm is in finding “winners” and improving the overall conversion rate over time.

Figure 1: Algorithm comparison

Defining the scenario

We need to define a scenario that is a fair representation of the problem a growth or CRO team faces when optimizing a user experience, e.g., a landing page, a Shopify store, or the product onboarding journey of a SaaS platform.

Let's define the scenario as follows:

We are optimizing a UX across three variables (“factors”), e.g., the headline, hero image, and CTA copy of a landing page. For each factor, we want to explore four different levels (e.g., four different headlines). This makes 4×4×4 = 64 possible variants.

The three different factors vary in importance and each variant has a “true” conversion rate between 2% and 4% (both ex-ante unknown). The team now aims to find the best-converting variant by either running a series of A/B tests or an MAB optimization.
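One possible way to encode this ground truth for the simulation (our own sketch; the concrete importance weights and effect sizes are assumptions, since the short version does not spell them out):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

# Three factors with four levels each; factor importance decreases from left to right
factor_importance = np.array([1.0, 0.5, 0.2])                      # assumed, for illustration
level_effects = [w * rng.normal(0, 1, size=4) for w in factor_importance]

# Raw "score" of each of the 4 x 4 x 4 = 64 variants
scores = np.array([sum(effects[lvl] for effects, lvl in zip(level_effects, combo))
                   for combo in product(range(4), repeat=3)])

# Rescale the scores so that the true CVRs lie between 2% and 4%
true_cvrs = 0.02 + 0.02 * (scores - scores.min()) / (scores.max() - scores.min())

print(len(true_cvrs), f"{true_cvrs.min():.1%}", f"{true_cvrs.max():.1%}")   # 64 variants, 2.0% ... 4.0%
```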

Figure 2: Optimization scenario

Protocol for A/B testing

Starting from a baseline variant and an alternative variant, we assign incoming users to one or the other and compute empirical conversion rates for both. Once we reach a given sample size, we carry out a pairwise hypothesis test. Under the null hypothesis, we assume that the alternative converts no better than the baseline. The test then calculates how likely the observed data would be under the null.

If we fail to reject the null hypothesis, we assume that the effects we've seen in the data so far are from pure chance and stick with the baseline variant. Otherwise, we adopt the alternative as the new baseline.

This process is repeated until all variants have been tested or the available sample size is exhausted. The final baseline (the last surviving "null") is kept as the winning variant.
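One step of this protocol might look as follows, assuming a one-sided two-proportion z-test at α = 0.05 (the concrete test statistic and threshold are our assumptions; the full code is in the long version of the article):

```python
from statsmodels.stats.proportion import proportions_ztest

def ab_step(conv_baseline, n_baseline, conv_alt, n_alt, alpha=0.05):
    """Return 0 to keep the baseline, 1 to adopt the alternative as the new baseline."""
    # H0: the alternative converts no better than the baseline.
    # A small p-value means the observed data would be unlikely under the null.
    _, p_value = proportions_ztest(
        count=[conv_alt, conv_baseline],
        nobs=[n_alt, n_baseline],
        alternative="larger",          # tests whether the alternative converts better
    )
    return 1 if p_value < alpha else 0

# Example: baseline converts 150/10,000 users, the alternative 185/10,000
print(ab_step(150, 10_000, 185, 10_000))   # -> 1, adopt the alternative
```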

Protocol for multi-armed bandit optimization (MAB)

For MAB, the protocol is even more straightforward:

All 64 variants are tested at the same time. Initially, all variants have the same weight, i.e., the same chance of being shown to a user. We then use Thompson Sampling to gradually shift more traffic to the best performers. Ultimately, all traffic is allocated to the winning variant.
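A compact sketch of this protocol over the full 4×4×4 grid, extending the Beta-Bernoulli loop shown earlier: every variant starts with the same prior, Thompson Sampling picks among all 64 variants at once, and each variant's "optimality probability" (its traffic share) can be estimated from posterior samples. The uniformly drawn true CVRs below are a stand-in for the scenario's ground truth:

```python
import numpy as np

rng = np.random.default_rng(1)
n_variants = 64
true_cvrs = rng.uniform(0.02, 0.04, size=n_variants)    # stand-in for the scenario's true CVRs

alphas = np.ones(n_variants)    # Beta(1, 1) for every variant: equal weights at the start
betas = np.ones(n_variants)

for _ in range(50_000):
    shown = int(np.argmax(rng.beta(alphas, betas)))      # Thompson Sampling over all variants at once
    converted = rng.random() < true_cvrs[shown]
    alphas[shown] += converted
    betas[shown] += 1 - converted

# Optimality probability = how often a variant wins when sampling from all posteriors
draws = rng.beta(alphas, betas, size=(5_000, n_variants))
optimality_probs = np.bincount(draws.argmax(axis=1), minlength=n_variants) / 5_000

best = int(true_cvrs.argmax())
print(f"true best variant {best} currently receives {optimality_probs[best]:.0%} of the traffic")
```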

[Find a more detailed explanation with code examples in the long version of this article here.]

Quantifying the performance of both methods

With the two competing approaches formally defined, we can now turn to the experimental protocol. We empirically test MAB and A/B testing on the design shown in Figure 1. The performance of each algorithm is quantified statistically, i.e., as the average over a large number of simulated experiments. We focus on three characteristics:

  1. Average Conversion Rate: How well, on average, each method turns visitors into customers over time.
  2. Variance of Conversion Rates: Whether the results are steady or all over the place.
  3. Chance of Beating the Other Method: The likelihood that one method outperforms the other as we run more tests.

In the case of MAB, the conversion rate of the system is the average over all variants' conversion rates, weighted by their optimality probabilities (i.e., their current traffic shares). In the case of A/B testing, it is the 50/50 average of the baseline's and the alternative's conversion rates.
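In code, the reported per-step metric could be computed like this (toy numbers; the variable names are ours):

```python
import numpy as np

true_cvrs = np.array([0.020, 0.025, 0.030, 0.040])       # ground-truth CVRs of four example variants
optimality_probs = np.array([0.05, 0.10, 0.25, 0.60])    # MAB's current traffic weights

mab_system_cvr = float(optimality_probs @ true_cvrs)     # weighted average over all variants
ab_system_cvr = 0.5 * (true_cvrs[0] + true_cvrs[3])      # 50/50 split between baseline and alternative

print(f"MAB: {mab_system_cvr:.2%}, A/B: {ab_system_cvr:.2%}")
```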

Figure 3. Conversion uplift of MAB (orange) and A/B testing (blue).

We visualize the outcome of the experiment in Figure 3, which shows the mean conversion rate uplift of both MAB and A/B testing as solid lines, together with the interval that contains 50% of the simulation runs. Both are plotted against the number of user interactions the system has processed. We observe that, on average,

  • MAB reaches a 3x higher relative uplift,
  • needs 7x less traffic to reach a 10% uplift,
  • produces about half the standard deviation, and
  • outperforms A/B testing in 85% of experiments after 15k interactions.

In conjunction, these observations support our conceptual reasoning about the advantages of MAB, and indicate that substantial conversion uplift can be achieved even with moderate traffic volumes.

Discussion

So what does this mean for businesses in need of data-driven product optimization? Should they quit A/B testing and go all in on MAB?

When it comes to testing expensive changes, such as new feature launches, A/B testing is still a valid approach, because both the risk and the expected change in CVR are relatively high. The larger the company and the more traffic the product gets, the more likely it is that A/B testing will work out just fine.

However, when the space of options is large, traffic is sparse, and the potential cost of experimentation is moderate, MAB is the superior alternative, as shown in the simulation above. This applies particularly in the context of CRO in small and medium enterprises.
