A biomarker is a measured characteristic of a patient that relates to biological processes. Genetic markers, antigen presence, and chemical concentrations are all examples of biomarkers. Biomarkers may be classified as prognostic, predictive, surrogate endpoints, or some combination of the three. Prognostic biomarkers give information about the expected course of a disease absent intervention, while predictive biomarkers indicate how various treatments are expected to perform. Surrogate endpoints are endpoints that are not the endpoint of primary interest, but for which the treatment effect predicts the treatment effect on the primary endpoint. I’ll focus here on predictive biomarkers.

In clinical practice, predictive biomarkers are attractive for determining how to treat particular patients. In the clinical trial setting, then, a common goal is to use predictive biomarkers in order to determine to whom a treatment should be given. Specifically, if a predictive biomarker has been identified, a trial may seek to determine whether a treatment should be developed for the entire population, or only for a subset defined by that biomarker. Such a determination is the goal of most biomarker-guided clinical trial designs.

### Stratified designs

Perhaps the most direct approach to determining whether a treatment should be developed for the entire population or only for a subset is to enroll patients from the broad population and then examine the treatment effect stratified by biomarker.
Trials using this strategy are said to use *stratified designs*.
The difference between a stratified design and a simple randomized controlled trial is that patients are not enrolled in the stratified design if their biomarker status cannot be determined (Mandrekar et al, 2013).
This ensures that subgroup membership is not subject to missing data.

At the conclusion of the trial, treatment effect is tested in strata defined by the biomarker.
In a sequential testing approach, a single hypothesis (e.g., a treatment effect with respect to a single endpoint) is tested first in a subgroup and then, only if the first test is significant, in the broad population (Bauer, 1991). This approach is called a *closed testing procedure* and is especially useful when there is substantial prior evidence that the treatment effect is strongest in the subgroup and that the subgroup is sufficiently large in the population to provide adequate power. Alternatively, when the treatment is expected to be broadly effective, the order of the tests may be reversed.
Both of these sequential testing procedures maintain the familywise type I error rate at the nominal level of the individual tests.
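The ordering logic of such a sequential procedure can be illustrated with a minimal Python sketch. The function name `fixed_sequence_test` and the p-values are hypothetical, chosen only for illustration; the sketch assumes the order of the tests is fixed before the data are seen.

```python
def fixed_sequence_test(p_values, alpha=0.05):
    """Test hypotheses in a pre-specified order, each at the full level alpha.

    Testing stops at the first non-significant result; every later
    hypothesis is retained (not rejected) without being tested.
    """
    rejected = []
    still_testing = True
    for p in p_values:
        still_testing = still_testing and p <= alpha
        rejected.append(still_testing)
    return rejected


# Hypothetical p-values, ordered as: subgroup first, then broad population.
print(fixed_sequence_test([0.012, 0.030]))  # [True, True]
print(fixed_sequence_test([0.080, 0.001]))  # [False, False]
```

In the second call the broad-population test is never declared significant, even though its p-value is small, because the subgroup test that gates it failed. That gating is what allows each test to be run at the full nominal level.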

Another approach to stratified designs is the *marker-by-treatment interaction design* (Sargent et al, 2005), in which the biomarker is used to partition the population into two groups. Patients with valid (but not necessarily positive) biomarker test results are enrolled and then randomized separately within each biomarker stratum. The sample size of the study is determined in order to provide sufficient power to test the primary hypothesis independently in each stratum.
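To see what powering each stratum independently implies for trial size, here is a rough sketch using the standard normal-approximation sample size formula for comparing two proportions. The response rates, the function name `n_per_arm`, and the default error rates are illustrative assumptions, not values from any of the cited trials.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided comparison of two proportions
    (normal approximation, 1:1 randomization)."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil(z_total**2 * variance / (p_treatment - p_control) ** 2)


# Hypothetical response rates (control, treatment) in each biomarker stratum.
strata = {"biomarker-positive": (0.30, 0.50), "biomarker-negative": (0.30, 0.40)}

# Each stratum is powered on its own, so the trial size is the sum over strata.
per_stratum = {name: 2 * n_per_arm(*rates) for name, rates in strata.items()}
total = sum(per_stratum.values())
print(per_stratum, total)
```

Under these assumed rates, the stratum with the smaller expected effect dominates the total, which is why this design can become large when the treatment is expected to be only modestly effective in one of the groups.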

Finally, treatment-by-biomarker interaction tests may be carried out prior to the primary hypothesis tests, and subset-specific hypotheses tested only if the corresponding interaction test is significant (Pocock et al, 2002). However, the sample sizes necessary to power such interaction tests are often feasible only in phase III settings.
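A back-of-the-envelope calculation shows why interaction tests are so demanding. The sketch below assumes a continuous outcome with common standard deviation `sigma`, two equally sized biomarker strata, and 1:1 randomization; all function names and numbers are hypothetical. Under those assumptions, the interaction contrast (a difference of two differences) has four times the variance of the overall treatment effect estimated from the same patients.

```python
from statistics import NormalDist


def z_sum_squared(alpha=0.05, power=0.80):
    """(z_{1-alpha/2} + z_{power})^2, the multiplier shared by both formulas."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2


def total_n_main_effect(delta, sigma, alpha=0.05, power=0.80):
    """Total patients to detect a mean difference `delta` between two arms."""
    n_per_arm = 2 * z_sum_squared(alpha, power) * sigma**2 / delta**2
    return 2 * n_per_arm


def total_n_interaction(delta, sigma, alpha=0.05, power=0.80):
    """Total patients to detect a difference of `delta` between the two
    stratum-specific treatment effects, with four equally sized cells."""
    # Var(difference of differences) = 4 * sigma^2 / n_per_cell
    n_per_cell = 4 * z_sum_squared(alpha, power) * sigma**2 / delta**2
    return 4 * n_per_cell


ratio = total_n_interaction(0.5, 1.0) / total_n_main_effect(0.5, 1.0)
print(round(ratio, 6))  # 4.0
```

So even when the interaction is as large as the overall effect one would otherwise power for, roughly four times as many patients are needed, and interactions are usually expected to be smaller than that.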

### Enrichment designs

When a biomarker is rare in the population, standard stratified designs may be underpowered to detect treatment effects in the biomarker-positive subgroup. *Enrichment designs* address this obstacle by screening patients prior to enrollment and enrolling a disproportionately high number of biomarker-positive (or biomarker-negative) patients. Such a design differs from stratified designs in that some patients are turned away based on the result of their biomarker test, even when that test is valid. In some extreme cases, investigators may choose to enroll *only* biomarker-positive (or biomarker-negative) subjects (Simon and Maitournam, 2004).

The working assumption for enrichment designs is that only a subgroup defined by the biomarker will benefit from the treatment. While these designs are most useful when the biomarker prevalence is low, they may still be appropriate in small pilot studies when the biomarker prevalence is moderate (Mandrekar and Sargent, 2011). If the pilot study provides adequate evidence that the treatment effect is substantially stronger in the biomarker subgroup, a larger study may account for this heterogeneity in its design.

While enrichment designs can provide large power benefits over trials targeting an overall treatment effect (disregarding biomarker status), this advantage is attenuated when the alternative is a biomarker-strategy design (Freidlin et al, 2010). Additionally, restricting enrollment to biomarker-positive patients (or preferring them) may dramatically slow enrollment, especially for biomarkers with low prevalence.
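The trade-off between power and enrollment can be made concrete with a rough sketch. It relies on strong simplifying assumptions that are mine rather than the cited authors': a continuous outcome with unit standard deviation, a perfectly accurate assay, a fully enriched trial (biomarker-positive patients only), and by default no treatment effect at all in biomarker-negative patients. The function names and numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sigma=1.0, alpha=0.05, power=0.80):
    """Patients per arm to detect a mean difference `delta` (two-sided test)."""
    z = NormalDist().inv_cdf
    return 2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2


def compare_designs(prevalence, delta_positive, delta_negative=0.0):
    """Randomized and screened patient counts for an enriched trial versus an
    untargeted trial that tests the prevalence-weighted overall effect."""
    overall = prevalence * delta_positive + (1 - prevalence) * delta_negative
    enriched = 2 * n_per_arm(delta_positive)
    untargeted = 2 * n_per_arm(overall)
    return {
        "enriched_randomized": ceil(enriched),
        "enriched_screened": ceil(enriched / prevalence),
        "untargeted_randomized": ceil(untargeted),
    }


print(compare_designs(prevalence=0.25, delta_positive=0.5))
```

With a prevalence of 25%, the enriched trial randomizes far fewer patients than the untargeted one, but it must screen four times as many patients as it randomizes, which is the enrollment cost described above.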

### Biomarker-strategy designs

A much different type of design aims not to identify a treatment effect in the broad or biomarker-specific population, but to determine whether using knowledge of patients’ biomarkers to guide treatment choice improves outcomes. Such a design is known as a *biomarker-strategy design* (Freidlin et al, 2010).

In its simplest form, a biomarker-strategy design randomizes patients into a control group and a strategy group. Patients in the control group are assigned the standard of care, assumed to be the most effective in the broad population. Patients in the strategy group are assigned to one of two treatments (including the standard of care) based on the result of a biomarker assay. Variations of biomarker-strategy designs include assigning patients to one of more than two treatments in the strategy arm (Rosell et al, 2008), and randomizing patients in the control arm among the same treatments assigned to patients in the strategy arm.

An example of the type of inference that can be made from a successful trial is that assigning biomarker-positive patients to the test treatment rather than the standard of care results in better outcomes. Such a conclusion is similar to the conclusions of stratified and enrichment designs, namely, that there is a treatment effect in the biomarker-positive subgroup. However, it is not possible for a biomarker-strategy design to distinguish the situation in which the test treatment outperforms the standard of care on the entire population from the situation in which the test treatment outperforms the standard of care on the biomarker subgroup only.

The statistical properties of biomarker-strategy designs are also inferior to those of stratified and enrichment designs, as assigning a substantial portion of the strategy group to the same treatment as those in the control group dilutes the outcome difference between arms and may result in an underpowered study (Freidlin et al, 2010).
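The dilution is easy to quantify in the simplest version of the design. The sketch below assumes a binary response, a strategy arm in which biomarker-positive patients receive the test treatment and everyone else receives the standard of care, and a standard-of-care response rate that does not depend on biomarker status. The function name and rates are hypothetical.

```python
def strategy_arm_difference(prevalence, p_standard, p_test_in_positive):
    """Expected difference in response rate between the strategy arm and
    the control arm of a simple biomarker-strategy design."""
    control_rate = p_standard
    # Only the biomarker-positive fraction of the strategy arm is treated
    # differently from the control arm.
    strategy_rate = (
        prevalence * p_test_in_positive + (1 - prevalence) * p_standard
    )
    return strategy_rate - control_rate


# A 20-point benefit in biomarker-positive patients (30% prevalence) ...
diluted = strategy_arm_difference(0.30, p_standard=0.30, p_test_in_positive=0.50)
print(round(diluted, 3))  # 0.06
```

A 20-percentage-point benefit in the subgroup shrinks to a 6-point difference between arms, and since required sample size grows roughly with the inverse square of the detectable difference, the strategy design needs many more patients to detect it.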

There are, however, advantages of biomarker-strategy designs that may make them attractive options. The first is that when the biomarker strategy involves a large number of treatments or biomarker levels, it is often impractical to randomize patients in every stratum. Additionally, if there is prior knowledge of optimal treatment for certain biomarker levels, it may be unethical to randomize those patients to another treatment (Freidlin et al, 2010).

### Adaptive designs

Clinical trial designs incorporating biomarkers afford unique opportunities for adapting designs at interim timepoints. More traditional adaptations, such as sample-size re-estimation (Chuang-Stein et al, 2006), may also be used to alleviate common obstacles when dealing with biomarker designs (Wang et al, 2007).

If there are more than two treatments being tested, or several biomarker subgroups involved in the trial, interim analyses may provide sufficient evidence to conclude that some treatment arms, biomarker subgroups, or combination of the two are no longer promising. Adaptive designs may handle these sorts of situations by dropping arms or subgroups from the trial (Mandrekar and Sargent, 2011). In related designs, randomization ratios among arms may be altered as knowledge about treatment effects accumulates, but before arms are dropped entirely.

Another possible adaptation is to alter the threshold of a biomarker measurement that defines a subgroup. For example, if CD4 count is expected to be a determinant of efficacy for an HIV treatment, but the appropriate threshold is unknown, a trial could adjust the threshold at interim analyses so that the final analysis tests the appropriate hypothesis (Jiang et al, 2007). Additionally, adaptive enrichment designs (Simon and Simon, 2013) can use adaptive thresholds to modify enrollment criteria in order to boost power even when the target subgroup is not known ahead of time. However, modifying the targeted subgroup muddles the interpretation of significance tests at the final analysis: the alternative hypothesis is that there is *some* subgroup that benefits from the treatment, but it does not specify which subgroup that is.

### BATTLE

A high-profile completed example of a biomarker-guided clinical trial design is the *Biomarker-integrated Approaches of Targeted Therapy of Lung cancer Elimination* (BATTLE) trial (Zhou et al, 2008). The goal of the BATTLE trial was to determine which of four treatments was most effective in treating patients with non-small cell lung cancer (NSCLC) for whom chemotherapy had failed. Several biomarkers related to the treatment mechanisms were expected to induce heterogeneity in the treatment effects.

The BATTLE trial was a biomarker-stratified design with adaptive randomization ratios. In the design, four treatments were to be considered for five biomarker-defined subgroups. As patients enrolled, they would be classified into one of the biomarker groups, and then randomized to one of the treatments. A Bayesian model was to be used to adaptively set the randomization ratios *in each biomarker group* in response to observed outcomes (disease control at 8 weeks after randomization). A total of 200 patients were expected to be enrolled, with early stopping rules for futility implemented.
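The general idea of response-adaptive randomization within biomarker groups can be sketched in a few lines of Python. This is not the model used in BATTLE: as a deliberately simplified stand-in, it uses an independent Beta-Binomial model for each group and treatment, and sets allocation probabilities proportional to posterior mean disease control rates. The class name, treatment labels, group labels, and prior are all hypothetical.

```python
import random


class AdaptiveRandomizer:
    """Response-adaptive randomization within biomarker groups (toy model)."""

    def __init__(self, groups, treatments, prior=(1.0, 1.0), seed=0):
        self.treatments = list(treatments)
        self.prior = prior
        # [successes, failures] observed for each (group, treatment) pair.
        self.counts = {(g, t): [0, 0] for g in groups for t in self.treatments}
        self.rng = random.Random(seed)

    def posterior_mean(self, group, treatment):
        successes, failures = self.counts[(group, treatment)]
        a, b = self.prior
        return (a + successes) / (a + b + successes + failures)

    def allocation_probabilities(self, group):
        means = [self.posterior_mean(group, t) for t in self.treatments]
        total = sum(means)
        return [m / total for m in means]

    def assign(self, group):
        weights = self.allocation_probabilities(group)
        return self.rng.choices(self.treatments, weights=weights)[0]

    def record(self, group, treatment, disease_control):
        self.counts[(group, treatment)][0 if disease_control else 1] += 1


trial = AdaptiveRandomizer(groups=["G1", "G2"], treatments=["A", "B", "C", "D"])
for _ in range(8):
    trial.record("G1", "A", disease_control=True)
    trial.record("G1", "B", disease_control=False)

print(trial.allocation_probabilities("G1"))  # favors A, penalizes B
print(trial.allocation_probabilities("G2"))  # still equal: no data in G2
```

Because the counts are kept separately per group, outcomes observed in one biomarker group shift the ratios only for that group. A hierarchical model would instead let groups borrow information from one another.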

At the completion of the study (Kim et al, 2011), a total of 341 patients had been enrolled over the course of three years. Equal randomization was used for the first 97 patients, and an additional 158 were adaptively randomized. The rest of the patients could not be randomized due to other illnesses, worsening condition, inability to provide a biopsy, or choosing an alternative treatment. Each treatment had at least an 80% posterior probability of positive efficacy in at least one biomarker subgroup, and different treatments benefited patients in different subgroups.

Several important limitations of the BATTLE trial emerged during implementation. Most notably, the five subgroups defined by the biomarker combinations were less predictive than the individual biomarkers. This diluted the effect of strong predictive biomarkers on the randomization ratios. Additionally, several of the biomarkers were very poor predictors of treatment efficacy. Finally, the effect of adaptive randomization was limited in certain biomarker subgroups because there was not a large difference in treatment effects among candidate treatments.

### I-SPY 2

A major in-progress biomarker-guided clinical trial for locally advanced breast cancer treatments is the *Investigation of Serial studies to Predict Your therapeutic response with imaging and molecular analysis 2* (I-SPY 2). Barker et al (2009) describe the goal and broad outline of the trial:

> I-SPY 2 will compare the efficacy of novel drugs in combination with standard chemotherapy with the efficacy of standard therapy alone. The goal is to identify improved treatment regimens for patient subsets on the basis of molecular characteristics (biomarker signatures) of their disease. As described for previous adaptive trials, regimens that show a high Bayesian predictive probability of being more effective than standard therapy will graduate from the trial with their corresponding biomarker signature(s). Regimens will be dropped if they show a low probability of improved efficacy with any biomarker signature. New drugs will enter as those that have undergone testing are graduated or dropped.

I-SPY 2 has two control arms corresponding to the standard of care, with an additional drug used for patients with human epidermal growth factor receptor 2 (HER2). Additional arms test five experimental drugs added to the standard therapy. Three biomarkers are used to categorize patients into fourteen subgroups, in which they are randomized to one of the treatments. Adaptive randomization is used to alter the biomarker-stratified randomization ratios between the arms in response to observed outcomes. If the Bayesian predictive probability of success in a phase III trial is high, that drug is graduated from the study to a confirmatory trial. If the predictive probability of success is low, it is dropped from the study. In late 2013 and early 2014, I-SPY 2 graduated two test treatments, for which phase III trials are being prepared (*Cancer Discovery*).
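A Bayesian predictive probability of success averages the chance that a future trial succeeds over the current uncertainty about the true response rates. The Monte Carlo sketch below illustrates the concept only and is not I-SPY 2's actual model: it assumes a binary response, independent uniform Beta(1, 1) priors, and a hypothetical future two-arm trial analyzed with a one-sided two-proportion z-test. The function name, future trial size, and interim data are all assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def predictive_probability_of_success(
    resp_t, n_t, resp_c, n_c, future_n_per_arm=150, alpha=0.025,
    n_sims=2000, seed=0,
):
    """Estimate the probability that a future two-arm trial is significant,
    given interim responders/patients on treatment (t) and control (c)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    wins = 0
    for _ in range(n_sims):
        # Draw plausible true response rates from the Beta posteriors.
        p_t = rng.betavariate(1 + resp_t, 1 + n_t - resp_t)
        p_c = rng.betavariate(1 + resp_c, 1 + n_c - resp_c)
        # Simulate the future trial under those rates.
        x_t = sum(rng.random() < p_t for _ in range(future_n_per_arm))
        x_c = sum(rng.random() < p_c for _ in range(future_n_per_arm))
        pooled = (x_t + x_c) / (2 * future_n_per_arm)
        se = sqrt(2 * pooled * (1 - pooled) / future_n_per_arm)
        if se > 0 and (x_t - x_c) / future_n_per_arm / se > z_crit:
            wins += 1
    return wins / n_sims


# Hypothetical interim data: 30/50 responders on the test arm vs 10/50 on control.
print(predictive_probability_of_success(30, 50, 10, 50, n_sims=500))
```

A trial following the graduation and dropping rules described above would compare such a probability against pre-specified upper and lower thresholds within each biomarker signature.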

### Conclusion

It is evident that there is a diverse set of biomarker-guided designs, each with different strengths and weaknesses. For example, while biomarker-strategy designs tend to be statistically inefficient, they may in some cases be the only ethical option; additionally, while enrichment designs offer efficiency benefits over stratified designs in terms of sample size, they do so at the expense of enrollment speed. Adaptive designs add another layer of flexibility (and complexity) on top of the basic biomarker-guided designs, allowing for alterations and decisions to be made at interim analyses. Finally, real-world examples of biomarker-guided trials combine aspects of multiple designs and adaptations, with mixed degrees of success.