rate ($p_u$) and another intermediate ($p_7$) for a good prognosis success rate ($p_3$). Figure
4.3 (b) shows that the HLRM fails to accurately estimate the swapped probabilities.
The fixed grouping (by covariate) is inappropriate for these subtypes, leading to poor
estimates of the corresponding success rates. Scenario S3 favors the NEPPM. The
grouping is exact but the monotonicity assumed by the HLRM is violated. Figure 4.3
(c) shows how the NEPPM outperforms the HLRM in this scenario. The success rate
estimates under the NEPPM have favorable bias and MSE. The HLRM performs
poorly in terms of CP even for the intermediate prognosis success rates in subtypes
with large sample sizes $N_i$. In S4 the grouping by prognosis is inappropriate and
the monotonicity assumption of the HLRM is violated. As Figure 4.3 (d) shows, the
NEPPM performs better than the HLRM in this scenario.
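For reference, the comparison summaries (bias, MSE, and CP of the success rate estimates) are Monte Carlo averages over repeated simulated trials. The following sketch shows how such summaries could be computed for one subtype, assuming that the posterior mean and the endpoints of a 95% credible interval from each simulated trial are already available; the function name and input layout are hypothetical and not part of the original design.

```python
import numpy as np

def estimation_summaries(post_means, ci_low, ci_high, p_true):
    """Monte Carlo bias, MSE, and coverage probability (CP) for one subtype.

    post_means, ci_low, ci_high: arrays with one entry per simulated trial,
    holding the posterior mean of the success rate and the endpoints of a
    95% credible interval for it.  p_true is the true success rate used to
    generate the data.
    """
    post_means = np.asarray(post_means, dtype=float)
    bias = np.mean(post_means - p_true)               # average estimation error
    mse = np.mean((post_means - p_true) ** 2)         # mean squared error
    covered = (np.asarray(ci_low) <= p_true) & (p_true <= np.asarray(ci_high))
    cp = np.mean(covered)                             # coverage probability
    return bias, mse, cp
```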
4.4.2 Average Sample Size and Stopping Probabilities
We continue the comparison of NEPPM versus HLRM. We now focus on summaries
that are relevant for early stopping for futility. We will accrue patients by cohorts of
10. We will stop recruiting patients for sarcoma subtype $i$ and cancel the $i$-th study
arm if
\[
\Pr\left[\, p_i > 0.175 \mid \text{data so far} \,\right] < 0.10. \tag{4.11}
\]
Already accrued data for canceled study arms will continue to be used in the inference
for the other sarcoma subtypes, i.e., it remains part of the data set.
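Under both models the posterior probability in (4.11) is evaluated by MCMC over all subtypes jointly. As a self-contained illustration only, the sketch below evaluates the same rule for a single arm under a stand-in conjugate Beta-Binomial model, for which the probability has a closed form; the prior parameters and the function name are hypothetical.

```python
from scipy.stats import beta

def stop_for_futility(successes, n, a=1.0, b=1.0, target=0.175, threshold=0.10):
    """Illustrative check of rule (4.11) for a single study arm.

    With a Beta(a, b) prior and `successes` out of `n` patients, the posterior
    of the success rate is Beta(a + successes, b + n - successes), so
    Pr(p_i > target | data so far) is the posterior survival function at
    `target`.  The arm is cancelled when this probability falls below
    `threshold`.
    """
    prob = beta.sf(target, a + successes, b + n - successes)
    return prob < threshold
```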
For both models we continue to use the same hyperparameters as in the previous
section. The (maximum) sample sizes $N_i$, $i = 1, \ldots, n$, remain as before. The best
model should accrue the smallest number of patients in study arms for which the
treatment is ineffective, that is, when the success rate $p_i$ is less than 0.175.
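To make this criterion concrete, the average sample size and the early-stopping probability can be approximated by simulating many trials per arm and recording, for each simulated trial, the final accrual and whether rule (4.11) triggered cancellation. A minimal per-arm sketch, again using the hypothetical Beta-Binomial stand-in rather than the NEPPM or HLRM posteriors (which borrow strength across subtypes via MCMC):

```python
import numpy as np
from scipy.stats import beta

def arm_operating_chars(p_true, max_n, cohort=10, a=1.0, b=1.0,
                        target=0.175, threshold=0.10,
                        n_sims=2000, seed=0):
    """Average sample size and early-stopping probability for one arm.

    Patients accrue in cohorts; after each cohort the futility rule (4.11) is
    evaluated under a stand-in Beta-Binomial posterior.  Averaging the final
    sample size and the stopping indicator over `n_sims` simulated trials
    approximates the two operating characteristics discussed in this section.
    """
    rng = np.random.default_rng(seed)
    sizes, stopped = [], []
    for _ in range(n_sims):
        successes, n, early = 0, 0, False
        while n < max_n:
            m = min(cohort, max_n - n)            # last cohort may be smaller
            successes += rng.binomial(m, p_true)
            n += m
            # Pr(p_i > target | data so far) under the Beta posterior
            if beta.sf(target, a + successes, b + n - successes) < threshold:
                early = True
                break
        sizes.append(n)
        stopped.append(early)
    return float(np.mean(sizes)), float(np.mean(stopped))
```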
Results are summarized in Figure 4.4. Each panel reports the average number of
patients in each study arm ($N_i$) and, for each study arm, the probability of early
stopping ($p_i$). Like the bias and MSE reported in the earlier comparison, both
summaries, $N_i$ and $p_i$, are with respect to repeated experimentation, i.e., they are an expectation