Published in The British Medical Journal - 27th August 2020

In many non-randomised experiments, researchers are interested in assessing how change in health status is associated with a covariate of interest. While there is much guidance available on assessing change in randomised experiments, and extensive discussion with respect to efficiency and bias,1–9 the guidance in non-randomised studies is less clear. The principle difference is that in observational studies we do not expect balance between different levels of an exposure at baseline, in addition to expecting imbalance in other confounding factors. Glymour *et al* advocate the use of, Simple Analysis of Change Scores (SACS) without baseline adjustment to achieve unbiased causal effect estimates using causal arguments presented through Directed Acyclic Graphs.10 They briefly suggest that in settings with floor and/or ceiling effects, that standard change analyses with and without baseline adjustment are both biased, and non-standard analyses based on Tobit models (censored regression) may ameliorate floor and ceiling problems. The degree to which Tobit models ameliorate the problems caused by floor and ceiling effects is unclear. Some authors suggest that using percentage change is one strategy to avoid dealing with floor and ceiling effects, but Twisk highlighted that this simply represents a linear transformation of change,11 and therefore, does not deal with the problem of floor and ceiling effects. Twisk and Rijmen also describe the use of a longitudinal (multilevel) Tobit regression model to appropriately account for floor and ceiling effects in studies with repeated measures.12 However, since its publication in 2009, there have only be a handful of analyses that use multilevel Tobit models (MLTMs),13–16 suggesting that lack of familiarity with these methods or understanding of when they can and should be applied has deterred analysts in their use, or when they can be applied.

MLTMs are now incorporated in mainstream statistical software packages, such as Stata version 15. Given their accessibility, they could arguably be used more frequently than they are. This is relevant considering that the use of measurement instruments with floor and ceiling effects are omnipresent in health-related research. Examples include outcomes in health-related quality of life (eg, EQ-5D, SF-36 and SF-12), psychological well-being (eg, Hospital Anxiety and Depression Scale, Edinburgh Postnatal Depression Scale) and disease-specific measures of well-being (eg, Western Ontario and McMaster Universities Osteoarthritis Index and Oxford Hip Score (OHS) as used in patients with osteoarthritis (OA)). Despite this, there is very little guidance available with respect to the consequences of using measurement instruments with floor or ceiling effects, when attempting to make inferences about the effect of an exposure on the change (between two time points) of an outcome of interest.

In this paper, we use a Monte Carlo (MC) simulation study to compare the performance of multilevel linear and Tobit models, ordinarily least squares (OLS) regression and single-level Tobit regression, with and without adjustment for baseline scores, in the analysis of change in three different non-randomised experiments and a randomised experiment. We also demonstrate the use of these models using real world data from a large national joint replacement register.

We motivate the simulation and exemplar data analysis using an example from joint replacement research describing the association between body mass index (BMI) and the change in a disease specific patient-reported outcome measure (PROM), the OHS. The issue is contentious in the UK17–19 and USA20 as some organisations suggest restricting joint replacement to patients based on their BMI, citing an increased risk of revision surgery and lack of efficacy of surgery. The small increase in absolute risk of revision in obese patients, must be balanced against the other benefits of joint replacement, including a reduction in pain and improved physical functioning. Therefore, it is of interest to clinicians, policy-makers and patients to know the relative effect of obesity on the efficacy of total hip replacement (THR) compared with ‘normal weight patients’.

We investigated the performance of four different methods of analysis, when estimating the effect of an exposure (BMI) on change in response (PROM) before and after THR with floor and ceiling effects using the Aims, Data Generating Process (DGP), Methods, Estimand, Performance approach recommended by Morris *et al*.21

We simulated longitudinal data of ‘well-being’ before and after surgery. We assume that ‘well-being’ is a latent, truly continuous and stable construct which is measured imperfectly by the OHS. Measurement error and floor/ceiling effects are then added to the latent construct to illustrate their consequences.

We assume the response, well-being, is a latent construct

measured at the *i ^{th}* occasion, where

is mean-centred BMI categories according to WHO criteria, that is, −2= BMI <18.5 kg/m^{2} (under weight), −1= 18.5<BMI≤25 kg/m^{2} (normal), 0 = 25<BMI≤30 kg/m^{2} (overweight), 1 = 30<BMI≤35 kg/m^{2} (obese), and 2= BMI >35 kg/m^{2} (morbidly obese), i.e.

= 0 is a patient with a BMI classed as overweight.

where

is the time at which measurement i was taken on individual j , coded as 0 at pre- and 1 post-surgery.

is the baseline population average response for a patient with average BMI, and

represents the *j ^{th}* individual difference from the baseline response. The sum of

is the individual baseline response for a patient with average BMI.

represents the population average change per unit increase in time for a patient with average BMI, and

represents the *j ^{th}* individual difference from the population average change per unit increase in time. The sum

is the individual average change per unit increase in time for a patient with average BMI.

represents the effect of a 1-unit increase in the exposure (

) of interest (BMI) pre-surgery and

represents the effect of a 1-unit increase in BMI (

) on the pre- post-surgery change in well-being (

). The variance in individual deviations from the population average response at baseline and the average rate of change are

and

, respectively. The covariance between baseline measurements and rate of change is characterised by

(with correlation

).

Under the assumption of linear change, data were simulated from a multilevel model (MLM) with a random intercept and slope, see figure 1 for an illustration of a patient trajectory with an average BMI.

The observed response without floor and ceiling effects (

) is simulated by adding measurement error in the linear trajectory,

, to the latent response, where

.

A response with floor and ceiling effects

is simulated by restricting the response to lie between 0 and 48.

See figure 2 for a graphical illustration of the trajectory generation: we first simulate

, then add some measurement error (

) to yield an observed response (

), and finally add floor and ceiling effects to obtain the observed truncated response (

).

We compared four DGPs to illustrate a range of scenarios by manipulating

,

and

to influence the association between pre-surgery and post-surgery outcomes.

,

,

,

, and

were fixed at 10, 40, 10, 15 and 3, respectively. DGP 1 is a null model, where there is a baseline effect of the exposure is

=−3, but the exposure did not influence change over time (

=0), and there is no correlation between baseline values and subsequent change (

=0).

DGP 2 replicates a simple randomised trial where there is no difference between levels of the exposure at baseline (

=0), but the exposure did influence change over time (

=−3), and there is no correlation between baseline values and subsequent change (

=0). DGP 3 and DGP 4 replicate a cohort study, where there is a difference between levels of the exposure at baseline (

=−3), and the exposure also influenced change over time (

=−3). DGP 3 specified no correlation between baseline values and subsequent change (

=0), whereas DGP 4 specified a negative correlation between baseline values and change (

=−0.5), reflecting the fact the joint replacement surgery has the tendency to normalise an individuals well-being, see figure 3 for an illustration of the associated trajectories.

We conducted an MC simulation with 1000 replicated datasets, each with 10 000 patients. A balanced dataset, that is, three data points for each individual, was simulated to ensure identification of the linear and Tobit MLMs occurred, that is, two data points allow estimation of baseline and change parameters but not measurement error. The middle data point was then dropped to replicate a pre- post- study design.

For data sets with three measurement occasions, a linear MLM and an MLTM that reflects the DGP were fitted to the data, see equation 1.

In datasets with two measurement occasions, that is, a pre- post- study design, single-level OLS and Tobit models were fitted to the data. Tobit models were only used when floor and ceiling effects had been simulated. Three different models were explored:

1. A simple model for post-surgery well-being.

2. A SACS.

3. A model for change adjusted for baseline, that is, baseline adjusted analysis of covariance (ANCOVA). This model is equivalent to a model for the post-score adjusted for baseline ANCOVA, with the exception of the interpretation of the intercept.

In addition, an underidentified MLTM, equivalent to equation 1with constrained error variance

was fitted in the spirit of a sensitivity analysis, where

was constrained to a value from 5, 10, 15, 20, 25 and 30.

The estimand of interest is the population average effect of the interaction between the exposure and change in slope, that is,

the pre- post-surgery change in well-being. We test whether the exposure modifies the improvement post-surgery (ie, the null hypothesis that

).

The performance of each method was explored in terms of bias, coverage, empirical SE, model-based SE, mean square error, relative error and relative precision.

Using data from the National Joint Registry (NJR), we investigated the association between BMI and a PROM, the OHS, in patients undergoing elective THR between 1 April 2003 and 22 February 2017.

The NJR commenced data collection in April 2003; at inception it was mandatory for all THRs conducted in the private sector to be entered into the NJR, and from 2011 all THR procedures in the public and private sector were required to be entered into the NJR. A recent national audit of data entered into the NJR between 2014 and 2015 estimated data capture of 95% for primary THR and 91% for revision THR.

All consenting patients undergoing THR were eligible to be included in the analysis. Patients were included if their patient history was unique and consistent, that is, contained no duplicates, revision prior to primary, or currently held in query by the submitting unit. Due to the requirement for reliable date information, patients who were indicated to have died prior to undergoing a procedure, were more than 110 years of age, had undergone a procedure prior to their date of birth, or received a procedure prior to 2003 were excluded from the analysis. Only primary THRs, where the primary indication for operation was OA with unique prosthesis combinations were included in the analysis. All THRs with metal-on-metal bearing combinations were excluded from the analysis due to the exceptionally high failure rate in this group.22 23 Patients who were less than 50 years of age at the date of the index THR were also excluded, due to the high likelihood that these cases are due to OA secondary to other pathology.

See online supplementary figure 1 for a detailed breakdown of inclusion criteria.

The primary exposure of interest in this study is BMI. BMI was introduced into the second ‘Minimal Data Set’ in 2004. Patients with BMI between 10 and 60 were included in the analysis. BMI measures were excluded as implausible if height and weight measures were less than 130 cm and weight less than 30 kg, respectively. See online supplementary figure 2.

The primary outcome of interest in this study is change in OHS after surgery. Linked National PROMs were first available in 2009, see online supplementary figure 3 for details of linkage.

Preoperative confounding factors were thematically organised into groups: (1) Patient factors included sex, American Society of Anesthesiologists grade and operation funder; (2) Operation factors included fixation, approach, patient position during surgery, anaesthetic type, thromboprophylaxis regimen, bearing and year of primary THR; (3) The setting of the treatment episode (ie, private or National Health Service hospital); (4) Consultant-based factors included the training status of the primary surgeon performing the operation and (5) Deprivation factors were based on the English indices of multiple deprivation (an area-based index of deprivation).

Means, SD and IQR points were used to describe continuous variables. Frequencies and percentages were used to describe categorical variables.

The association between change in PROMS score was investigated using the same single-level methods and the ML Tobit model with constrained error variances described in the simulation study as an exemplar. In addition, we conducted more comprehensive analyses using restricted cubic splines (RCS) to model the BMI association in the ML Tobit model with constrained error variance, single-level linear and Tobit SACS, ANCOVA and post-score models. In the ML Tobit model, BMI was modelled with RCS at baseline and its interaction with time. Correspondingly, we adjusted OHS for patient and deprivation confounding factors at baseline and operation, setting and confounding factors with an interaction with time, that is, operative factors and settings influence the change in outcome but not the baseline response. In single-level models, the effect of BMI was modelled using RCS and adjusted for confounding factors using standard regression approaches.

Due to the method of data collection in the national PROMS programme, item non-response is masked. De facto mean imputation of up to two missing items in the OHS occurred automatically. In addition, despite valid values appearing with individual OHS items, if the questionnaire was marked as ‘not complete’, implausible overall scores were obtained. For simplicity, only patients with complete pre-operative and post-operative PROMS were used in the analysis. BMI is missing in a substantial proportion of the cohort. Patients prior to 2004 did not have BMI recorded, and the proportion of patients with missing BMI in 2004 is large. In 2009, ~40% of patients did not have BMI recorded; this reduced year on year and in 2016 was ~18% of eligible patients.

For pedagogical simplicity, we use complete-case analyses throughout.

Patient representatives sit on the committee structure of the NJR. The research priorities of the NJR are identified by this committee structure and approved by the patient representatives. Patients were not involved in the setting of the research question or the outcome measures, nor were they involved in designing or implementing this work or interpretation of the results. We are unable to disseminate results of this study directly to study participants due to the anonymous nature of the data. We plan to disseminate our findings to the NJR, via their communications team, to relevant individuals with regard to the provision of joint replacement and to the general population through the local and national press.

The results of the simulation study clearly illustrate that, in the presence of floor and ceiling effects, neither baseline adjustment, or SACS will yield unbiased estimates of the effect of an exposure on the outcome of interest. Single-level modifications to account for floor and ceiling effects such as the Tobit model only work in the context of a randomised trial, that is, when there is no difference between baseline values by BMI. Importantly, single-level methods, OLS and Tobit models, induce significant bias, with negligible coverage, when

, that is, there is no change in the pre- post-surgergy well being by BMI. Fully identified MLTM with three measurement occasions, return unbiased estimates with coverage close to advertised levels. In pre- post-designs with two measurement occasions ML Tobit models, with constrained level 1 variances, return estimates very close to being unbiased, but coverage is less than advertised indicating bias in the model based SEs.

The simulation study is consistent with a lay intuition with respect to analyses of floor and ceiling effects. Assuming we accept that either the MLM and OLS change analyses are appropriate in the absence of floor and ceiling effects, DGP 1 illustrates that when there is no effect of obesity on the efficacy of surgery, the addition of an artificial ceiling compresses the gain of individuals towards the top of the distribution. Due to the baseline association between obesity and well-being, underweight individuals tend to have gains that are more compressed compared with obese individuals. This inevitably induces bias, and provides evidence of a change in presurgery and post-surgery well-being by BMI, where none actually exist. Similarly, in DGP 2 (no baseline differences) where there is truly an interaction effect, will also lead to biased estimates. The DGP used in the simulation assumes underweight individuals benefit more from surgery than heavier individuals, which results in a fanning out of the trajectories. Underweight individuals have truly greater gains than obese individuals, but these gains are underestimated due to the ceiling effect, resulting in bias towards the null. In DGPs 3 and 4 (baseline differences in BMI, and interaction between BMI and change), we see a more extreme pattern of results compared with DGP 2, but overall consistency with the expected response of compressing individual gains which have initially higher starting values.

In the exemplar analysis of NJR data, the pattern of results is very similar to that of DGP 1 of the simulation, suggesting that results of the simulation are likely to be replicated in real-world datasets. The more comprehensive analysis of the NJR data, using RCS to reflect the continuous nature of BMI, aptly illustrate where the effects from misspecified single-level models are arising from. The ML Tobit model illustrates a strong negative association between BMI and pre-operative OHS, and failing to account for these baseline differences appropriately when attempting to estimate change leads to variation at baseline being incorporated in the estimate of change. Furthermore, the ability to adjust both baseline and post-surgical OHS for their pronounced floor and ceiling effects, respectively, leads to unbiased estimates of the effect of interest. Unfortunately, due to the constraints on the level 1 variance, interpretation of the random effects are difficult, as they depend on the magnitude of the variance applied in the constraint, see online supplementary figure 16. However, the models clearly illustrate that change in PROMS following THR do not depend on BMI, and surgery appears to be effective for patients regardless of their BMI.

Floors and ceilings in PROM instruments have somewhat predictable effects on estimated coefficients from standard OLS models that do not adjust for floor or ceiling effects, assuming the true underlying association is known. As this is rarely the case, it is important to consider a variety of different DGP to explore the likely impact on an analysis. It is important to consider the validity of the assumptions underpinning the Tobit model, that is, that the latent response is truly continuous and that there is a true ceiling just beyond the range of the measurement being used.

Single-level Tobit models do not ameliorate floor and ceiling effects in SACS. However, ML Tobit models appear to recover the effects of interest under specific assumptions. The analyses of pre- post-designs require further constraints to ensure models are fully identified. The difference between analytical approaches can profoundly alter the interpretation of the model parameters, and this may have serious consequences if used to generate policy inappropriately. For example, inappropriate analyses that fail to consider DGP appropriately may lead to the restriction of joint replacement for overweight or obese patients.

When designing a study to investigate the effect of an exposure on change in health status, it would be preferable to use a measurement instrument that does not have floor or ceiling effects as inference is less complicated, and design trumps analysis in most scenarios. If the use of measurement instrument with floor and ceiling effects is unavoidable, it is preferable to collect data at three time points which ensure models are fully identified, alleviating the need to constrain level 1 variance in order to identify models, again design trumps analysis. If retrospective analysis of pre- and post-data sets are required, it appears that using ML Tobit model with constrained level 1 error variance would be preferable to single-level approaches.

Broadly speaking the analyses of this simulation are in agreement with the work of Glymour *et al*, that analysis of change and its interaction with an exposure at baseline, should not be adjusted for baseline measurements in observational data. The presence of floor and ceiling effects in data requires additional assumptions which makes things marginally more complex.