wlsd: Wrangling Longitudinal Survival Data

Charles Ingulli

Introduction

The wlsd (wrangling longitudinal survival data) package supports processing of repeated observation data for model building in survival analysis. A longitudinal study, or panel study, is a repeated observation method in which subjects are monitored over time (Hsiao 2003). This type of study sees application in analyzing disease progression, such as heart deterioration after transplant (Jackson 2024). Preparing raw data for survival analysis models requires knowing the format each model expects, as well as the time and ability to transform the data. The functions in this package automate some of these preparation steps to simplify formatting and reduce the time it takes.

Two common longitudinal data formats are the long format and the counting process format. The first table below shows a longitudinal data set in long format.

Long Format Data
id time event var1 var2
 1    0     0 10.4   10
 1   31     0 11.3   10
 1   64     0 12.7   10
 1   96     1 17.5   10
 2    0     0  1.2   25
 2   33     0  5.9   25
 2   59     1  4.4   25
 3    0     0 10.6   16
 3   28     1  8.0   16

Information in the table above is for three subjects with identifiers denoted by id. Each row corresponds to a particular observation for a subject, with time recording when the observation occurred. Here, each subject starts at time 0 and every subsequent value represents the number of time units since time 0. There are 3 additional variables: event and var1 vary with time, while var2 is constant over time. This format is easily legible and can be quickly filtered. Using the first row as an example for interpretation, at time 0 subject 1 had an event value of 0, a var1 value of 10.4, and a var2 value of 10.

The next table shows the same data in counting process format.

Counting Process Format
id time1 time2 event var1 var2
 1     0    31     0 10.4   10
 1    31    64     0 11.3   10
 1    64    96     1 12.7   10
 2     0    33     0  1.2   25
 2    33    59     1  5.9   25
 3     0    28     1 10.6   16

The biggest difference in this format is that time is defined column-wise instead of row-wise: there are two columns, time1 and time2, giving the start and stop times of an interval. Mathematically, the interval is open on the left and closed on the right. That is, \((time1, time2] = \{x \mid time1 < x \leq time2\}\). Consequently, multiple time intervals for a particular subject will not overlap. Any time related covariates are interpreted over the interval, while any status indicator variables are interpreted at the end of the interval. Looking at the first row in the table above, event and var1 are the two time dependent variables, with event being a status indicator and var1 a covariate. The first value of var1 is 10.4, so subject 1 has this value over the interval from time 0 to time 31. Since the value of event is 0, this subject does not have an event at the end of this interval. Looking at the second interval for subject 1, the value of var1 is 11.3 over the interval from time 31 to time 64, and the subject again does not have an event at the end of the interval. Subject 1 has an event at time 96, the end of the third interval.
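The half-open interval convention can be made concrete with a short check. The snippet below is plain Python, purely for illustration; it is not part of wlsd, which is an R package.

```python
# A time x falls in the interval (time1, time2] when time1 < x <= time2,
# so a shared end point belongs to the earlier interval only and
# adjacent intervals never overlap.
def in_interval(x, time1, time2):
    return time1 < x <= time2

print(in_interval(31, 0, 31))   # True: 31 is the closed right end point
print(in_interval(31, 31, 64))  # False: 31 is the open left end point
```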

Note that some information is lost in transitioning from the long format to the counting process format: the interval formulation means that the first status indicator value and the last covariate value are omitted. For subject 1, these are the first event value of 0 at time 0 and the last var1 value of 17.5 at time 96. A detailed description of the counting process format and its use in survival analysis is given by Collett (2015). Therneau, Crowson, and Atkinson (2020) also describe the counting process format with examples of how it is leveraged in survival analysis.

Data Transitions

The figure below depicts data formats with possible transitions in the package.

Format transitions for wlsd.

The solid arrows represent explicitly defined transitions that each have their own function, while the single dashed arrow represents a transition that is possible but does not have a dedicated function. This transition is discussed further in the “Aggregating Longitudinal Data to Count Format” section below.

Long Format to Counting Process Format

The long2cp() function transitions data from long format to counting process format. It requires arguments for data, id, time, and status. The data argument takes the data.frame object to transition, while id names the subject identifier column. The time argument names the single time column to be split into an interval across two columns, and the status argument names any status indicator variables to be transitioned accordingly. The value of a status indicator variable is assumed to be associated with the end point of the interval, so any column names supplied to the status argument are transitioned under this assumption; as a result, the first value is dropped for each subject. All other columns are processed such that the value is associated with the first time point of the interval, so the last value for each subject is dropped. For constant columns this is irrelevant, but it is important to note for any time dependent covariates not named in the status argument.

The code below shows an example using the long format data from earlier.

long2cp(data = long_data, id = "id", time = "time", status = "event")
#>   id time1 time2 event var1 var2
#> 1  1     0    31     0 10.4   10
#> 2  1    31    64     0 11.3   10
#> 3  1    64    96     1 12.7   10
#> 4  2     0    33     0  1.2   25
#> 5  2    33    59     1  5.9   25
#> 6  3     0    28     1 10.6   16

Looking at the output above for the subject with id of 1, because the event column was supplied to the status argument, the values associated with the end points of the intervals are kept for that column. All other time dependent covariates keep the values associated with the starting points; in this case, that is the var1 column. Using row 3 as an example, subject 1 experienced the event at time 96, which is kept as the value for event. However, their last recorded value for var1 was 12.7 at time 64, so 12.7 is used over the interval from time 64 to time 96.
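The value-shifting performed here can also be sketched independently of the package. The following minimal Python illustration (hypothetical code, not part of wlsd, which operates on R data.frames) reproduces the shift for a single subject: status indicators take the value at the end of each interval, other covariates the value at the start.

```python
# Minimal sketch (not wlsd itself): convert one subject's long-format
# records to counting process intervals. Status indicators take the value
# at the *end* of each interval; other covariates take the value at the
# *start*. The first status value and last covariate value are dropped.
def long_to_cp(records, status_keys):
    intervals = []
    for start, stop in zip(records, records[1:]):
        row = {"time1": start["time"], "time2": stop["time"]}
        for key in start:
            if key == "time":
                continue
            row[key] = stop[key] if key in status_keys else start[key]
        intervals.append(row)
    return intervals

subject1 = [
    {"time": 0,  "event": 0, "var1": 10.4, "var2": 10},
    {"time": 31, "event": 0, "var1": 11.3, "var2": 10},
    {"time": 64, "event": 0, "var1": 12.7, "var2": 10},
    {"time": 96, "event": 1, "var1": 17.5, "var2": 10},
]
for row in long_to_cp(subject1, status_keys={"event"}):
    print(row)
```

The printed intervals match subject 1's rows in the counting process table above.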

By default, the function keeps all subjects through the drop = FALSE argument. However, it may be desirable to drop individuals that have only 1 row of data in long format, as they do not have enough information to form a time interval. Consider the dropped_long data set below, which is missing the 2nd observation for id 3. This subject has only one time point, so a start and stop time interval is not feasible. To preserve the subjects input to long2cp(), the subject is kept in the output data set with a time interval that starts and stops at the one given time point, and all covariates are associated with that one point.

dropped_long <- long_data[1:8,]
long2cp(dropped_long, id = "id", time = "time", status = "event", drop = FALSE)
#>   id time1 time2 event var1 var2
#> 1  1     0    31     0 10.4   10
#> 2  1    31    64     0 11.3   10
#> 3  1    64    96     1 12.7   10
#> 4  2     0    33     0  1.2   25
#> 5  2    33    59     1  5.9   25
#> 6  3     0     0     0 10.6   16

If the subject is intended to be dropped for analysis purposes, then the drop = TRUE argument can be used to exclude these rows from the output data set.

long2cp(dropped_long, id = "id", time = "time", status = "event", drop = TRUE)
#>   id time1 time2 event var1 var2
#> 1  1     0    31     0 10.4   10
#> 2  1    31    64     0 11.3   10
#> 3  1    64    96     1 12.7   10
#> 4  2     0    33     0  1.2   25
#> 5  2    33    59     1  5.9   25

The single row for id value 3 is missing in the output above.

Counting Process Format to Long Format

The transition from counting process format to long format is done through the cp2long() function. Its arguments are similar to those of the previous function, but it takes two time columns as input instead of one so that the two end points of each interval can be consolidated into a single column.

cp2long(data = cp_data, id = "id", time1 = "time1", time2 = "time2", status = "event")
#>   id time event var1 var2
#> 1  1    0    NA 10.4   10
#> 2  1   31     0 11.3   10
#> 3  1   64     0 12.7   10
#> 4  1   96     1   NA   NA
#> 5  2    0    NA  1.2   25
#> 6  2   33     0  5.9   25
#> 7  2   59     1   NA   NA
#> 8  3    0    NA 10.6   16
#> 9  3   28     1   NA   NA

Note that the output contains a few NA values that differ from the example counting process table. This is due to the missing information about the first-time covariates mentioned earlier. By default, the function makes no assumptions about this data and simply leaves NA for any unknown values. Any column names supplied to the status argument will therefore be missing the first value for each id: these values are assumed to be associated with the end point of each interval, so there is no value for the starting point of the first interval. Similarly, any other time dependent columns will be missing the last value for each id, as they are associated with the start point of each interval, so there is no value for the end point of the last interval. If desired, the fill = TRUE option may be used to insert values for any constant columns that would otherwise be left NA for this reason.

cp2long(cp_data, id = "id", time1 = "time1", time2 = "time2", status = "event", fill = TRUE)
#>   id time event var1 var2
#> 1  1    0    NA 10.4   10
#> 2  1   31     0 11.3   10
#> 3  1   64     0 12.7   10
#> 4  1   96     1   NA   10
#> 5  2    0    NA  1.2   25
#> 6  2   33     0  5.9   25
#> 7  2   59     1   NA   25
#> 8  3    0    NA 10.6   16
#> 9  3   28     1   NA   16

Here, the var2 column was constant within each subject, so its NA values were populated with the constant value.
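The reverse expansion, including the fill behavior for constant columns, can be sketched the same way. As before, this is hypothetical Python for illustration only, not part of wlsd.

```python
# Minimal sketch (not wlsd itself): expand counting process intervals back
# to long format for one subject. Status values attach to interval end
# points and other covariates to start points, so the first status value
# and the last covariate values are unknown (None). With fill=True,
# columns whose observed values are constant are repopulated.
def cp_to_long(intervals, status_keys, fill=False):
    covar_keys = [k for k in intervals[0]
                  if k not in ("time1", "time2") and k not in status_keys]
    rows = []
    for i, iv in enumerate(intervals):
        prev = intervals[i - 1] if i > 0 else None
        rows.append({"time": iv["time1"],
                     **{k: (prev[k] if prev else None) for k in status_keys},
                     **{k: iv[k] for k in covar_keys}})
    last = intervals[-1]
    rows.append({"time": last["time2"],
                 **{k: last[k] for k in status_keys},
                 **{k: None for k in covar_keys}})
    if fill:
        for k in covar_keys:
            seen = {r[k] for r in rows if r[k] is not None}
            if len(seen) == 1:  # constant column: safe to fill
                v, = seen
                for r in rows:
                    if r[k] is None:
                        r[k] = v
    return rows

cp1 = [
    {"time1": 0,  "time2": 31, "event": 0, "var1": 10.4, "var2": 10},
    {"time1": 31, "time2": 64, "event": 0, "var1": 11.3, "var2": 10},
    {"time1": 64, "time2": 96, "event": 1, "var1": 12.7, "var2": 10},
]
for row in cp_to_long(cp1, {"event"}, fill=True):
    print(row)
```

The constant var2 is filled at time 96, while the time-varying var1 stays unknown there, mirroring the cp2long() output above.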

Aggregating Longitudinal Data to Count Format

It may be useful to aggregate longitudinal data into count format with tallies of repeated observations. Transitions to count format are done through the long2count() function. While originally conceived for the long format to count format transition, the same logic applies to the counting process format, so this function can be used for both transitions. The major difference between the long format and the counting process format is how time is organized, so covariates are aggregated row-wise in the same way; the aggregated values may nonetheless differ across formats due to the one-value loss of information from long format to counting process format.

The long2count() function requires a data set and an identifier column to perform aggregation. It also requires that a value be supplied to either the event or the state argument, which determines how events of interest are treated under the observation scheme of the data set. The event argument treats the given column or columns as numeric counts of occurrences, while the state argument treats the column largely as a factor with different levels.

Consider the first example below with the long format data and the single event column called event.

long2count(long_data, id = "id", event = "event")
#>   id     time      var1 var2 event.counts count.weight
#> 1  1 47.75000 12.975000   10            1            4
#> 2  2 30.66667  3.833333   25            1            3
#> 3  3 14.00000  9.300000   16            1            2

Since the column supplied to event holds counts of occurrences, the output column is a sum over rows. Any columns that are constant within the id grouping are carried over to the output as is, while any that are time varying (change row-wise) are aggregated into the final count format output.

The default aggregation function for any time varying covariate is mean(), but this can be changed through the FUN argument. For example, setting FUN = max takes the maximum value over time and uses that as the value in the aggregated count data set.

long2count(long_data, id = "id", event = "event", FUN = max)
#>   id time var1 var2 event.counts count.weight
#> 1  1   96 17.5   10            1            4
#> 2  2   59  5.9   25            1            3
#> 3  3   28 10.6   16            1            2

For multiple events, a vector of event column names can be supplied to the event argument.

long_data2 <- long_data
long_data2$event2 <- c(2,2,3,1,2,3,2,1,3)
long2count(long_data2, id = "id", event = c("event","event2"))
#>   id     time      var1 var2 event.counts event2.counts count.weight
#> 1  1 47.75000 12.975000   10            1             8            4
#> 2  2 30.66667  3.833333   25            1             7            3
#> 3  3 14.00000  9.300000   16            1             4            2

If the observed variable is to be treated as a state instead of an event, the state argument handles the aggregation accordingly. Consider the event column as a binary state in which “0” is coded as healthy and “1” as unhealthy. The state argument will instead count the number of occurrences of each state.

long2count(long_data, id = "id", state = "event")
#>   id     time event      var1 var2 event.counts count.weight
#> 1  1 47.75000     0 12.975000   10            3            4
#> 2  1 47.75000     1 12.975000   10            1            4
#> 3  2 30.66667     0  3.833333   25            2            3
#> 4  2 30.66667     1  3.833333   25            1            3
#> 5  3 14.00000     0  9.300000   16            1            2
#> 6  3 14.00000     1  9.300000   16            1            2

For the example above, subject 1 now has one row for each of the two levels of “event”. From the event.counts column, this subject has 3 occurrences of state 0 (healthy) and 1 occurrence of state 1 (unhealthy). This can be expanded to larger state spaces with more detailed descriptions.

A value must be supplied to either the event or the state argument. The event argument handles multiple columns as events, but the state argument accepts only one column to treat as a state. Combinations of the two are allowed, such as one state and one event, or one state and multiple events.
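The aggregation rules described in this section can be summarized in a small, package-independent sketch. The Python below is hypothetical illustration only; it is not the long2count() implementation.

```python
# Minimal sketch (not wlsd itself) of count-format aggregation for one
# subject: event columns are summed, other time-varying columns are
# reduced with an aggregation function (mean by default), constant
# columns are carried over, and count.weight records the number of rows.
def to_count(rows, event_keys, agg=lambda xs: sum(xs) / len(xs)):
    out = {}
    for key in rows[0]:
        values = [r[key] for r in rows]
        if key in event_keys:
            out[key + ".counts"] = sum(values)
        elif len(set(values)) == 1:      # constant within subject
            out[key] = values[0]
        else:                            # time varying: aggregate
            out[key] = agg(values)
    out["count.weight"] = len(rows)
    return out

subject1 = [
    {"time": 0,  "event": 0, "var1": 10.4, "var2": 10},
    {"time": 31, "event": 0, "var1": 11.3, "var2": 10},
    {"time": 64, "event": 0, "var1": 12.7, "var2": 10},
    {"time": 96, "event": 1, "var1": 17.5, "var2": 10},
]
print(to_count(subject1, event_keys={"event"}))
```

For subject 1 this reproduces the long2count() row above: time 47.75, var1 12.975, var2 10, event.counts 1, count.weight 4.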

Helper Functions

A few other functions in the wlsd package help with niche cases that may arise in practice. They are formalized versions of preparation steps for the LBP data set (discussed later). The three notable functions are basedate(), events2state(), and takefirst(). The first, basedate(), creates a new row for each individual to serve as their baseline or first visit, which can then be populated with new information. The events2state() function consolidates multiple columns of event observations into a single state column; that is, it aggregates multiple columns into a single factor-like column based on the combinations of values from the input columns. The takefirst() function keeps each subject’s rows up to and including the first row that meets a given criterion. These functions are discussed in greater detail in the “Applications” section below, where the example context clarifies their use.

Applications

Example: Low-Back Pain Data Wrangling

To demonstrate the use cases of this package, consider this example using a low-back pain (LBP) data set that needs to be prepared for fitting several models in a larger survival analysis. The data set originally comes from Garg et al. (2013) and was used in Ingulli (2020) for a comprehensive survival analysis, from which some of this example originates. See the appendix of Ingulli (2020) for a larger discussion on data preparation for this set using some early versions of this package.

The data is available within the package via LBP. It consists of 466 individuals with associated low back pain statuses over a 6 year observation period. The first 5 rows and the first 10 columns of data are shown below.

LBP[1:5,1:10]
#>   sid Baseline.date       Date time_to_row case.lbp case.med case.sc case.lt
#> 1   1    2004-07-01 2004-08-23          53        0        0       0       0
#> 2   1    2004-07-01 2004-09-30          91        0        0       0       0
#> 3   1    2004-07-01 2004-12-01         153        0        0       0       0
#> 4   1    2004-07-01 2005-01-12         195        0        0       0       0
#> 5   1    2004-07-01 2005-02-15         229        0        0       0       0
#>   gender  age
#> 1      F 28.1
#> 2      F 28.1
#> 3      F 28.1
#> 4      F 28.1
#> 5      F 28.1

The data set is organized in long format with subject column sid, primary time column time_to_row, and several time varying LBP variables including case.lbp, case.med, case.sc, and case.lt. These last 4 variables represent the binary occurrence of LBP, LBP requiring medication, LBP requiring medical care, and LBP resulting in lost time from work, respectively, at the time of observation. See the help file via help(LBP) for more information.

At the time of construction for this particular data set, the baseline data was not available. That is, the first observations were missing. For analysis purposes, the assumption was made that all individuals would be free from all LBP at this time point and the analysis would continue. That assumed information can be added to the data set through the basedate() function. In the code below, a new row is added for each individual that can then be populated with the new LBP values.

LBPwBaseline <- basedate(LBP, "sid")
LBPwBaseline[1:5,1:10]
#>   sid Baseline.date  Date time_to_row case.lbp case.med case.sc case.lt gender
#> 1   1    2004-07-01    NA          NA       NA       NA      NA      NA      F
#> 2   1    2004-07-01 12653          53        0        0       0       0      F
#> 3   1    2004-07-01 12691          91        0        0       0       0      F
#> 4   1    2004-07-01 12753         153        0        0       0       0      F
#> 5   1    2004-07-01 12795         195        0        0       0       0      F
#>    age
#> 1 28.1
#> 2 28.1
#> 3 28.1
#> 4 28.1
#> 5 28.1

As baseline information was unavailable, a few data points in the new rows must be filled in. The baseline date for all individuals is stored as a constant in the Baseline.date column, so this value can be copied over to the Date column of each baseline row. We can then fill in 0s for all other missing values, under the assumption that each individual starts the study without any instance of the case definitions (indicated by 0) and that the baseline visit, being the first observation, occurs at time_to_row equal to 0.

LBPwBaseline$Date <- ifelse(is.na(LBPwBaseline$Date), LBPwBaseline$Baseline.date, LBPwBaseline$Date)
LBPwBaseline[is.na(LBPwBaseline)] <- 0

Any columns that are constant within subject groups are carried over to the new row, while anything that changes over time (row-wise) is left as NA.

Through exploration, a few cases can be found where case.lbp is 0 while another “case.” column is 1.

LBPwBaseline[404:406,1:8]
#>     sid Baseline.date  Date time_to_row case.lbp case.med case.sc case.lt
#> 404  27    2004-05-25 13147         584        1        1       1       0
#> 405  27    2004-05-25 13179         616        0        1       1       0
#> 406  27    2004-05-25 13236         673        1        1       1       1

Intuitively, this does not make sense, as these other cases are defined in terms of the subject having LBP and therefore imply case.lbp equal to 1. These were most likely data entry errors in the case.lbp column, so case.lbp is recoded to 1 whenever another case value is 1.

LBPwBaseline$case.lbp <- ifelse(LBPwBaseline$case.lbp == 0 &
                               (LBPwBaseline$case.med == 1 | 
                                LBPwBaseline$case.sc == 1 | 
                                LBPwBaseline$case.lt == 1),
                                1, 
                                LBPwBaseline$case.lbp)
LBPwBaseline[404:406,1:8]
#>     sid Baseline.date  Date time_to_row case.lbp case.med case.sc case.lt
#> 404  27    2004-05-25 13147         584        1        1       1       0
#> 405  27    2004-05-25 13179         616        1        1       1       0
#> 406  27    2004-05-25 13236         673        1        1       1       1

As is, the data set is in a format that can be used for a multi-state model analysis such as that provided by the msm package. That package takes the state as a single column, so the multiple LBP columns cannot be passed in their current form. The events2state() function can be used to consolidate those multiple columns into a single column based on the different combinations of values.

LBP_msm <- events2state(LBPwBaseline, events = c("case.lbp",
                                                 "case.med",
                                                 "case.sc",
                                                 "case.lt"))
#> Combination Levels: 0.0.0.0 1.0.0.0 1.1.0.0 1.0.1.0 1.1.1.0 1.0.0.1 1.1.0.1 1.0.1.1 1.1.1.1
#> Numbered Levels: 1 2 3 4 5 6 7 8 9

In the code above, the values of the 4 input columns combine to give 9 total combinations, which are output to the console when the function runs. By default, these combinations are converted to numbers through the number = TRUE argument. The interaction() function handles the combination of columns and by default drops unused combinations; this can be changed by setting drop = FALSE.
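The combination logic itself is simple. The sketch below is hypothetical Python for illustration (not wlsd or interaction()); note that it numbers levels lexicographically, whereas interaction() has its own level ordering, so the numbering may differ.

```python
# Minimal sketch (not wlsd itself): collapse several event columns into a
# single state label per row, then number the combinations actually
# observed, mimicking interaction() with drop = TRUE.
def events_to_state(rows, event_keys):
    labels = [".".join(str(r[k]) for k in event_keys) for r in rows]
    levels = sorted(set(labels))          # observed combinations only
    numbering = {lab: i + 1 for i, lab in enumerate(levels)}
    return [numbering[lab] for lab in labels]

rows = [
    {"case.lbp": 0, "case.med": 0, "case.sc": 0, "case.lt": 0},
    {"case.lbp": 1, "case.med": 0, "case.sc": 0, "case.lt": 0},
    {"case.lbp": 1, "case.med": 1, "case.sc": 0, "case.lt": 0},
    {"case.lbp": 0, "case.med": 0, "case.sc": 0, "case.lt": 0},
]
print(events_to_state(rows, ["case.lbp", "case.med", "case.sc", "case.lt"]))
# -> [1, 2, 3, 1]
```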

The LBP_msm data set is suitable input for the msm package. This is largely the form used for the analysis in Ingulli (2020) which has a section on using the msm package for multi-state model analysis of this data set.

In order to transition the data set to counting process format, the long2cp() function is used below.

LBP_cp <- long2cp(LBPwBaseline, id = "sid", time = "time_to_row", status = c("case.lbp", "case.med", "case.sc", "case.lt"), drop = TRUE)
LBP_cp[1:5,1:10]
#>   sid Baseline.date  Date time1 time2 case.lbp case.med case.sc case.lt gender
#> 1   1    2004-07-01 12600     0    53        0        0       0       0      F
#> 2   1    2004-07-01 12653    53    91        0        0       0       0      F
#> 3   1    2004-07-01 12691    91   153        0        0       0       0      F
#> 4   1    2004-07-01 12753   153   195        0        0       0       0      F
#> 5   1    2004-07-01 12795   195   229        0        0       0       0      F

Any subjects with only 1 row are dropped, as they do not have enough time information for the desired analysis. The survival package contains models that take counting process format data as input. Since the LBP_cp data set has multiple occurrences of LBP, a repeated (recurrent) event model might be more suitable.

If a simple Cox proportional-hazards model for the time to first instance of LBP is desired, all observations after the first occurrence of LBP can be truncated using the takefirst() function.

LBP_cp2 <- takefirst(LBP_cp, id = "sid",
                     criteria.column = "case.lbp",
                     criteria = 1)

LBP_cp2[11:14,1:8]
#>    sid Baseline.date  Date time1 time2 case.lbp case.med case.sc
#> 11   1    2004-07-01 13014   414   455        0        0       0
#> 12   1    2004-07-01 13055   455   517        1        1       0
#> 13   2    2004-07-01 12600     0    11        0        0       0
#> 14   2    2004-07-01 12611    11    53        1        1       1

In this form, LBP_cp2 has all rows up to and including the first time case.lbp is 1 for each subject. From this point, the data can be supplied to the coxph() function in survival.
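The per-subject truncation performed by takefirst() amounts to a simple scan. The Python below is a hypothetical sketch of that logic for one subject, not the wlsd function itself.

```python
# Minimal sketch (not wlsd itself): keep one subject's rows up to and
# including the first row where the criterion column equals the target
# value; later rows are truncated.
def take_first(rows, criterion, value):
    kept = []
    for row in rows:
        kept.append(row)
        if row[criterion] == value:
            break
    return kept

rows = [
    {"time2": 11, "case.lbp": 0},
    {"time2": 53, "case.lbp": 1},   # first occurrence: last row kept
    {"time2": 91, "case.lbp": 0},
]
print(take_first(rows, "case.lbp", 1))
# -> [{'time2': 11, 'case.lbp': 0}, {'time2': 53, 'case.lbp': 1}]
```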

The transition to count format is done through long2count(). Only the LBP related values and time itself vary row-wise. As events, the LBP columns can be counted through the event argument, as demonstrated below.

LBP_count <- long2count(LBPwBaseline, "sid", event = c("case.lbp",
                                                 "case.med",
                                                 "case.sc",
                                                 "case.lt"))

LBP_count[1:5,c(1,20:24)]
#>     sid case.lbp.counts case.med.counts case.sc.counts case.lt.counts
#> 1     1               2               1              0              0
#> 112   2               6               2              1              1
#> 223   3              15               2              1              0
#> 334   4               5               0              0              0
#> 412   5               9               0              0              0
#>     count.weight
#> 1             18
#> 112           16
#> 223           17
#> 334           18
#> 412           16

Turning to the LBP_msm data set with its state column, the state argument can be used, and the LBP variables can be summed to count their occurrences over time.

LBP_state <- long2count(LBP_msm, "sid", state = "state", FUN = sum)
LBP_state[1:9,c(1,5:8,24:26)]
#>   sid case.lbp case.med case.sc case.lt state state.counts count.weight
#> 1   1        2        1       0       0     1           16           18
#> 2   1        2        1       0       0     2            1           18
#> 3   1        2        1       0       0     3            1           18
#> 4   1        2        1       0       0     4            0           18
#> 5   1        2        1       0       0     5            0           18
#> 6   1        2        1       0       0     6            0           18
#> 7   1        2        1       0       0     7            0           18
#> 8   1        2        1       0       0     8            0           18
#> 9   1        2        1       0       0     9            0           18

When the state argument is used, all 9 states are counted for each subject, and the other LBP columns are summed, giving the total number of occurrences for each. Either count format data set should be suitable for a count regression model. The former offers a more straightforward approach relating a single count to covariates of interest, while the latter might suit a model with a random effect and a reduced state space. See Ingulli (2020) for an example analysis using count regression on only the counts of case.lbp.

References

Collett, David. 2015. Modelling Survival Data in Medical Research. Chapman & Hall/CRC.
Garg, Arun, Kurt Hegmann, J. Moore, Jay Kapellusch, Matthew Thiese, Sruthi Boda, Parag Bhoyar, et al. 2013. “Study Protocol Title: A Prospective Cohort Study of Low Back Pain.” BMC Musculoskeletal Disorders 14 (84): 84. https://doi.org/10.1186/1471-2474-14-84.
Hsiao, Cheng. 2003. Analysis of Panel Data. Cambridge University Press.
Ingulli, Charles. 2020. “A Survey of Statistical Methods for Investigating Risk of Low Back Pain in a Cohort of Manufacturing Workers.” Master’s thesis, American University.
Jackson, Christopher. 2024. “Multi-State Modelling with R: The msm Package.” https://cran.r-project.org/package=msm.
Therneau, Terry. 2020. “A Package for Survival Analysis in R.” https://cran.r-project.org/package=survival.
Therneau, Terry, Cindy Crowson, and Elizabeth Atkinson. 2020. “Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model.” https://cran.r-project.org/package=survival.