---
title: "wlsd: Wrangling Longitudinal Survival Data"
author: "Charles Ingulli"
output: rmarkdown::html_vignette
bibliography: wlsd.bib
nocite: '@*'
vignette: >
  %\VignetteIndexEntry{wlsd: Wrangling Longitudinal Survival Data}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,     # show code in the knitted output
  collapse = TRUE, # merge code and its output into a single block
  comment = "#>"   # prefix used for output lines
)
library(wlsd)
```

# Introduction

The `wlsd` (wrangling longitudinal survival data) package supports processing of repeated observation data for model building in survival analysis. A longitudinal study, or panel study, is a repeated observation method by which subjects are monitored over time [@panelbook]. This type of study sees application in analyzing disease progression, such as heart deterioration after transplant [@msmvignette]. Preparing raw data for survival analysis models requires knowing the format each model expects, as well as the time and effort to transform the data. The functions in this package automate some of the preparation steps to simplify formatting and reduce the time it takes.

Two common longitudinal data formats are the long format and the counting process format. The first table below shows a longitudinal data set in long format.

```{r long-tab, echo = FALSE, results = "asis"}
knitr::kable(long_data, caption = "Long Format Data", label = "long-tab")
```

Information in the table above is for three subjects with identifiers denoted by `id`. Each row corresponds to a particular observation for a subject, with `time` recording when the observation occurred. Here, each subject starts at time 0 and every subsequent value represents the number of time units since time 0.
There are 3 additional variables: `event` and `var1` vary with time, while `var2` is constant over time. This format is easily legible and can be quickly filtered. Using the first row as an example for interpretation, at `time` 0 subject 1 had an `event` value of 0, a `var1` value of 10.4, and a `var2` value of 10.

The next table shows the same data in counting process format.

```{r cp-tab, echo = FALSE, results = "asis"}
knitr::kable(cp_data, caption = "Counting Process Format")
```

The biggest difference in this format is how time is recorded: across columns instead of down rows. Each row carries two columns for the start and stop times of an interval, namely `time1` and `time2`. Mathematically, the interval is open on the left and closed on the right. That is, $(\text{time1}, \text{time2}] = \{x \mid \text{time1} < x \leq \text{time2}\}$. Consequently, multiple time intervals for a particular subject will not overlap. Any time-dependent covariates are interpreted over the interval, while any status indicator variables are interpreted at the end of the interval.

Looking at the first row in the table above, `event` and `var1` are the two time-dependent columns, with `event` being a status indicator and `var1` being a covariate. The first value of `var1` is 10.4, so subject 1 has this value over the interval from time 0 to time 31. Since the value of `event` is 0, this subject does not have an event at the end of this interval. Looking at the second interval for subject 1, the value of `var1` is 11.3 over the interval from time 31 to time 64, and the subject again does not have an event at the end of the interval. Subject 1 has an event at time 96, the end of the third interval.

Note that some information is lost in the transition from long format to counting process format: the time interval formulation means that each subject's first status indicator value and last covariate value will be omitted.
For subject 1, this is the first value of 0 for `event` at time 0 and the last value of `var1`, 17.5, at time 96.

A detailed description of the counting process format and its use in survival analysis is given by @collett. @time-vignette also describes the counting process format, with use case examples of how it is leveraged in survival analysis.

## Data Transitions

The figure below depicts the data formats and the possible transitions between them in the package.

```{r transition_graphic, echo=FALSE, out.width="50%", fig.cap="Format transitions for wlsd."}
knitr::include_graphics("../man/figures/wlsd-format-drawing.png")
```

The solid arrows represent explicitly defined transitions that have their own function, while the single dashed arrow represents a transition that is possible but does not have a dedicated function. This transition is discussed further in the "Aggregating Longitudinal Data to Count Format" section below.

### Long Format to Counting Process Format

The `long2cp()` function transitions data from long format to counting process format. It requires arguments for `data`, `id`, `time` and `status`. The `data` argument takes the `data.frame` object to be transitioned, while `id` names the subject identifier column. The `time` argument names the single time column that is to be expanded into an interval across two columns, while the `status` argument names any status indicator variables that need to be transitioned accordingly.

The value of any status indicator variable is assumed to be associated with the end point of the interval, so any column names supplied to the `status` argument are transitioned under this assumption. As a result, the first value is dropped for each subject. All other columns are processed such that each value is associated with the first time point of the interval, which results in the last value for each subject being dropped.
For constant columns this is irrelevant, but it is important to note for any other time-dependent covariates that are not named in the `status` argument. The code below shows an example using the long format data from earlier.

```{r long2cp}
long2cp(data = long_data, id = "id", time = "time", status = "event")
```

Looking at the output above for the subject with `id` of 1, since the `event` column was supplied to the `status` argument, the values associated with the end points of that column are kept. All other time-dependent covariates keep the values associated with the starting points; in this case, that is the `var1` column. Using row 3 as an example, individual 1 experienced the event at time 96, which is kept as the value for `event`. However, their last recorded value for `var1` was 12.7 at time 64, so 12.7 is used over the interval from time 64 to time 96.

By default, the function will keep all subjects through the `drop = FALSE` argument. However, it may be desirable to drop individuals that have only 1 row of data in long format, as they do not have enough information to form a time interval. Consider the `dropped_long` data set below, which is missing the 2nd observation for `id` 3. This subject has only one time point, so a start and stop interval is not feasible. However, to preserve the data set that is input to `long2cp()`, the subject is kept in the output with an interval that starts and stops at the one time point given. All covariates are then associated with that one point.

```{r long2cp no drop}
dropped_long <- long_data[1:8,]
long2cp(dropped_long, id = "id", time = "time", status = "event", drop = FALSE)
```

If the subject is intended to be dropped for analysis purposes, the `drop = TRUE` argument can be used to exclude these rows from the output data set.
```{r long2cp drop}
long2cp(dropped_long, id = "id", time = "time", status = "event", drop = TRUE)
```

The single row for `id` value 3 is absent from the output above.

### Counting Process Format to Long Format

The transition from counting process format to long format is done through the `cp2long()` function. The arguments are similar to those of the previous function, but two time columns are taken as input instead of one, so that the two end points of each interval can be consolidated into a single column.

```{r cp2long}
cp2long(data = cp_data, id = "id", time1 = "time1", time2 = "time2", status = "event")
```

Note that this example produces a few `NA` values that were not present in the original long format data. This is due to the information lost in the counting process transition mentioned earlier. By default, the function makes no assumptions about this data and simply leaves `NA` for any unknown values. Any column names supplied to the `status` argument will be missing the first value for each `id`: these values are assumed to be associated with the end point of each interval, so there is no value for the starting point of the first interval. Similarly, any other time-dependent columns will be missing the last value for each `id`, as their values are associated with the start of each interval, so there is no value for the end of the last interval. If desired, the `fill = TRUE` option may be used to insert values for any constant columns that would otherwise end up with an `NA` for this reason.

```{r cp2long fill}
cp2long(cp_data, id = "id", time1 = "time1", time2 = "time2", status = "event", fill = TRUE)
```

Here, the `var2` column is constant within subject, so its `NA` values were populated with that constant value.

### Aggregating Longitudinal Data to Count Format

It may be useful to aggregate longitudinal data into count format, with tallies of repeated observations. Transitions to the count format are done through the `long2count()` function.
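The core aggregation idea can be sketched in plain base R. The data frame and column names below are hypothetical, and this is an illustration of the general tallying idea, not the package's implementation:

```r
# Illustrative base-R sketch of aggregating long data into count format.
# The data frame and column names here are made up, not from wlsd.
long <- data.frame(id    = rep(1:2, each = 3),
                   event = c(0, 1, 1, 0, 0, 1),
                   var1  = c(10, 12, 14, 9, 9, 10))

# Tally event occurrences per subject ...
counts <- aggregate(event ~ id, data = long, FUN = sum)
# ... and summarize the time-varying covariate with a chosen function.
counts$var1 <- tapply(long$var1, long$id, mean)
counts  # counts$event is c(2, 1)
```

Here the event column is summed per subject while the covariate is averaged, mirroring the count-format idea of one summary row per subject.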
While originally conceived to perform the transition from long format to count format, the same logic applies to the transition from counting process format to count format, so this function can be used for both. The major difference between the long format and counting process format is how time is organized, so the row-wise aggregation of covariates works the same way in both; the aggregated values will differ across formats, however, due to the one-value loss of information from long format to counting process format.

The `long2count()` function requires a data set and an identifier column to perform aggregation operations. It also requires that a column name be supplied to either the `event` or `state` argument. This determines how events of interest are treated under the observation scheme of the data set: the `event` argument treats the given column or columns as numeric counts of observations, while the `state` argument treats the column largely as a factor with different levels. Consider the first example below with the long format data and the single event column called `event`.

```{r long2count}
long2count(long_data, id = "id", event = "event")
```

Since the column supplied to `event` records counts of occurrences, the output column is a sum over rows. Any columns that are constant within the `id` grouping are carried over to the output as-is, while any that are time varying (that is, change row-wise) are aggregated into the final count format output. The default aggregation function for any time-varying covariate is `mean()`, but this can be changed through the `FUN` argument. For example, setting `FUN = max` will take the maximum value over time and use that as the value in the aggregated count data set.

```{r long2count fun}
long2count(long_data, id = "id", event = "event", FUN = max)
```

For multiple events, a vector of event column names can be supplied to the `event` argument.
```{r long2count events}
long_data2 <- long_data
long_data2$event2 <- c(2,2,3,1,2,3,2,1,3)
long2count(long_data2, id = "id", event = c("event","event2"))
```

If the observed variable is to be treated as a state instead of an event, the `state` argument will handle the aggregations accordingly. Consider the `event` column as a binary state where 0 is coded as healthy and 1 as unhealthy. The `state` argument will instead count the number of occurrences of each state.

```{r long2count state}
long2count(long_data, id = "id", state = "event")
```

In the example above, subject 1 now has two rows, one for each level of `event`. From the `event.counts` column, this subject has 3 occurrences of state 0 (healthy) and 1 occurrence of state 1 (unhealthy). This can be expanded to larger state spaces with more detailed descriptions.

A column name must be supplied to either the `event` or `state` argument. The `event` argument will handle multiple columns as events, but the `state` argument accepts only one column to be treated as a state. Combinations of `event` and `state` are allowed, such as 1 state and 1 event, or 1 state and multiple events.

## Helper Functions

A few other functions within the `wlsd` package help with small niche cases that may arise in practice. They are all formalizations of preparation steps used for the `LBP` data set (discussed later). The three notable functions are `basedate()`, `events2state()`, and `takefirst()`. The first, `basedate()`, creates a new row for each individual which can serve as their baseline or first visit and then be populated with new information. The `events2state()` function consolidates multiple columns of event observations into a single state column. That is, it aggregates multiple columns into a single factor-like column based on the combinations of values from the input columns.
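The combination idea behind such a consolidation can be illustrated with base R's `interaction()` function. The columns below are made up for illustration, and this sketch is not the package's implementation:

```r
# Hypothetical binary event columns; not from the wlsd package.
df <- data.frame(case.a = c(0, 1, 1, 0),
                 case.b = c(0, 0, 1, 0))

# interaction() builds one factor from the observed value combinations.
df$state <- interaction(df$case.a, df$case.b, drop = TRUE)
levels(df$state)                      # distinct combinations: "0.0" "1.0" "1.1"
df$state.num <- as.integer(df$state)  # numeric recoding of the combinations
```

Each distinct combination of the input columns becomes one level of the new factor, which can then be recoded as a number.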
The `takefirst()` function keeps, for each subject, the rows up to and including the first row that meets a given criterion. These functions are discussed in greater detail in the "Applications" section below, as the example context clarifies their use.

# Applications

## Example: Low-Back Pain Data Wrangling

To demonstrate the use cases of this package, consider this example using a low-back pain (LBP) data set that needs to be prepared for fitting several models in a large survival analysis. The data set originally comes from @garg and was used in @mythesis for a comprehensive survival analysis, from which some of this example originates. See the appendix of @mythesis for a larger discussion on preparing this data set using some early versions of this package.

The data is available within the package via `LBP`. It consists of 466 individuals with associated low back pain statuses over a 6-year observation period. The first 5 rows and the first 10 columns of data are shown below.

```{r LBP head}
LBP[1:5,1:10]
```

The data set is organized in long format with subject column `sid`, primary time column `time_to_row`, and several time-varying LBP variables including `case.lbp`, `case.med`, `case.sc`, and `case.lt`. These last 4 variables represent the binary occurrence of LBP, LBP requiring medication, LBP requiring medical care, and LBP resulting in lost time from work, respectively, at the time of observation. See the help file via `help(LBP)` for more information.

When this particular data set was constructed, the baseline data was not available; that is, the first observations were missing. For analysis purposes, the assumption was made that all individuals would be free from all LBP at this time point, and the analysis would continue. That assumed information can be added to the data set through the `basedate()` function.
In the code below, a new row is added for each individual that can then be populated with the new LBP values.

```{r baseline}
LBPwBaseline <- basedate(LBP, "sid")
LBPwBaseline[1:5,1:10]
```

Any columns that are constant within subject groups are carried over to the new row, while anything that changes over time, or row-wise, is left as `NA`. As baseline information is unavailable, a few of these data points must be filled in. The baseline date for all individuals is stored as a constant in another column, `Baseline.date`, and can be copied over to the baseline `Date` column for each individual. Subsequently, we can fill in 0s for all other missing data points: we assume that each individual starts the study without any instances of the case definitions (indicated by 0), and the baseline visit, being the first observation, has `time_to_row` equal to 0.

```{r update baseline}
LBPwBaseline$Date <- ifelse(is.na(LBPwBaseline$Date), LBPwBaseline$Baseline.date, LBPwBaseline$Date)
LBPwBaseline[is.na(LBPwBaseline)] <- 0
```

Through exploration, a few cases can be found where `case.lbp` is 0 while another "case." column is 1.

```{r 0 example}
LBPwBaseline[404:406,1:8]
```

Intuitively, this does not make sense, as the other case definitions presuppose that the subject has LBP and therefore that `case.lbp` is 1. These were most likely data entry errors in the `case.lbp` column, so `case.lbp` is recoded to 1 whenever another case value is 1.

```{r case progression}
LBPwBaseline$case.lbp <- ifelse(LBPwBaseline$case.lbp == 0 & (LBPwBaseline$case.med == 1 | LBPwBaseline$case.sc == 1 | LBPwBaseline$case.lt == 1), 1, LBPwBaseline$case.lbp)
LBPwBaseline[404:406,1:8]
```

As is, the data set is in a format that can be used for a multi-state model analysis such as that provided by the `msm` package. However, that package takes its state argument as a single column, so the multiple LBP columns cannot be passed in their current form.
The `events2state()` function can be used to consolidate those multiple columns into a single column based on the different combinations of values.

```{r events2state}
LBP_msm <- events2state(LBPwBaseline, events = c("case.lbp", "case.med", "case.sc", "case.lt"))
```

In the code above, the 4 input columns combine to give 9 total combinations. The combination levels are output to the console when the function runs. By default, these combinations are converted to numbers through the `number = TRUE` argument. Internally, the `interaction()` function handles the combinations of columns and, by default, drops unused combinations; this can be toggled with the `drop = FALSE` argument. The `LBP_msm` data set is suitable input for the `msm` package. This is largely the form used for the analysis in @mythesis, which has a section on using the `msm` package for multi-state model analysis of this data set.

To transition the data set to counting process format, the `long2cp()` function is used below. Any subjects with only 1 row are dropped via `drop = TRUE`, since they do not have enough time information for the desired analysis.

```{r LBP cp}
LBP_cp <- long2cp(LBPwBaseline, id = "sid", time = "time_to_row", status = c("case.lbp", "case.med", "case.sc", "case.lt"), drop = TRUE)
LBP_cp[1:5,1:10]
```

The `survival` package contains analysis models that take data in counting process format as input. The `LBP_cp` data set has multiple occurrences of LBP, so a repeated event or recurrent model might be more suitable. If a simple Cox proportional-hazards model for the time to first instance of LBP is desired, all observations after the first occurrence of LBP can be removed using the `takefirst()` function.

```{r takefirst}
LBP_cp2 <- takefirst(LBP_cp, id = "sid", criteria.column = "case.lbp", criteria = 1)
LBP_cp2[11:14,1:8]
```

In this form, `LBP_cp2` has all rows up to and including the first time `case.lbp` is 1 for each subject.
From this point, the data can be supplied to the `coxph()` function in `survival`.

The transitions to count format are done through `long2count()`. Only the LBP-related values and time itself vary row-wise. As events, the LBP columns can be counted through the `event` argument, as demonstrated below.

```{r LBP count}
LBP_count <- long2count(LBPwBaseline, "sid", event = c("case.lbp", "case.med", "case.sc", "case.lt"))
LBP_count[1:5,c(1,20:24)]
```

For the `LBP_msm` data set with its state column, the `state` argument can be used, and the LBP variables can be summed to count their occurrences over time.

```{r LBP state}
LBP_state <- long2count(LBP_msm, "sid", state = "state", FUN = sum)
LBP_state[1:9,c(1,5:8,24:26)]
```

When using the `state` argument, all 9 states are counted for each subject. The other LBP columns are summed, giving the total number of occurrences of each. Either count format data set should be suitable for some sort of count regression model. The former might offer a more straightforward approach, relating a single count to covariates of interest. The latter might be suitable with a random effect and a reduced state space. See @mythesis for an example analysis using count regression on only the counts of `case.lbp`.

# References