---
title: "Getting Started with `integrity`"
author: Sol Libesman, David Nguyen, Dario Strbenac, Jie Kang, Lene Seidler, Kylie Hunter
        The University of Sydney, Australia.
output:
  html_document:
    toc: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{An Introduction to the integrity Package}
---

## Package overview

Increasing concerns about the trustworthiness of research have prompted calls to scrutinise studies' individual participant data (IPD), that is, the de-identified raw line-by-line data for each participant in a study. `integrity` was developed to support application of the IPD Integrity Tool [(Hunter et al., 2024A)](https://onlinelibrary.wiley.com/doi/full/10.1002/jrsm.1738). It enables structured and transparent assessment of the integrity and trustworthiness of randomised controlled trials (RCTs) using IPD, and informs decisions about whether RCTs should be included in evidence synthesis or considered suitable for publication. Further information about the development of the tool may be found in [Hunter et al. (2024B)](https://onlinelibrary.wiley.com/doi/full/10.1002/jrsm.1739).

If you use our package please cite:

    Hunter KE, Aberoumand M, Libesman S, Sotiropoulos JX, Williams JG, Aagerup J, Wang R, Mol BW, Li W, Barba A, Shrestha N, Webster AC, Seidler AL. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. *Research Synthesis Methods*. 2024 Nov;15(6):917-39.

## How to use the package

Each step of the workflow is illustrated using a case study on umbilical cord management at preterm birth, based on a de-identified and altered data set from the [iCOMP study](https://doi.org/10.1016/S0140-6736(23)02468-6). The main goal of that study was to determine the optimal umbilical cord management strategy at preterm birth, such as milking or delayed cord clamping.

## Step 1: Data loading

Load the integrity package into R.

```{r}
if(requireNamespace("pkgload", quietly = TRUE) && file.exists("../DESCRIPTION")) {
  pkgload::load_all("..")
} else {
  library(integrity)
}
```

Next, import the data set you wish to examine into R.
There are a variety of functions in R or CRAN packages to do this:

- `read.csv` and `read.table` functions to import comma-separated and tab-separated text files.
- `read_sas` for SAS, `read_sav` for SPSS and `read_dta` for Stata in the CRAN package [haven](https://haven.tidyverse.org/).
- `read_excel` function for Microsoft Excel in the CRAN package [readxl](https://readxl.tidyverse.org/).

Case study: The altered iCOMP case study is loaded with the `integrity` package. The data are in a Microsoft Excel file.

```{r, warning = FALSE}
library(readxl)
examplePath <- system.file("extdata", "dataset.xlsx", package = "integrity")
dataset <- read_excel(examplePath, sheet = 1)
dataset[1:5, ]
```

In the tibble above, the participant identifiers can be seen (`infant_ID`), as well as the date of randomisation (`rand_date`) and the first few clinical covariates.

## Step 2: Data preparation

The following elements need to be paired with the corresponding column names in your data set:

- `participantID`: the name of the column which corresponds to the unique participant identifier (this variable is mandatory).
- `enrollment`: a list of the names of three columns corresponding to `start` (date of first participant enrolment), `randomisation` (date of participant randomisation) and `end` (date of last participant enrolment).
- `baseline`: a list with entries named `dichotomous`, `polytomous` and `numeric`, each giving the name(s) of the column(s) which hold baseline measurements of that type.
- `intervention`: the name of the column specifying the intervention or group allocation for each individual (this variable is mandatory).
- `outcome`: lists named `common` and `rare`, with sublists named `dichotomous`, `numeric` or `polytomous`, containing the names of columns of those data types for the outcomes assessed.
- `correlated`: a named list in which each entry holds two column names that are expected to be correlated.
- `unexpected`: a named list of column names with values that are not expected to be seen. `days` is a special sublist and applies to date columns, which are converted into days of the week before comparison. It must have two elements: `names`, which are the unexpected day names, and `locale`, which is the locale in which those day names are specified.

Only `participantID` is strictly required. `enrollment`, `baseline`, `intervention`, `outcome`, `correlated`, and `unexpected` should be supplied when available; if a section is omitted, the checks that depend on it will be skipped.

The variable types and expectations need to be defined before running the checks. The package accepts a single metadata structure, which may be created in either of two formats depending on your preference: a list written directly in an R script or R Markdown document, or an Excel template workbook.

Coding the list directly in R, as below, is often the simplest option for users already working inside an R script or an R Markdown document, because the metadata can be written directly next to the analysis code. The R code below may be used as a template and altered to match the relevant variables in a new data set.

```{r}
dataset_info <- list(
  participantID = "infant_ID",
  enrollment = list(
    start = "enrol_start",
    randomisation = "rand_date",
    end = "enrol_end"
  ),
  baseline = list(
    dichotomous = c("sex"), # add more variables if needed e.g., c("sex", "respiratory_support")
    #polytomous = c("ordinal_or_nominal_variable"), # no polytomous baseline variables in this data set so it's commented out
    numeric = c("mat_age", "GA_weeks", "birthweight") # add more numeric variables if needed
  ),
  intervention = "treatment_cat",
  outcome = list(
    common = list(
      dichotomous = c("IVH", "NEC"),
      polytomous = c("CLD"), # if certain variable types don't exist, just delete the relevant line.
      numeric = c("hospital_days")
    ),
    rare = list(
      dichotomous = c("inf_death") # add more variables if needed e.g., c("inf_death", "severe_IVH")
    )
  ),
  correlated = list(
    timeAndSize = c("GA_weeks", "birthweight")
  ),
  unexpected = list(
    days = list(
      names = c("Saturday", "Sunday"),
      locale = "C"
    ),
    mat_age = c("less than 10", "greater than 50"),
    GA_weeks = c("less than 22", "greater than 37")
  )
)
```

In the `unexpected$days` section, `locale` controls the language used when R converts dates into weekday names. `locale = "C"` is the most likely option to use because it returns standard English weekday names such as `Saturday` and `Sunday`, which usually match the values entered in `names`. Other locale values are possible if your system or data set uses a different language or naming convention, but `"C"` will usually be the safest default.

An Excel template is also available if users prefer to enter the metadata in a spreadsheet. The workbook has one row per entry and four columns named `level_1`, `level_2`, `level_3`, and `value`. Repeated values, such as several numeric baseline variables, are entered as multiple rows. This workbook can be edited in Microsoft Excel, then imported into R with `read_metadata_excel()`.

```{r, eval=FALSE}
example_excel_path <- system.file("extdata", "variables_template.xlsx", package = "integrity")
dataset_info <- read_metadata_excel(example_excel_path)
```

## Step 3: Running integrity checks

Simply provide the data frame and the data information to `run_checks`. The function first performs automated data checking and cleaning to ensure that all variables defined in `dataset_info` are present in the data set. The function will also convert columns nominated as categorical variables into factors where required, and remove any columns containing only missing values.
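Before calling `run_checks`, the presence check described above can be approximated by hand. The sketch below is illustrative only: `toy_data`, `toy_info` and `column_sections` are hypothetical names, and the logic is an assumption about what such a check involves, not the package's internal code. The `unexpected` section is left out because its entries are values rather than column names.

```{r, eval=FALSE}
# Toy data and metadata (hypothetical; not the iCOMP case study)
toy_data <- data.frame(
  id  = 1:4,
  arm = c("A", "B", "A", "B"),
  age = c(30, 28, 35, 31)
)
toy_info <- list(
  participantID = "id",
  intervention  = "arm",
  baseline      = list(numeric = c("age", "weight"))  # "weight" is deliberately absent
)

# Metadata sections whose entries are column names (assumption for this sketch)
column_sections <- c("participantID", "enrollment", "baseline",
                     "intervention", "outcome", "correlated")

# Flatten the nested lists into one vector of declared column names
declared <- unique(unlist(
  toy_info[intersect(column_sections, names(toy_info))],
  use.names = FALSE
))

# Columns named in the metadata but not found in the data
missing_columns <- setdiff(declared, names(toy_data))
missing_columns
```

Any name reported in `missing_columns` would point to a typo in the metadata or a column that was dropped during import.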
```{r}
result <- run_checks(dataset, dataset_info)
names(result)
```

This creates a list of result objects, including overall check tables, detailed per-variable tables for selected checks, plots, and summary tables. The output for each item below should be reviewed consecutively and rated using the decision guide and rating sheet in [Hunter et al. (2024A)](https://onlinelibrary.wiley.com/doi/full/10.1002/jrsm.1738).

## Step 4: Reviewing results by integrity domain

The sections below present the output from the integrity `run_checks` function, split under each domain and item.

### Domain 1: Unusual or repeated data patterns

**Item 1.1**: Repeating patterns within baseline variables

This item is performed manually by sorting and visually inspecting the data to identify repeating patterns within baseline variables, e.g. check whether values appear to repeat at regular intervals, which may indicate that rows were copied and pasted. Rare or unusual entries can be particularly useful for detecting such patterns; assess whether these entries recur systematically, such as every 11 rows. Perform these assessments using the original data set order, the randomisation order, and separately within each study group.

```{r}
item_1_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "1.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_1_1, row.names = FALSE)
```

**Item 1.2**: Repeating data patterns across baseline variables

This item looks for duplication across participants, e.g. do all participants with a height of 180 cm have the same weight? Duplicate entries for baseline variables are listed below (if there are none, no output will be printed).

```{r}
item_1_2 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "1.2",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_1_2, row.names = FALSE)
```

**Item 1.3**: Repeating data patterns across baseline variables and rare variables.
```{r}
item_1_3 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "1.3",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_1_3, row.names = FALSE)
```

**Item 1.4**: Bias in the terminal (rightmost) digits.

This item plots the terminal digit for the selected continuous variables (avoid variables that tend to be rounded or that lack precision). Inspect the bar charts for a biased or unexpected distribution.

```{r}
item_1_4 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "1.4",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_1_4, row.names = FALSE)
if("Terminal Digits" %in% names(result[["images"]])) {
  result[["images"]][["Terminal Digits"]]
}
```

### Domain 2: Baseline characteristics

**Item 2.1**: Excessively homogeneous distribution of binary baseline variables, i.e. loss of independence between consecutive variables

In RCTs we expect binary baseline data to occur in a manner independent of previous values (i.e., to occur randomly). The runs test examines whether baseline data occur in a random manner based on row order. Statistically significant (p < 0.05) chi-squared tests may be indicative of an integrity issue. Note: if the rows are not sorted chronologically by randomisation date and time, this test may be invalid.

```{r}
item_2_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "2.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_2_1, row.names = FALSE)
if(!is.null(result[["detail_tables"]][["2.1"]])) knitr::kable(result[["detail_tables"]][["2.1"]], row.names = FALSE)
```

**Item 2.2**: Excessive imbalances between groups in continuous baseline variables.

This item evaluates the mean and standard deviation of key continuous prognostic factors, split by treatment group.
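To make the idea of a per-group mean and standard deviation concrete, here is a minimal base R sketch on toy data. The data frame `toy` and its values are invented for illustration, and the code is not the package's implementation.

```{r, eval=FALSE}
# Toy data (hypothetical values, not from the case study)
toy <- data.frame(
  arm         = c("A", "A", "A", "B", "B", "B"),
  birthweight = c(900, 1100, 1000, 950, 1050, 1000)
)

# Mean and standard deviation of the baseline variable within each arm
group_means <- tapply(toy$birthweight, toy$arm, mean)
group_sds   <- tapply(toy$birthweight, toy$arm, sd)

summary_by_group <- data.frame(
  group = names(group_means),
  mean  = as.numeric(group_means),
  sd    = as.numeric(group_sds)
)
summary_by_group
```

In this toy example both arms have the same mean but one arm is twice as variable, which is the kind of pattern worth a closer look when reviewing the real output.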
```{r}
item_2_2 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "2.2",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_2_2, row.names = FALSE)
if(!is.null(result[["detail_tables"]][["2.2"]])) knitr::kable(result[["detail_tables"]][["2.2"]], row.names = FALSE)
```

**Item 2.3**: Excessive imbalances in baseline categorical variables between groups.

This item assesses whether counts of baseline categorical variables are significantly different (p < 0.05) between groups.

```{r}
item_2_3 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "2.3",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_2_3, row.names = FALSE)
if(!is.null(result[["detail_tables"]][["2.3"]])) knitr::kable(result[["detail_tables"]][["2.3"]], row.names = FALSE)
```

**Item 2.4**: Significant difference in variance of continuous baseline variables between groups.

This item uses Levene's test, which checks whether there is a significant difference in variability between groups.

```{r}
item_2_4 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "2.4",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_2_4, row.names = FALSE)
if(!is.null(result[["detail_tables"]][["2.4"]])) knitr::kable(result[["detail_tables"]][["2.4"]], row.names = FALSE)
```

### Domain 3: Correlations

**Item 3.1**: No association between variables known to be highly correlated.

This item plots the relationship between selected continuous variables and calculates the Pearson correlation coefficient (R) and associated p value. Check whether the expected correlations are present.
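For readers unfamiliar with the statistic, a Pearson correlation and its p value can be obtained in base R with `cor.test`. The sketch below uses invented values for two variables that would be expected to rise together; the vector names are hypothetical and the code is independent of the package.

```{r, eval=FALSE}
# Toy values (hypothetical): gestational age in weeks and birthweight in grams
ga_weeks    <- c(24, 26, 28, 30, 32)
birthweight <- c(600, 850, 1100, 1400, 1750)

# Pearson correlation coefficient and the associated p value
pearson <- cor.test(ga_weeks, birthweight, method = "pearson")
r_value <- unname(pearson$estimate)
p_value <- pearson$p.value

c(R = r_value, p = p_value)
```

A coefficient close to zero for a pair of variables like this would be the warning sign this item is designed to surface.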
```{r}
item_3_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "3.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_3_1, row.names = FALSE)
correlation_plots <- setdiff(names(result[["images"]]), c("Terminal Digits", "Cumulative Allocation", "Days"))
for(plot_name in correlation_plots) {
  print(result[["images"]][[plot_name]])
}
```

### Domain 4: Date violations

**Item 4.1**: Individual enrolment dates do not fit within study start and end dates.

This item examines whether randomisation dates for each individual fall within the enrolment period.

```{r}
item_4_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "4.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_4_1, row.names = FALSE)
item_4_1_dates <- result[["detail_tables"]][["4.1"]]
if(!is.null(item_4_1_dates)) {
  knitr::kable(item_4_1_dates, row.names = FALSE)
}
```

**Item 4.2**: Dates (or visits) are not in logical order.

This item requires study-specific repeated visit or event-date variables; an example of a violation is a follow-up date occurring before enrolment.

```{r}
item_4_2 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "4.2",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_4_2, row.names = FALSE)
```

### Domain 5: Patterns of allocation

**Item 5.1**: Non-random allocation patterns: plot.

The plot below shows the cumulative number of participants allocated to each treatment arm by date of randomisation. We expect the cumulative number of randomised participants in each group to be similar if 1:1 allocation is used. Assess whether the cumulative lines for the treatment groups diverge markedly from each other. Note: the graphs will only appear when the date of randomisation is provided.
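The quantity behind such a plot is a running count per arm in randomisation order. A minimal sketch on toy data follows; the data frame `toy`, its dates and the `max_gap` summary are assumptions made for illustration and are not part of the package output.

```{r, eval=FALSE}
# Toy allocations (hypothetical), deliberately not in date order
toy <- data.frame(
  rand_date = as.Date(c("2020-01-06", "2020-01-03", "2020-01-02",
                        "2020-01-08", "2020-01-07", "2020-01-10")),
  arm = c("A", "B", "A", "B", "A", "B")
)

# Sort by randomisation date before counting
toy <- toy[order(toy$rand_date), ]

# One column per arm: running number of participants allocated so far
arms <- sort(unique(toy$arm))
cumulative <- sapply(arms, function(a) cumsum(toy$arm == a))

# Largest difference between the two running counts at any point
max_gap <- max(abs(cumulative[, "A"] - cumulative[, "B"]))
cumulative
max_gap
```

Under 1:1 allocation the two columns should track each other closely, so a large or steadily growing gap is what to look for in the plot.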
```{r}
item_5_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "5.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_5_1, row.names = FALSE)
if("Cumulative Allocation" %in% names(result[["images"]])) {
  result[["images"]][["Cumulative Allocation"]]
}
```

**Item 5.2**: Non-random allocation patterns: statistical test

This item evaluates randomness of allocation using two approaches: a runs test, and a chi-squared test comparing the observed number of adjacent intervention runs with the number expected under random allocation. A statistically significant result (p < 0.05) from either test may be indicative of an issue with randomisation.

```{r}
item_5_2 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "5.2",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_5_2, row.names = FALSE)
item_5_2_test <- result[["detail_tables"]][["5.2"]]
if(!is.null(item_5_2_test)) {
  knitr::kable(item_5_2_test, row.names = FALSE)
}
```

**Item 5.3**: Unexpected imbalance in randomisation day of week.

The table below reports two chi-squared tests: one assessing whether randomisation is distributed evenly across weekdays overall, and one assessing whether randomisation day differs by intervention group. The graph below shows the number of participants randomised on each day of the week by group. We expect numbers to be balanced between groups for each weekday, and fewer enrolments on the weekend for non-urgent interventions. Note: the graph will only appear when the date of randomisation is provided.
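As an illustration of the first of these tests, the sketch below counts toy randomisation dates by day of the week and tests them against an even spread with `chisq.test`. The dates are invented, the day of the week is taken from `as.POSIXlt()` to avoid locale-dependent day names, and the approach is an assumption about the general technique rather than the package's own code.

```{r, eval=FALSE}
# Toy dates (hypothetical): ten full weeks starting on a Monday
dates <- seq(as.Date("2020-01-06"), by = "day", length.out = 70)

# Day of week as a number: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
day_number <- as.POSIXlt(dates)$wday

# Keep weekday randomisations only, as in a trial that never enrols on weekends
weekday_only <- dates[!(day_number %in% c(0, 6))]

# Count per day, keeping empty days so all seven categories enter the test
counts <- table(factor(as.POSIXlt(weekday_only)$wday, levels = 0:6))

# Goodness-of-fit test against equal probability for each of the seven days
even_test <- chisq.test(counts)
counts
even_test$p.value
```

Here the test rejects an even spread simply because no one was randomised on a weekend, which is why a significant result needs to be read against how the trial actually recruited.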
```{r}
item_5_3 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "5.3",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_5_3, row.names = FALSE)
item_5_3_test <- result[["detail_tables"]][["5.3"]]
if(!is.null(item_5_3_test)) {
  knitr::kable(item_5_3_test, row.names = FALSE)
}
if("Days" %in% names(result[["images"]])) {
  result[["images"]][["Days"]]
}
```

### Domain 6: Internal inconsistencies

**Item 6.1**: Inconsistent or illogical values across variables within individual participants.

Derive logic rules for each variable to be collected, e.g. date of hospital discharge = date of admission + days in hospital; if number of transfusions ≥1, then any transfusion = yes. Incorporate these rules into the package so that any breaches are displayed in the output.

```{r}
item_6_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "6.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_6_1, row.names = FALSE)
```

### Domain 7: External inconsistencies

**Item 7.1**: IPD do not correspond to publications or reports.

The table below shows summary statistics for each variable provided in the IPD data set, e.g. mean, median and range. Manually cross‐check these against any published trial reports, including appendices and supplements. Record any inconsistencies identified, for example, discrepancies in summary values between the IPD and the publication, inclusion of participants in the IPD who do not meet the eligibility criteria in the publication, or published variables that are missing from the IPD data set. If data are provided for excluded participants, check whether the reasons for exclusion are consistent with the publication.
```{r}
item_7_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "7.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_7_1, row.names = FALSE)
if(!is.null(result[["summary_table"]])) {
  result[["summary_table"]]
}
```

### Domain 8: Plausibility of data

**Item 8.1**: Too few missing data or missing data are overly similar between groups.

The table below shows missingness by intervention group for outcome variables, including the percentage missing in each group.

```{r}
item_8_1 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "8.1",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_8_1, row.names = FALSE)
if(!is.null(result[["detail_tables"]][["8.1"]])) knitr::kable(result[["detail_tables"]][["8.1"]], row.names = FALSE)
```

**Item 8.2**: Implausible event rates: outcomes and demographics.

The table below shows events and totals for dichotomous baseline variables and dichotomous common and rare outcomes, split by intervention group.

```{r}
item_8_2 <- result[["check_table"]][
  result[["check_table"]][["ItemNumber"]] == "8.2",
  c("Item description", "Status", "Details"),
  drop = FALSE
]
knitr::kable(item_8_2, row.names = FALSE)
if(!is.null(result[["detail_tables"]][["8.2"]])) knitr::kable(result[["detail_tables"]][["8.2"]], row.names = FALSE)
```

## Computing Environment

This vignette was executed on the following computing system:

```{r}
sessionInfo()
```