---
output:
  html_vignette:
    toc: true
    number_sections: true
    toc_depth: 3
title: "visStatistics"
author: "Sabine Schilling"
date: "`r Sys.Date()`"
bibliography: visstat.bib
vignette: >
  %\VignetteIndexEntry{visStatistics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options:
  markdown:
    wrap: sentence
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 5,
  out.width = "100%"
)
```

```{r setup}
library(visStatistics)
```

# Abstract {.unnumbered}

The package `visStatistics` enables rapid visualization and statistical analysis of raw data by automatically selecting the hypothesis test with the highest statistical power to evaluate the relationship between a response (`varsample`) and a feature (`varfactor`) within a `data.frame`. Its core function, `visstat()`, generates a graph showing key statistics from the selected test and returns the corresponding test results, including any post-hoc analysis. A minimal function call has the form `visstat(dataframe, varsample, varfactor)`.

The input `data.frame` must be column-based, and the arguments `varsample` and `varfactor` must be character strings naming columns of the `data.frame`. The significance level $\alpha$, used throughout for hypothesis testing, is defined as `1 - conf.level`, where `conf.level` is a user-controllable argument (defaulting to `0.95`). Data of class `"numeric"` or `"integer"` are referred to in the remainder of this vignette as numerical, while data of class `"factor"` are referred to as categorical.

The choice of statistical tests performed by the function `visstat()` depends on whether the data are numerical or categorical, the number of levels in the categorical variable, and the distribution of the data. The function prioritizes interpretable visual output and tests that remain valid under the assumptions evaluated, applying the following decision logic:

(1) When the response is numerical and the predictor is categorical, tests of central tendency are performed. If the categorical predictor has two levels and both groups have more than 30 observations, Welch's t-test (`t.test()`) is used, relying on the central limit theorem to justify its application regardless of underlying normality [@Lumley2002dsa]. For smaller samples, group-wise normality is assessed using the Shapiro-Wilk test (`shapiro.test()`). If both groups return p-values greater than $\alpha$, Welch's t-test is applied; otherwise, the Wilcoxon rank-sum test (`wilcox.test()`) is used. For predictors with more than two levels, an ANOVA model (`aov()`) is initially fitted. Residual normality is evaluated using both the Shapiro-Wilk test (`shapiro.test()`) and the Anderson-Darling test (`ad.test()`), and normality is assumed if either test yields $p > \alpha$. If the residuals are considered normal, Bartlett's test (`bartlett.test()`) is used to assess homoscedasticity. When variances are homogeneous ($p > \alpha$), ANOVA is applied with Tukey's HSD (`TukeyHSD()`) for post-hoc comparison. If variances differ significantly ($p \le \alpha$), Welch's one-way test (`oneway.test()`) is used, also followed by Tukey's HSD. If residuals are not normally distributed according to both tests ($p \le \alpha$), the Kruskal-Wallis test (`kruskal.test()`) is selected, followed by pairwise Wilcoxon tests (`pairwise.wilcox.test()`).
(2) When both the response and predictor are numerical, a simple linear regression model (`lm()`) is fitted and analysed in detail, including residual diagnostics, formal tests, and the plotting of fitted values with confidence bands. Note that only one explanatory variable is allowed, as the function is designed for two-dimensional visualisation.

(3) In the case of two categorical variables, `visstat()` tests the null hypothesis that the predictor and response variables are independent using either `chisq.test()` or `fisher.test()`. The choice of test is based on Cochran's rule [@Cochran], which advises that the $\chi^2$ approximation is reliable only if no expected cell count is zero and no more than 20 percent of cells have expected counts below 5.

Note: Except for the user-adjustable `conf.level` parameter, all statistical tests are applied using their default settings from the corresponding base R functions (e.g., `t.test()`). As a consequence, paired tests are not currently supported. Furthermore, since the main purpose of this package is to visualize statistical test results, only simple linear regression is implemented.

# Decision logic for selecting tests of central tendency {#sec-decisiontree}

If the feature is of `class` "`factor`" with two or more levels and the response is of `class` "`numeric`" or "`integer`" (both having `mode` "`numeric`"), statistical tests are applied to compare the central tendencies across groups. This section describes the conditions under which parametric and non-parametric tests are chosen, based on the response type, the number of factor levels, and the underlying distributional assumptions. A graphical overview of the decision logic used is provided in the figure below.

```{r fig-decision-tree, fig.cap="Decision tree used to select the appropriate statistical test for a categorical predictor and numerical response, based on the number of factor levels, normality and homoscedasticity.", out.width="100%"}
knitr::include_graphics("../man/figures/decision_tree.png")
```

## Two sample tests: Welch's t-test or Wilcoxon rank sum test

When the feature variable has exactly two levels, Welch's t-test is used to compare group means if the assumption of normality is satisfied. Otherwise, the Wilcoxon rank sum test is applied.

Welch's t-test evaluates the null hypothesis that the two groups have equal means. Unlike Student's t-test, which assumes equal variances (homoscedasticity), Welch's test does not require this assumption. Welch's test exhibits minimal loss of robustness even when Student's t-test assumptions are satisfied [@Moser:1992fp; @Delacre:2017iv]. Therefore, Student's t-test is not implemented. The Wilcoxon rank sum test, in contrast, is a non-parametric method that compares two distributions without requiring normality.

The function selects between these tests as follows. If both groups have more than 30 observations, Welch's t-test (`t.test()`) is always performed. For large samples, the test remains robust even when normality is not strictly met [@Rasch:2011vl; @Lumley2002dsa]. If either group has 30 or fewer observations, the function applies the Shapiro-Wilk test (`shapiro.test()`) to each group. If both $p$-values exceed the significance level $\alpha = 1 -$ `conf.level`, Welch's t-test is applied. If at least one $p$-value falls below $\alpha$, the Wilcoxon rank sum test (`wilcox.test()`) is used instead.
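This selection rule can be sketched with plain base R calls. The following chunk is a simplified illustration, not the code used inside `visstat()`; the helper `choose_two_sample_test()` is a hypothetical name introduced only for this example.

```{r}
# Simplified sketch of the two-sample selection rule (not visstat()'s
# internal code): Welch's t-test for large or normal-looking groups,
# otherwise the Wilcoxon rank sum test.
choose_two_sample_test <- function(x, y, conf.level = 0.95) {
  alpha <- 1 - conf.level
  if (min(length(x), length(y)) > 30) {
    return(t.test(x, y, conf.level = conf.level)) # large groups: Welch's t-test
  }
  normal <- shapiro.test(x)$p.value > alpha && shapiro.test(y)$p.value > alpha
  if (normal) {
    t.test(x, y, conf.level = conf.level)    # both groups look normal
  } else {
    wilcox.test(x, y, conf.level = conf.level) # otherwise: Wilcoxon rank sum
  }
}
# Example: transmission groups of the mtcars data used in the examples below
with(mtcars, choose_two_sample_test(mpg[am == 0], mpg[am == 1]))
```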
The graphical output consists of box plots overlaid with jittered points to display individual observations. When Welch's t-test is applied, the plot also includes confidence intervals based on the user-specified `conf.level`. The function returns a list containing the results of the applied test and the summary statistics used to construct the plot.

### Examples

#### Welch's t-test {.unnumbered}

As an example, we use the *Motor Trend Car Road Tests* data set (`mtcars`), which contains 32 observations. In the example below, `mpg` denotes miles per (US) gallon, and `am` represents the transmission type (`0` = automatic, `1` = manual).

```{r}
mtcars$am <- as.factor(mtcars$am)
t_test_statistics <- visstat(mtcars, "mpg", "am")
```

Increasing the confidence level `conf.level` from the default 0.95 to 0.99 results in wider confidence intervals, as a higher confidence level requires more conservative bounds to ensure that the interval includes the true parameter value with greater certainty.

```{r}
mtcars$am <- as.factor(mtcars$am)
t_test_statistics_99 <- visstat(mtcars, "mpg", "am", conf.level = 0.99)
```

#### Wilcoxon rank sum test {.unnumbered}

```{r}
grades_gender <- data.frame(
  sex = as.factor(c(rep("girl", 21), rep("boy", 23))),
  grade = c(
    19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4, 20.3, 9.3, 11.3, 18.2, 17.5,
    10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.0, 16.5, 5.1, 15.3, 17.1, 14.8,
    15.4, 14.4, 7.5, 15.5, 6.0, 17.4, 7.3, 14.3, 13.5, 8.0, 19.5, 13.4,
    17.9, 17.7, 16.4, 15.6, 17.3, 19.9, 4.4, 2.1
  )
)
wilcoxon_statistics <- visstat(grades_gender, "grade", "sex")
```

## One-way test, ANOVA or Kruskal-Wallis test

If the feature is of `class` "`factor`" with more than two levels and the response is of `mode` "`numeric`", `visstat()` initially attempts an analysis of variance (ANOVA). ANOVA is performed only if both of the following null hypotheses cannot be rejected at the specified `conf.level`:

1. The standardized residuals follow a normal distribution, and
2. The residuals exhibit homoscedasticity (equal variances across groups).

If only the assumption of normality is satisfied, `visstat()` applies Welch's one-way test (`oneway.test()`). If the normality assumption is violated, a Kruskal-Wallis test (`kruskal.test()`) is used instead. These assumptions are tested internally using the `visAnovaAssumptions()` function.

### Checking the ANOVA assumptions

#### Residual analysis {.unnumbered}

The `visAnovaAssumptions()` function assesses the normality of standardized residuals from the ANOVA fit using both the Shapiro-Wilk test (`shapiro.test()`) and the Anderson-Darling test (`ad.test()`). Normality is assumed if at least one of the two tests yields a $p$-value greater than $1 -$ `conf.level`.

The function generates two diagnostic plots:

- a scatter plot of the standardized residuals against the fitted means of the linear model for each level of the feature (`varfactor`), and
- a Q-Q plot of the standardized residuals.

#### Homoscedasticity: homogeneity of variances across levels: Bartlett test {.unnumbered}

Both `aov()` and `oneway.test()` assess whether two or more samples drawn from normal distributions have the same mean. While `aov()` assumes homogeneity of variances across groups, `oneway.test()` does not require the variances to be equal. Homoscedasticity is assessed using Bartlett's test (`bartlett.test()`), which tests the null hypothesis that the variances across all levels of the grouping variable are equal.
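The same checks can be reproduced with the underlying base R (and `nortest`) calls. The following chunk is a minimal sketch of this logic, not the actual implementation of `visAnovaAssumptions()`; it assumes the `nortest` package is available for `ad.test()` and uses the `npk` data analysed in the examples below.

```{r}
# Minimal sketch of the assumption checks described above
fit <- aov(yield ~ block, data = npk)
res <- rstandard(fit)                     # standardized residuals of the fit
shapiro.test(res)                         # Shapiro-Wilk normality test
if (requireNamespace("nortest", quietly = TRUE)) {
  nortest::ad.test(res)                   # Anderson-Darling normality test
}
bartlett.test(yield ~ block, data = npk)  # homoscedasticity across blocks
```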
### One-way test and ANOVA

Depending on the $p$-value returned by `bartlett.test()`, the corresponding test is selected and its $p$-value is displayed in the figure title:

- If the $p$-value of `bartlett.test()` is greater than $1 -$ `conf.level`, we assume homogeneity of variances across groups and report the $p$-value from the ANOVA (`aov()`).
- Otherwise, homoscedasticity cannot be assumed, and the function reports the $p$-value from Welch's one-way test (`oneway.test()`).

#### Post-hoc analysis: Tukey's Honestly Significant Differences (HSD) and Sidak-corrected confidence intervals {.unnumbered}

Simple pairwise comparisons of group means following an analysis of variance increase the probability of incorrectly declaring a significant difference when, in fact, there is none. This inflation of error is quantified by the family-wise error rate, which refers to the probability of making at least one Type I error, that is, falsely rejecting the null hypothesis, across all pairwise comparisons. To control it, `visstat()` performs post-hoc analysis using Tukey's Honestly Significant Difference (HSD) test and displays Sidak-corrected confidence intervals.

The `visstat()` function controls the probability of a Type I error by applying Tukey's Honestly Significant Differences procedure, as implemented in `TukeyHSD()`. Based on the specified confidence level (`conf.level`), it constructs a set of confidence intervals for all pairwise differences between factor level means. A significant difference between two means is indicated when the corresponding confidence interval does not include zero. The function returns both the HSD-adjusted $p$-values and the associated confidence intervals for all pairwise comparisons.

In the graphical output for the one-way test and ANOVA, green letters displayed below each group summarize the results of the Tukey HSD post-hoc test: two groups are considered significantly different if they are assigned different letters, indicating a Tukey HSD-adjusted $p$-value smaller than $\alpha = 1 -$ `conf.level`.

Tukey's HSD procedure is based on pairwise comparisons of the differences between the means at each factor level and produces a set of corresponding confidence intervals. The Sidak procedure, on the other hand, addresses the inflation of the Type I error rate by lowering the acceptable probability of a Type I error for all pairwise comparisons of the levels of the independent, categorical variable. The Sidak-corrected acceptable probability of error [@Sidak] is defined as $\alpha_{Sidak} = 1 -$ `conf.level`$^{1/M}$, where $M=\frac{n\cdot (n-1)}{2}$ is the number of pairwise comparisons of the $n$ levels of the categorical variable. For example, with $n = 4$ factor levels there are $M = 6$ pairwise comparisons, and a `conf.level` of 0.95 yields $\alpha_{Sidak} = 1 - 0.95^{1/6} \approx 0.0085$. In the graphical display of the one-way test and ANOVA, `visstat()` shows the `conf.level` $\cdot\, 100\,\%$ confidence intervals alongside the wider, Sidak-corrected $(1-\alpha_{Sidak})\cdot 100\,\%$ confidence intervals.

Note that the current structure of `visstat()` does not allow the study of interactions between the different levels of an independent variable.

### Kruskal-Wallis test

If both the Shapiro-Wilk test (`shapiro.test()`) and the Anderson-Darling test (`ad.test()`) applied to the standardized residuals yield $p$-values smaller than the significance level $1 -$ `conf.level`, `visstat()` selects a non-parametric alternative: the Kruskal-Wallis rank sum test. The function `kruskal.test()` tests the null hypothesis that the group medians are equal across all levels of the categorical feature.
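As an illustration of this branch, the two underlying base R calls can be run directly. The following chunk is a minimal sketch, not `visstat()`'s internal code, using the `iris` data analysed in the examples below; the post-hoc step is the subject of the next subsection.

```{r}
# Minimal sketch of the non-parametric branch
kruskal.test(Petal.Width ~ Species, data = iris)
# Post-hoc pairwise comparisons with Holm adjustment (discussed next)
pairwise.wilcox.test(iris$Petal.Width, iris$Species, p.adjust.method = "holm")
```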
#### Post-hoc analysis: `pairwise.wilcox.test()` {.unnumbered}

As post-hoc analysis following the Kruskal-Wallis test, `visstat()` applies the pairwise Wilcoxon rank sum test using `pairwise.wilcox.test()`, with Holm's method as the default adjustment for multiple comparisons [@Holm1979ASS]. If the Holm-adjusted $p$-value for a given pair of groups is smaller than the significance level $1 -$ `conf.level`, the green letters displayed below the corresponding box plots will differ. Otherwise, the groups are considered not significantly different. Apart from the multiple comparison adjustment, the graphical representation of the Kruskal-Wallis result is similar to that used for the Wilcoxon rank sum test.

The function returns a list containing the test statistic from the Kruskal-Wallis rank sum test, along with the Holm-adjusted $p$-values for all pairwise comparisons.

### Examples

#### One-way test {.unnumbered}

The `npk` dataset reports the yield of peas (in pounds per plot) from an agricultural experiment conducted on six blocks. In this experiment, the application of three different fertilisers - nitrogen (N), phosphate (P), and potassium (K) - was varied systematically. Within each block, each plot received either none, one, two, or all three of the fertilisers.

```{r}
oneway_npk <- visstat(npk, "yield", "block")
```

Normality of residuals is supported by graphical diagnostics (scatter plot of standardized residuals, Q-Q plot) and formal tests (Shapiro-Wilk and Anderson-Darling, both with $p > \alpha$). However, homogeneity of variances is not supported at the given confidence level ($p < \alpha$ in `bartlett.test()`), so the $p$-value from the variance-robust `oneway.test()` is reported. Post-hoc analysis with `TukeyHSD()` shows no significant yield differences between blocks: all blocks share the same group label (the same green letter).

#### ANOVA {.unnumbered}

The `InsectSprays` dataset reports insect counts from agricultural experimental units treated with six different insecticides. To stabilise the variance of the counts, we apply a square root transformation to the response variable.

```{r}
insect_sprays_tr <- InsectSprays
insect_sprays_tr$count_sqrt <- sqrt(InsectSprays$count)
visstat(insect_sprays_tr, "count_sqrt", "spray")
```

After the transformation, homogeneity of variances can be assumed ($p > \alpha$ as calculated with `bartlett.test()`), and the $p$-value of the `aov()` is displayed.

#### Kruskal-Wallis rank sum test {.unnumbered}

The `iris` dataset contains petal width measurements (in cm) for three different iris species.

```{r}
visstat(iris, "Petal.Width", "Species")
```

In this example, the scatter plot of the standardised residuals and the Q-Q plot suggest that the residuals are not normally distributed. This is confirmed by very small $p$-values from both the Shapiro-Wilk and Anderson-Darling tests. Since both $p$-values lie below the significance level $\alpha = 1 -$ `conf.level`, `visstat()` switches to the non-parametric `kruskal.test()`. Post-hoc analysis using `pairwise.wilcox.test()` shows significant differences in petal width between all three species, as indicated by distinct group labels (all green letters differ).

# Simple linear regression

If the feature `varfactor` and the response `varsample` are both numerical, that is, of class `numeric` or `integer`, `visstat()` performs a simple linear regression. Note that multiple linear regression is **not** implemented, as the main focus of `visstat()` is the visualisation of statistical tests.
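`visstat()` obtains the fit with `lm()` and draws the corresponding confidence and prediction bands (described below). The following chunk is a minimal sketch, not the package's plotting code, of how such bands can be computed with `predict()` for the built-in `cars` data used in the examples.

```{r}
# Sketch: pointwise confidence and prediction bands for a simple linear model
fit <- lm(dist ~ speed, data = cars)
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 5))
predict(fit, newdata = grid, interval = "confidence", level = 0.95) # confidence band
predict(fit, newdata = grid, interval = "prediction", level = 0.95) # prediction band
```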
## Residual analysis

`visstat()` checks the normality of the standardised residuals from `lm()` both graphically and using the Shapiro-Wilk and Anderson-Darling tests. If the $p$-values for the null hypothesis of normally distributed residuals from both tests are smaller than $1 -$ `conf.level`, the title of the residual plot displays the message "Requirement of normally distributed residuals not met". Regardless of the result of the residual analysis, `visstat()` proceeds to perform the regression.

The title of the graphical output indicates the chosen confidence level (`conf.level`), the estimated regression parameters with their confidence intervals and $p$-values, and the adjusted $R^2$. The plot displays the raw data, the fitted regression line, and both the confidence and prediction bands corresponding to the specified `conf.level`.

`visstat()` returns a list containing the regression test statistics, the $p$-values from the normality tests of the standardised residuals, and the pointwise estimates of the confidence and prediction bands.

## Examples

### Data set: `cars` {.unnumbered}

The `cars` dataset reports the speed of cars in miles per hour (`speed`) and the stopping distance in feet (`dist`).

```{r}
linreg_cars <- visstat(cars, "dist", "speed")
```

Increasing the confidence level `conf.level` from the default 0.95 to 0.99 results in wider confidence and prediction bands:

```{r}
linreg_cars <- visstat(cars, "dist", "speed", conf.level = 0.99)
```

### Data set: `trees` {.unnumbered}

The `trees` dataset provides measurements of the diameter (`Girth`, in inches), `Height` (in feet), and `Volume` (in cubic feet) of black cherry trees. In this example, we choose `Volume` as the response and `Girth` as the feature.

```{r}
linreg_trees <- visstat(trees, "Volume", "Girth", conf.level = 0.9)
```

Both the graphical analysis of the standardised residuals and $p$-values greater than $\alpha = 1 -$ `conf.level` in the Anderson-Darling and Shapiro-Wilk tests suggest that the assumption of normally distributed residuals underlying the linear regression is met. Furthermore, the linear regression model explains 93% of the total variance in the response variable `Volume`.

# ${\chi}^2$- and Fisher Test

When both the feature `varfactor` and the response `varsample` are categorical variables (i.e., of type `factor`), `visstat()` tests the null hypothesis that the two variables are independent. Categorical data are typically represented as contingency tables, which cross-tabulate the observed frequencies for each combination of factor levels. Based on these observed frequencies, `visstat()` computes the expected frequencies under the null hypothesis of independence. If the expected frequencies are sufficiently large - specifically, if at least 80% of the cells have expected counts of 5 or more and no expected count is zero - the function uses Pearson's ${\chi}^2$-test (`chisq.test()`). Otherwise, it switches to Fisher's exact test (`fisher.test()`) [@Cochran].
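Cochran's rule can be checked directly from the expected counts returned by `chisq.test()`. The following chunk is a small sketch of such a check, not `visstat()`'s internal code, applied to the `HairEyeColor` table used below.

```{r}
# Expected counts under independence for the hair/eye margins of HairEyeColor
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))
expected <- chisq.test(hair_eye)$expected
expected
mean(expected >= 5) # share of cells with an expected count of at least 5
any(expected == 0)  # any empty expected cells?
```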
For 2-by-2 contingency tables, Yates's continuity correction [@Yates:1934] is applied to Pearson's ${\chi}^2$-test.

For both the ${\chi}^2$- and the Fisher test, `visstat()` displays a grouped column plot that includes the respective test's $p$-value in the title, as well as a mosaic plot showing Pearson residuals (see the documentation of `mosaic()` in the `vcd` package for details).

These features of `visstat()` are illustrated below using the built-in data set `HairEyeColor`.

## Data preparation: Transforming a contingency table to `data.frame`

`visstat()` requires input in the form of a column-based `data.frame`. Contingency tables can be converted to this format using the helper function `counts_to_cases()`. In the example below, `counts_to_cases()` transforms the contingency table `HairEyeColor` into a `data.frame` named `HairEyeColorDataFrame`.

```{r}
HairEyeColorDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))
```

## Pearson's ${\chi}^2$-test

```{r}
hair_eye_color_df <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(hair_eye_color_df, "Hair", "Eye")
```

### Pearson's ${\chi}^2$-test with Yates's continuity correction

For 2-by-2 contingency tables, `visstat()` applies Yates's continuity correction [@Yates:1934] when computing the $\chi^2$ test statistic. In the following example, we restrict the data to participants with either black or brown hair and either brown or blue eyes, resulting in a 2-by-2 contingency table.

```{r}
hair_black_brown_eyes_brown_blue <- HairEyeColor[1:2, 1:2, ]
# Transform to data frame
hair_black_brown_eyes_brown_blue_df <-
  counts_to_cases(as.data.frame(hair_black_brown_eyes_brown_blue))
# Chi-squared test
visstat(hair_black_brown_eyes_brown_blue_df, "Hair", "Eye")
```

## Fisher's exact test and mosaic plot with Pearson residuals

Again, we extract a 2-by-2 contingency table from the full dataset, this time keeping only male participants with black or brown hair and hazel or green eyes. Pearson's ${\chi}^2$-test applied to this table would yield an expected frequency below 5 in one of the four cells (25% of all cells), which violates the requirement that at least 80% of the expected frequencies must be 5 or greater [@Cochran]. Therefore, `visstat()` automatically selects Fisher's exact test instead.

```{r}
hair_eye_color_male <- HairEyeColor[, , 1]
# Slice out a 2 by 2 contingency table
black_brown_hazel_green_male <- hair_eye_color_male[1:2, 3:4]
# Transform to data frame
black_brown_hazel_green_male <- counts_to_cases(as.data.frame(black_brown_hazel_green_male))
# Fisher test
fisher_stats <- visstat(black_brown_hazel_green_male, "Hair", "Eye")
```

# Saving the graphical output

The generated graphics can be saved in any file format supported by `Cairo()`, including "png", "jpeg", "pdf", "svg", "ps", and "tiff". In the following example, we store the graphics in PNG format in the `plotDirectory` `tempdir()`. The file names reflect the statistical test used and the variable names involved.
```{r}
visstat(black_brown_hazel_green_male, "Hair", "Eye",
  graphicsoutput = "png", plotDirectory = tempdir()
)
```

Remove the graphical output from `plotDirectory`:

```{r}
file.remove(file.path(tempdir(), "chi_squared_or_fisher_Hair_Eye.png"))
file.remove(file.path(tempdir(), "mosaic_complete_Hair_Eye.png"))
```

# Implemented tests

## Numerical response \~ categorical feature

When the response is numerical and the feature is categorical, tests of central tendency are selected: `t.test()`, `wilcox.test()`, `aov()`, `oneway.test()`, `kruskal.test()`

### Normality assumption check

`shapiro.test()` and `ad.test()`

### Homoscedasticity assumption check

`bartlett.test()`

### Post-hoc tests

- `TukeyHSD()` (for `aov()` and `oneway.test()`)
- `pairwise.wilcox.test()` (for `kruskal.test()`)

## Numerical response \~ numerical feature

When both the response and feature are numerical, a simple linear regression model is fitted: `lm()`

## Categorical response \~ categorical predictor

When both variables are categorical, `visstat()` tests the null hypothesis of independence using one of the following:

- `chisq.test()` (default for larger samples)
- `fisher.test()` (used for small expected cell counts based on Cochran's rule)

# Bibliography