The package visStatistics
enables rapid visualization
and statistical analysis of raw data by automatically selecting the
hypothesis test with the highest statistical power to evaluate the
relationship between a response (varsample
) and a feature
(varfactor
) within a data.frame
. Its core
function, visstat()
, generates a graph showing key
statistics from the selected test and returns the corresponding test
results, including any post-hoc analysis.
A minimal function call has the form
visstat(dataframe, varsample, varfactor)
.
The input data.frame
must be column-based, and the
arguments varsample
and varfactor
must be
character strings naming columns of the data.frame
. The
significance level \(\alpha\), used
throughout for hypothesis testing, is defined as
1 - conf.level
, where conf.level
is a
user-controllable argument (defaulting to 0.95
). Data of
class "numeric"
or "integer"
are referred to
in the remainder of this vignette as numerical, while data of class
"factor"
are referred to as categorical.
The choice of statistical tests performed by the function
visstat()
depends on whether the data are numerical or
categorical, the number of levels in the categorical variable, and the
distribution of the data. The function prioritizes interpretable visual
output and tests that remain valid under the assumptions evaluated
applying the following decision logic:
When the response is numerical and the predictor is categorical,
test of central tendencies are performed. If the categorical predictor
has two levels, Welch’s t-test (t.test()
) is used, if both
groups have more than 30 observations, relying on the central limit
theorem to justify its application regardless of underlying normality
(Lumley et al. 2002). For smaller samples,
group-wise normality is assessed using the Shapiro-Wilk test
(shapiro.test()
). If both groups return p-values greater
than \(\alpha\), Welch’s t-test is
applied; otherwise, the Wilcoxon rank-sum test
(wilcox.test()
) is used. For predictors with more than two
levels, an ANOVA model (aov()
) is initially fitted.
Residual normality is evaluated using both the Shapiro-Wilk test
(shapiro.test()
) and Anderson-Darling test
(ad.test()
), and normality is assumed if either test yields
\(p > \alpha\). If the residuals are
considered normal, Bartlett’s test (bartlett.test()
) is
used to assess homoscedasticity. When variances are homogeneous (\(p > \alpha\)), ANOVA is applied with
Tukey’s HSD (TukeyHSD()
) for post-hoc comparison. If
variances differ significantly (\(p \le
\alpha\)), Welch’s one-way test (oneway.test()
) is
used, also followed by Tukey’s HSD. If residuals are not normally
distributed according to both tests (\(p \le
\alpha\)), the Kruskal-Wallis test (kruskal.test()
)
is selected, followed by pairwise Wilcoxon tests
(pairwise.wilcox.test()
).
When both the response and predictor are numerical, a simple
linear regression model (lm()
) is fitted and analysed in
detail, including residual diagnostics, formal tests, and the plotting
of fitted values with confidence bands. Note that only one explanatory
variable is allowed, as the function is designed for two-dimensional
visualisation.
In the case of two categorical variables, visstat()
tests the null hypothesis that the predictor and response variables are
independent using either chisq.test()
or
fisher.test()
. The choice of test is based on Cochran’s
rule (Cochran 1954), which advises that
the \(\chi^2\) approximation is
reliable only if no expected cell count is zero and no more than 20
percent of cells have expected counts below 5.
Note: Except for the user-adjustable conf.level
parameter, all statistical tests are applied using their default
settings from the corresponding base R functions (e.g.,
t.test()
). As a consequence, paired tests are not currently
supported. Furthermore, since the main purpose of this package is to
visualize statistical test results, only simple linear regression is
implemented.
If the feature consists of class
“factor
”
with two or more levels and the response consists is of
class
“numeric
” or “integer
”
(both having mode “numeric
”), statistical tests are applied
to compare the central tendencies across groups. This section describes
the conditions under which parametric and non-parametric tests are
chosen, based on the response type, the number of factor levels, and the
underlying distributional assumptions. A graphical overview of the
decision logic used is provided in below figure.
Decision tree used to select the appropriate statistical test for a categorical predictor and numerical response, based on the number of factor levels, normality and homoscedasticity.
When the feature variable has exactly two levels, Welch’s t-test is used to compare group means if the assumption of normality is satisfied. Otherwise, the Wilcoxon rank sum test is applied.
Welch’s t-test evaluates the null hypothesis that the two groups have equal means. Unlike Student’s t-test, which assumes equal variances (homoscedasticity), Welch’s test does not require this assumption. Welch’s test exhibits minimal loss of robustness even when Student’s t-test assumptions are satisfied (Moser and Stevens 1992; Delacre et al. 2017). Therefore, Student’s t-test is not implemented.
The Wilcoxon rank sum test, in contrast, is a non-parametric method that compares two distributions without requiring normality.
The function selects between these tests as follows. If both groups
have more than 30 observations, Welch’s t-test (t.test()
)
is always performed. For large samples, the test remains robust even
when normality is not strictly met (Rasch,
Kubinger, and Moder 2011; Lumley et al. 2002).
If either group has fewer than 30 observations, the function applies
the Shapiro-Wilk test (shapiro.test()
) to each group. If
both \(p\)-values exceed the
significance level \(\alpha = 1 -\)
conf.level
, Welch’s t-test is applied. If at least one
\(p\)-value falls below \(\alpha\), the Wilcoxon rank sum test
(wilcox.test()
) is used instead.
The graphical output consists of box plots overlaid with jittered
points to display individual observations. When Welch’s t-test is
applied, the function includes confidence intervals based on the
user-specified conf.level
. The function returns a list
containing the results of the applied test and the summary statistics
used to construct the plot.
As an example, we use the Motor Trend Car Road Tests data
set (mtcars
), which contains 32 observations. In the
example below, mpg
denotes miles per (US) gallon, and
am
represents the transmission type (0
=
automatic, 1
= manual).
Increasing the confidence level conf.level
from the
default 0.95 to 0.99 results in wider confidence intervals, as a higher
confidence level requires more conservative bounds to ensure that the
interval includes the true parameter value with greater certainty.
mtcars$am <- as.factor(mtcars$am)
t_test_statistics_99 <- visstat(mtcars, "mpg", "am", conf.level = 0.99)
grades_gender <- data.frame(
sex = as.factor(c(rep("girl", 21), rep("boy", 23))),
grade = c(
19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4,
20.3, 9.3, 11.3, 18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.0,
16.5, 5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6, 17.3, 19.9, 4.4, 2.1
)
)
wilcoxon_statistics <- visstat(grades_gender, "grade", "sex")
If the feature is of class
“factor
” with
more than two levels and the response is of mode
“numeric
”, visstat()
initially attempts an
analysis of variance (ANOVA). ANOVA is performed only if both of the
following null hypotheses cannot be rejected at the specified
conf.level
:
If only the assumption of normality is satisfied,
visstat()
applies Welch’s one-way test
(oneway.test()
). If the normality assumption is violated, a
Kruskal-W allis test (kruskal.test()
) is used instead.
These assumptions are tested internally using the
visAnovaAssumptions()
function.
The visAnovaAssumptions()
function assesses the
normality of standardized residuals from the ANOVA fit using both the
Shapiro-W ilk test (shapiro.test()
) and the Anderson-D
arling test (ad.test()
). Normality is assumed if at least
one of the two tests yields a \(p\)-value greater than \(1 -\) conf.level
.
The function generates two diagnostic plots:
- a scatterplot of the standardized residuals against the fitted means
of the linear model for each level of the feature
(varfactor
), and
- a Q-Q plot of the standardized residuals.
Both aov()
and oneway.test()
assess whether
two or more samples drawn from normal distributions have the same mean.
While aov()
assumes homogeneity of variances across groups,
oneway.test()
does not require the variances to be
equal.
Homoscedasticity is assessed using
Bartletts test (
bartlett.test()`), which tests the null
hypothesis that the variances across all levels of the grouping variable
are equal.
Depending on the \(p\)-value
returned by bartlett.test()
, the corresponding test is
selected and its \(p\)-value is
displayed in the figure title:
bartlett.test()
is greater than \(1 -\) conf.level
, we assume
homogeneity of variances across groups and report the \(p\)-value from the ANOVA
(aov()
).oneway.test()
).Simple pairwise comparisons of group means following an analysis of variance increase the probability of incorrectly declaring a significant difference when, in fact, there is none .
This inflation of error is quantified by the family-wise error rate, which refers to the probability of making at least one Type I error, that is, falsely rejecting the null hypothesis across all pairwise comparisons.
To control it, visstat()
performs post-hoc analysis
using Tukey’s Honestly Significant Difference (HSD) test and displays
Sidak-corrected confidence intervals.
The visstat()
function controls the probability of a
Type I error by applying Tukey`s Honestly Significant Differences
procedure, as implemented in TukeyHSD()
.
Based on the specified confidence level (conf.level
), it
constructs a set of confidence intervals for all pairwise differences
between factor level means. A significant difference between two means
is indicated when the corresponding confidence interval does not include
zero.
The function returns both the HSD-adjusted \(p\)-values and the associated confidence
intervals for all pairwise comparisons.
In the graphical output for the one-way test and ANOVA, green letters
displayed below each group summarize the results of the Tukey HSD
post-hoc test: Two groups are considered significantly different if they
are assigned different letters, indicating a Tukey HSD-adjusted \(p\)-value smaller than \(\alpha = 1 -\) conf.level
.
Tukey`s HSD procedure is based on pairwise comparisons of the differences between the means at each factor level and produces a set of corresponding confidence intervals. The Sidak procedure, on the other hand, addresses the problems of a type I error by lowering the acceptable probability of a type I error for all comparisons of the levels of the independent, categorical variable.
The Sidak corrected acceptable probability of error (Šidák 1967) is defined as \(\alpha_{Sidak}=(1\)-conf.int
\()^{1/M}\), where \(M=\frac{n\cdot (n-1)}{2}\) is the number of
pairwise comparisons of the \(n\)
levels of the categorical variable.
In the graphical display of One-way test and ANOVA,
visstat()
displays both the conf.level
\(\cdot\; 100 \%\)-confidence intervals
alongside the larger, Sidak-corrected \((1-\alpha_{Sidak})\cdot
100\;\%\)-confidence intervals.
Note that the current structure of visstat()
does not
allow the study of interactions between the different levels of an
independent variable.
If the \(p\)-value from the
Shapiro-Wilk test (shapiro.test()
) applied to the
standardized residuals is smaller than the significance level \(1 -\) conf.level
,
visstat()
selects a non-parametric alternative: the
Kruskal-Wallis rank sum test.
The function kruskal.test()
tests the null hypothesis
that the group medians are equal across all levels of the categorical
feature.
pairwise.wilcox.test()
As post-hoc analysis following the Kruskal-Wallis test,
visstat()
applies the pairwise Wilcoxon rank sum test using
pairwise.wilcox.test()
, with Holm-s method as the default
adjustment for multiple comparisons (Holm
1979).
If the Holm-adjusted \(p\)-value for
a given pair of groups is smaller than the significance level \(1 -\) conf.level
, the green
letters displayed below the corresponding box plots will differ.
Otherwise, the groups are considered not significantly different.
Apart from the multiple comparison adjustment, the graphical representation of the Kruskal-Wallis result is similar to that used for the Wilcoxon rank sum test.
The function returns a list containing the test statistic from the Kruskal-Wallis rank sum test, along with the Holm-adjusted \(p\)-values for all pairwise comparisons.
The npk
dataset reports the yield of peas (in pounds per
block) from an agricultural experiment conducted on six blocks. In this
experiment, the application of three different fertilisers - nitrogen
(N), phosphate (P), and potassium (K) - was varied systematically. Each
block received either none, one, two, or all three of the
fertilisers,
Normality of residuals is supported by graphical diagnostics (scatter
plot of standardized residuals, Q-Q plot) and formal tests (Shapiro-Wilk
and Anderson-Darling, both with \(p >
\alpha\)). However, homogeneity of variances is not supported at
the given confidence level (\(p <
\alpha\), bartlett.test()
), so the \(p\)-value from the variance-robust
oneway.test()
is reported. Post-hoc analysis with
TukeyHSD()
shows no significant yield differences between
blocks, as all share the same group label (e.g., all green letters).
The InsectSprays
dataset reports insect counts from
agricultural experimental units treated with six different insecticides.
To stabilise the variance in counts, we apply a square root
transformation to the response variable.
insect_sprays_tr <- InsectSprays
insect_sprays_tr$count_sqrt <- sqrt(InsectSprays$count)
visstat(insect_sprays_tr, "count_sqrt", "spray")
After the transformation, the homogeneity of variances can be assumed
(\(p> \alpha\) as calculated with
the bartlett.test()
), and the \(p\)-value of the aov()
is
displayed.
The iris
dataset contains petal width measurements (in
cm) for three different iris species.
In this example, scatter plots of the standardised residuals and the Q-Q plot suggest that the residuals are not normally distributed. This is confirmed by very small \(p\)-values from both the Shapiro-Wilk and Anderson-Darling tests.
If both \(p\)-values are below the
significance level \(\alpha = 1
-\)conf.level, visstat()
switches to the
non-parametric kruskal.test()
. Post-hoc analysis using
pairwise.wilcox.test()
shows significant differences in
petal width between all three species, as indicated by distinct group
labels (all green letters differ).
If the feature varfactor
and the response
varsample
have only one level of type
numerical
or integer
, visstat()
performs a simple linear regression. Note that multiple linear
regression is not implemented, as the main focus of
visstat()
is the visualisation of statistical tests.
If the feature varfactor
and the response
varsample
contain only one level each and both are of type
numeric
or integer
, visstat()
performs a simple linear regression. Note that multiple linear
regression is not implemented, as the main focus of
visstat()
is on the visualisation of statistical tests.
visstat()
checks the normality of the standardised
residuals from lm()
both graphically and using the
Shapiro-Wilk and Anderson-Darling tests. If the \(p\)-values for the null hypothesis of
normally distributed residuals from both tests are smaller than \(1 -\)conf.int
, the title of
the residual plot will display the message: “Requirement of normally
distributed residuals not met”.
Regardless of the result of the residual analysis,
visstat()
proceeds to perform the regression. The title of
the graphical output indicates the chosen confidence level
(conf.level
), the estimated regression parameters with
their confidence intervals and \(p\)-values, and the adjusted \(R^2\). The plot displays the raw data, the
fitted regression line, and both the confidence and prediction bands
corresponding to the specified conf.level
.
visstat()
returns a list containing the regression test
statistics, the \(p\)-values from the
normality tests of the standardised residuals, and the pointwise
estimates of the confidence and prediction bands.
cars
The cars
dataset reports the speed of cars in miles per
hour (speed
) and the stopping distance in feet
(dist
).
Increasing the confidence level conf.level
from the
default 0.95 to 0.99 results in wider confidence and prediction
bands:
\(p\)-values greater than
conf.level
in both Anderson-Darling normality test and the
Shapiro-Wilk test of the standardised residuals indicate that the
normality assumption of the residuals underlying the linear regression
is met.
trees
The trees
dataset provides measurements of the diameter
(Girth
, in inches), Height
(in feet), and
Volume
(in cubic feet) of black cherry trees. In this
example, we choose Volume
as the response and
Girth
as the feature.
Both the graphical analysis of the standardised residuals and \(p\)-values greater than \(1 -\)conf.level in the Anderson-Darling and
Shapiro-Wilk tests suggest that the assumption of normally distributed
residuals is met. Furthermore, the linear regression model explains 93%
of the total variance in the response variable Volume
.
When both the feature varfactor
and the response
varsample
are categorical variables (i.e., of type
factor
), visstat()
tests the null hypothesis
that the two variables are independent. Categorical data are typically
represented as contingency tables, which cross-tabulate the observed
frequencies for each combination of factor levels. Based on these
observed frequencies, visstat()
computes the expected
frequencies under the null hypothesis of independence.
If the expected frequencies are sufficiently large - specifically, if
at least 80% of the cells have expected counts greater than 5 and no
expected count is zero - the function uses Pearson’s \({\chi}^2\)-test
(chisq.test()
). Otherwise, it switches to Fisher’s exact
test (fisher.test()
) (Cochran
1954). For 2-by-2 contingency tables, Yate’s a continuity
correction (Yates 1934) is applied to
Pearson’s \({\chi}^2\)-test.
For both \({\chi}^2\)- and Fisher
Test ‘visstat()’ displays a grouped column plot that includes the
respective test’s \(p\)-value in the
title, as well as a mosaic plot showing Pearson residuals (see the
documentation of mosaic()
in the vcd
package
for details).
All features of visstat()
will be illustrated using the
built-in data set HairEyeColor
.
data.frame
visstat()
requires input in the form of a column-based
data.frame
. Contingency tables can be converted to this
format using the helper function counts_to_cases()
.
counts_to_cases()
transforms the contingency table
HairEyeColor
into data.frame
named
HairEyeColorDataFrame
.
hair_eye_color_df <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(hair_eye_color_df, "Hair", "Eye")
For 2-by-2 contingency tables, visstat()
applies Yate’s
continuity correction (Yates 1934) when
computing the \(\chi^2\) test
statistic. In the following example, we restrict the data to
participants with either black or brown hair and either brown or blue
eyes, resulting in a 2-by-2 contingency table.
hair_black_brown_eyes_brown_blue <- HairEyeColor[1:2, 1:2, ]
# Transform to data frame
hair_black_brown_eyes_brown_blue_df <- counts_to_cases(as.data.frame(hair_black_brown_eyes_brown_blue))
# Chi-squared test
visstat(hair_black_brown_eyes_brown_blue_df, "Hair", "Eye")
Again, we extract a 2-by-2 contingency table from the full dataset, this time keeping only male participants with black or brown hair and hazel or green eyes.
Pearson’s \({\chi}^2\) test applied to this table would yield an expected frequency less than 5 in one of the four cells (25% of all cells), which violates the requirement that at least 80% of the expected frequencies must be 5 or greater (Cochran 1954).
Therefore, visstat()
automatically selects Fisher’s
exact test instead.
hair_eye_color_male <- HairEyeColor[, , 1]
# Slice out a 2 by 2 contingency table
black_brown_hazel_green_male <- hair_eye_color_male[1:2, 3:4]
# Transform to data frame
black_brown_hazel_green_male <- counts_to_cases(as.data.frame(black_brown_hazel_green_male))
# Fisher test
fisher_stats <- visstat(black_brown_hazel_green_male, "Hair", "Eye")
The generated graphics can be saved in any file format supported by
Cairo()
, including “png”, “jpeg”, “pdf”, “svg”, “ps”, and
“tiff”. In the following example, we store the graphics in PNG format in
the plotDirectory
tempdir()
. The file names
reflect the statistical test used and the variable names involved.
visstat(black_brown_hazel_green_male, "Hair", "Eye",
graphicsoutput = "png", plotDirectory =
tempdir()
)
Remove the graphical output from plotDirectory
:
When the response is numerical and the feature is categorical, test of central tendencies are selected:
t.test()
, wilcox.test()
,
aov()
,
oneway.test()
,kruskal.test()
shapiro.test()
and ad.test()
bartlett.test()
TukeyHSD()
(for aov()
and
oneway.test()
)pairwise.wilcox.test()
(for
kruskal.test()
)When both the response and feature are numerical, a simple linear regression model is fitted:
lm()
When both variables are categorical, visstat()
tests the
null hypothesis of independence using one of the following:
chisq.test()
(default for larger samples)fisher.test()
(used for small expected cell counts
based on Cochran’s rule)