--- title: "1. Data: Summary Statistics with a Pivot Table" author: "David Gerbing" output: rmarkdown::html_vignette: toc: true toc_depth: 4 vignette: > %\VignetteIndexEntry{1. Data: Summary Statistics with a Pivot Table} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r include=FALSE} suppressPackageStartupMessages(library("lessR")) ``` ```{r, include=FALSE} knitr::opts_chunk$set(fig.width=3.5, fig.height=3) ``` ## Overview ### Descriptive Statistics Descriptive summarize characteristics of the sample data values for one or more variables. The __lessR__ `pivot()` function serves as a single source for a wide variety and types of descriptive statistics for one or more variables communicated in the form of a _pivot table_. Compute statistics either for the entire data set at once or separately for different groups of data. >_Aggregation_: From the values of categorical variables, form groups of data and then compute one or more statistics for each group. As an example of aggregation, compute the mean Years worked for each combination of the levels of _Gender_ and _Dept_ (department) of employment. To introduce the ease of use of `pivot()`, and to provide context for the following discussion, the function call listed below includes the parameter names (in red) for emphasis. However, if the parameters are entered in this order, listing their names is not necessary. Enter multiple values, such as for _Gender_ and _Dept_ as an R vector, defined such as with the base R combine `c()` function. The output of the analysis is shown later. ```{r pivot, dataTable, echo=FALSE, out.width='92%', fig.align='center', out.extra='style="border-style: none"'} knitr::include_graphics(system.file("img", "pivot.png", package="lessR")) ``` The `pivot()` function computes statistics for three classes of variables: - Numerical variables that represent continuity such as time, money, and weight. - Numerical variables that represent ordered categories, such as Likert scale responses Strongly Disagree to Strongly Agree on a 5-pt scale coded 1 to 5. - Non-numerical categorical variables such as Gender encoded as Male, Female, and Other, or coded as integers but without numerical properties, such as 0, 1, and 2 for three values of Gender. Any function that processes a single vector of data, such as a column of data values for a variable in a data frame, and outputs a single computed value, the statistic, can be passed to `pivot()`. The phrase "any function" is quite general, including any function in any active package as well as user-defined functions that satisfy these criteria. The function accessed by `pivot()` can be one of the many available R statistical functions that summarize data, including those in the following table. Statistic | Meaning ----------- | ------------------------ `sum` | sum `mean` | arithmetic mean `median` | median `min` | minimum `max` | maximum `sd` | standard deviation `var` | variance `skew` | skew `kurtosis` | kurtosis `IQR` | inter-quartile range `mad` | mean absolute deviation The statistics `skew` and `kurtosis` have no counterparts in base R, so are provided by __lessR__. Computations of all other statistics follow from base R functions. The `quantile` and `table` computations return multiple values. They are accessible to `pivot()` only because of internal programming that recognizes these functions and then processes the resulting multiple output values. Statistic | Meaning ----------- | ------------------------- `quantile` | min, quantiles, max `table` | cell counts or proportions The default quantiles for `quantile` are quartiles. Specify a custom number of quantiles with the `q_num` parameter other than 4 for quartiles. The `table` computation applies to an aggregated variable that consists of discrete categories, such as the numbers 1 through 5 for responses to a 5-pt Likert scale. The result is a table of frequencies or proportions, referred to for two or more variables as either a contingency table, a cross-tabulation table, or a joint frequency distribution. Only the `table` computation applies to non-numeric as well as numeric variables, though only meaningful if the aggregated variable consists of a relatively small set of discrete values, character strings or numeric. When calculating a statistic, the analyst should be aware of the present and missing (not available or `NA`) data that underlies the computed statistic. Accordingly, `pivot()` by default provides the sample size information for each calculation. Turn off the display of information by setting the `show_n` parameter to `FALSE`. ### Parameters The following `pivot()` parameters specify the data, one or more statistics to compute for the aggregation, the variables over which to aggregate, and the corresponding groups that contain the aggregated values. The first two parameter values listed below are required: the data frame and at least one statistic to compute. If there is a numeric variable to be aggregated, then least one variable for which to compute the statistic(s) is also required. 1. `data`: The data frame that includes the variables of interest. 1. `compute`: One or more functions that specify the corresponding statistics to compute. 1. `variable`: One or more numerical variables to summarize, either by aggregation over groups or the entire sample as a single group. 1. `by`: Specify the optional aggregation according to the categorical variable(s) that define the groups. 1. `by_cols`: The optional categorical variable(s) that define the groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table instead of a data frame. For the given `data`, `compute` at least one statistic for at least one `variable`. If aggregating across groups, compute the statistic(s) for each group specified by `by` and possibly `by_cols`. If no categorical (`by`) variables are selected to define groups, then one or more statistics are computed over multiple variables over the entire data table defined as a single group. For categorical (grouping) variables, defined with `by` and optionally `by_cols`, choose to either `compute` multiple statistics or multiple `variable`(s), but not both. > Select any two of the three possibilities for multiple parameter values: Multiple compute functions, multiple variables over which to compute, and multiple categorical variables by which to define groups for aggregation. Specify multiple descriptive statistics to compute, multiple values for which to do the computation, and multiple categorical variables to define groups as vectors such as with the base R `c()` function. ### Output The default output of `pivot()` is a data frame, a two-dimensional table, rows and columns. The output table can have multiple rows and multiple columns according to the choice of parameter values. For each numerical variable in the analysis, `pivot()` displays both the corresponding sample size as `n_` and amount of missing or not available data as `na_`. Prevent displaying sample size information by setting parameter `show_n` to `FALSE`. The output is structured according to the following. - The default form is a data frame with the categorical variables in the analysis listed as columns first, followed by the numerical variables, ready for input into data analysis procedures. Each row of the output data frame consists of the corresponding values for the levels of the categorical variables and the corresponding values of any aggregated statistics. - If one or two categorical variables are specified with the `by_cols` parameter, the output is a table instead of a data frame with the specified categorical variables in the columns, amenable for viewing. - If there is no aggregation specified, no `by` variables, the corresponding statistics are computed over the entire input data frame defined as a single group. - There may be many variables to summarize, such as all the items on a multi-item scale from a survey analysis, so the output is a data frame. The variable names have appended an indication of the corresponding computed statistic and the rows defined by the combinations of levels of the `by` variables. ## Output as a Long-Form Data Frame To illustrate, use the 37-row Employee data set included with __lessR__, here read into the _d_ data frame. ```{r} d <- Read("Employee") ``` Two categorical variables in the _d_ data frame are _Dept_ and _Gender_, encoded as character strings. Continuous variables include _Years_ employed at the company and annual _Salary_. ### Multiple Grouping Variables This example does the analysis previously illustrated. The name of the computed mean of _Years_ by default is *Years_mean*. With no `by_cols` variables, the output of `pivot()` is a data frame of the aggregated variables. Optionally save this output for further analysis, as the data frame _a_ for aggregation in this example. ```{r} a <- pivot(data=d, compute=mean, variable=Years, by=c(Dept, Gender)) a ``` Using the __lessR__ function `Read()`, can read the original data table from which the pivot table was constructed, such as in the form of an Excel worksheet. For many analyses, easier to read the Excel data into R and do the analysis in R than remain in Excel. The result can also be written back into an Excel file with the __lessR__ function `Write()`. In this example, create an Excel file called _MyPivotTable.xlsx_ from the pivot table stored in the _a_ data frame. To avoid creating this file, the function call is commented out with the `#` symbol in the first column. ```{r} #Write("MyPivotTable", data=a, format="Excel") ``` The abbreviation `wrt_x()` for the function name simplifies the preceding expression for writing to an Excel file, with the `format` parameter dropped. ### Aggregate with Multiple Statistics In this next example, specify multiple statistics for which to aggregate for each group for the specified value variable _Salary_. For each group, compare the mean to the median, and the standard deviation to the interquartile range. By default, each column of an aggregated statistic is the `variable` name, here _Salary_, followed by a "\_", then either the name of the statistic or an abbreviation. The respective abbreviations for `mean` and `median` are `mn` and `md`. ```{r} pivot(d, c(mean, median, sd, IQR), Salary, Dept) ``` Also have available two functions provided by __lessR__ that are not part of base R: `skew()` and `kurtosis()`. ```{r} pivot(d, c(mean,sd,skew,kurtosis), Salary, Dept, digits_d=3) ``` Can also customize the variable names of the aggregated statistics with the `out_names` parameter. Here calculate the mean and median _Salary_ for each group defined by each combination of levels for _Gender_ and _Dept_. ```{r} pivot(d, c(mean, median), Salary, c(Gender,Dept), out_names=c("MeanSalary", "MedianSalary")) ``` ### Aggregate Multiple Variables The `pivot()` function can aggregate over multiple `variables`. Here, aggregate _Years_ and _Salary_. Round the numerical aggregated results to the nearest integer with the `digits_d` parameter, which specifies the number of decimal digits in the output. Different variables can have different amounts of missing data, so the sample size, _n_, and number missing, the number of values Not Available, _na_, are listed separately for each aggregated variable. ```{r} pivot(d, mean, c(Years, Salary), c(Dept, Gender), digits_d=0) ``` Turn off the display of the sample size and number of missing values for each group by setting the parameter `show_n` to `FALSE`. ```{r} pivot(d, mean, c(Years, Salary), Dept, digits_d=2, show_n=FALSE) ``` ### Compute over All Data Aggregation computes one or more statistics for one or more variables across different groups. Or, compute the variables for each statistic for _all_ the data. To compute over all the rows of data, do not specify groups, that is, drop the `by` parameter. Get the grand mean of Years, that is, for all the data. ```{r} pivot(d, mean, Years) ``` Get the grand mean of Years and Salary. Specify custom names for the results. ```{r} pivot(d, mean, c(Years, Salary), digits_d=2, out_names=c("MeanYear", "MeanSalary")) ``` Consider an example with more variables. Analyze the 6-pt Likert scale responses to the Mach IV scale that assesses Machiavellianism. Items are scored from 0 to 5, Strongly Disagree to Strongly Agree. The data are included with __lessR__ as the _Mach4_ data file. Suppress output when reading by setting `quiet` to `TRUE`. ```{r} d <- Read("Mach4", quiet=TRUE) ``` Calculate the mean, standard deviation, skew, and kurtosis for all of the data for each of the 20 items on the scale. With this specification, the form of a data frame is statistics in the columns and the variables in the rows. The result are the specified summary statistics for the specified variables over the entire data set. ```{r} pivot(d, c(mean,sd,skew,kurtosis), m01:m20) ``` ### Frequency Tables Aggregating a statistical computation of a continuous variable over groups with `pivot()`, such as computing the mean for each combination of _Dept_ and _Gender_, by default includes the tabulation for each group (cell). A tabulation can be requested with no analysis of a variable, instead only a counting of the available levels of the specified categorical variables. The `table` value of the `compute` parameter specifies to compute the frequency table for a categorical aggregated `variable` across all combinations of the `by` variables. Specify one categorical `variable` with the remaining categorical variables specified with the `by` parameter. Missing values for each combination of the levels of the grouping variables are displayed. Begin with a one-way frequency table computed over the entire data set. ```{r} pivot(d, table, by=m06) ``` In this example, compute a two-way cross-tabulation table with the levels of `variable` _m06_ as columns and the levels of the `by` categorical variable _m07_ as rows. ```{r} pivot(d, table, by=c(m06, m07)) ``` To output the data in long form, one tabulation per row, set the `table_long` parameter to TRUE. If interested in the inferential analysis of the cross-tabulation table, access the __lessR__ function `Prop_test()` to obtain both the descriptive and inferential results, though limited to a one- or two-way table. The default data table is _d_, but included explicitly in the following example to illustrate the `data` parameter. ```{r} Prop_test(m06, by=m07, data=d) ``` Can also aggregate other statistics simultaneously in addition to the frequency table, though, of course, only meaningful if the aggregated variable is numerical. Here, create a 3-way cross-tabulation table with responses to `variable` _m06_ in the column and responses to `by` variables _m07_ and _m10_ in the rows, plus the mean and standard deviation of each combination of _m07_ and _m10_ across levels of _m06_. ```{r} pivot(d, c(mean,sd), m06, c(m07, m10)) ``` Return to the _Employee_ data set for the remaining examples. ```{r} d <- Read("Employee", quiet=TRUE) ``` ### Quantiles One way to understand the characteristics of a distribution of data values of a continuous variable is to sort the values and then split into equal-sized groups. The simplest example is the median, which splits a sorted distribution into two groups of equal size, the bottom lowest values and the top highest values. Quartiles divide the sorted distribution into four groups. The first quartile is the smallest 25% of the data values, etc. > _Quantiles_: Numeric values, not necessarily data values, that divide a distribution of sorted data values into _n_ groups of equal size. By default, calling the quantile function computes quartiles. Here calculate the quartiles for _Years_ aggregated across levels of _Dept_ and _Gender_. ```{r} pivot(d, quantile, Years, c(Dept, Gender)) ``` To compute quantiles other than quartiles, invoke the `q_num` parameter, the number of quantile intervals. The default value is 4 for quartiles. In the following example, compute the quintiles for _Years_ and _Salary_, plus the mean and standard deviation. No specification of `by`, so these descriptive statistics are computed over the entire data set for both specified variables. ```{r} pivot(d, c(mean,sd,quantile), c(Years,Salary), q_num=5, digits_d=2) ``` ## Other Features ### Drill-Down One data analysis strategy examines the values of a variable, such as Sales for a business or Mortality for an epidemiology study, at progressively finer levels of detail. Examine by Country or State or City or whatever level of granularity is appropriate. > _Data drill down_: Examine the values of a variable when holding the values of one or more categorical variables constant. `pivot()` indicates drilling down deeper into the data with a display of all categorical variables with unique values, which precedes the primary output. Initiate the drill-down by a previous subset of the data frame, or indicate the subset with `pivot()` directly. As with other __lessR__ analysis functions, the `rows` parameter specifies a logical condition for which to subset rows of the data frame for analysis. In this example, compute the mean of _Salary_ for each level of _Dept_ for just those rows of data with the value of _Gender_ equal to "W". ```{r} pivot(d, mean, Salary, Dept, filter=(Gender=="W")) ``` The parentheses for the `rows` parameter are not necessary, but do enhance readability. Can also drill down by subsetting the data frame with a logical condition directly in the data parameter in the call to `pivot()`. Here drill down with base R `Extract[ ]` in conjunction with the __lessR__ function `.()` to simplify the syntax (explained in the vignette Subset a Data Frame). The `Extract[ ]` function specifies the rows of the data frame to extract before the comma, and the columns to extract after the comma. Here select only those rows of data with _Gender_ declared as Female. There is no information after the comma, so no columns are specified, which means to retain all columns, the variables in the data frame. ```{r} pivot(d[.(Gender=="W"),], mean, Salary, Dept) ``` ### Sort Output Specify the sort as part of the call to `pivot()` with the parameter `sort`. This internal sort works for a single `value` variable, by default the last column in the output data frame. Set to `"-"` for a descending sort. Set to `"+"` for an ascending sort. ```{r} pivot(d, mean, Salary, c(Dept, Gender), sort="-") ``` Specify the `sort_var` parameter to specify the name or the column number of the variable to sort. ```{r} pivot(d, c(mean, median), Salary, c(Gender,Dept), out_names=c("MeanSalary", "MedianSalary"), sort="-", sort_var="MeanSalary") ``` Because the output of `pivot()` with no `by_cols` variables is a standard R data frame, the external call to the __lessR__ function `order_by()` is available for custom sorting by one or more variables. Sort in the specified direction with the `direction` parameter. ```{r} a <- pivot(d, mean, Salary, c(Dept, Gender)) order_by(a, by=Salary_mean, direction="-") ``` Specify multiple variables to sort with a vector of variable names, and a corresponding vector of `"+"` and `"-"` signs of the same length for the `directions` parameter. ### Pipe Operator The following illustrates as of R 4.1.0 the base R pipe operator `|>` with `pivot()`. The pipe operator by default inserts the object on the left-hand side of an expression into the first parameter value for the function on the right-hand side. In this example, input the _d_ data frame into the first parameter of `pivot()`, the `data` parameter. Then direct the output to the data frame _a_ with the standard R assignment statement, though written pointing to the right hand side of the expression. To avoid problems installing this version of __lessR__ from source with a previous version of R, the code is commented out with a `#` sign in the first column. ```{r} #d |> pivot(mean, Salary, c(Dept, Gender)) -> a #a ``` ## Output as a 2-D Table Specify up to two `by_cols` categorical variables to create a two-dimensional table with the specified columns. Specifying one or two categorical variables as `by_cols` variables moves them from their default position in the rows to the columns, which changes the output structure from a long-form data frame to a table with categorical variables in the rows and columns. In this example, specify _Gender_ as the `by_cols` variable. ```{r} pivot(d, mean, Salary, by=Dept, by_cols=Gender) ``` Here two `by_cols` variables, specified as a vector. There is much missing data for this three-way classification as there is not much data in each group, with many groups having no data. ```{r} pivot(d, mean, Salary, Dept, c(Gender, Plan)) ``` ## Missing Data There are three different types of missing data in an aggregation: _some_ missing values for one or more aggregated `variable`s for which the statistic is computed, _all_ missing values for the `variable` so that the group (cell) defined by one or more `by` variables has no data values, and the data value for the `by` variable itself is missing. Accordingly, there are three different missing data parameters with `pivot()`. - `na_remove`: Remove any missing data from a value of the `variable`, then perform the aggregation on the remaining values, reporting how many values are missing. Otherwise report the computed statistic as NA (missing). [default is `TRUE`] - `na_by_show`: If all values of `variable` are missing for a group so that the entire level of the `by` variables is missing, show those missing cells with a reported value of computed variable _n_ as 0. Otherwise delete the row from the output. [default is `TRUE`] - `na_group_show`: The `by` data value is missing so no data is available for any of the aggregated statistics, all displayed as missing values. Otherwise the row with the missing `by` data value is deleted from the output. [default is `TRUE`] ### Some Missing Data at a Level of a `by` Variable By default, the data frame output of `pivot()` lists the number of occurrences of missing data for each group of the variable over which the statistic is computed. The `na_remove` parameter specifies if NA (missing) values of the aggregated variable should be removed before the computation proceeds. The default value `na_remove` is `TRUE`, so the missing values are removed, their number of occurrences reported, and then the computed statistic displayed. In this example, the variable _Years_ has one missing value, which occurs in the Sales department. That value is dropped from the computation of the mean of Years, which is based on the 14 non-missing values. ```{r} pivot(d, mean, Years, Dept) ``` Set `na_remove` to `FALSE` to _not_ remove any missing data in a cell with values to be aggregated. According to the way in which R functions process the data with missing values entered into the function, one missing value implies that the computed statistic has a missing value. ```{r} pivot(d, mean, Years, Dept, na_remove=FALSE) ``` ### All Missing Data at a Level of a `by` Variable Another possibility accounts for missing values of the categorical `by` variables for which all values of the variable for which to compute the statistic for that cell are missing. In that situation, there is no data value for which to compute the statistic. For this example, first create a cell with no values in the data aggregation. Use base R `Extract[]` with __lessR__ `.()` to drop the one male in the Sales department, leaving no data values for that group. Save the result into the _dd_ data frame. ```{r} dd <- d[.(!(Gender=="M" & Dept=="SALE")), ] ``` Also, to focus only on this type of missing data, remove the row of data that has a missing value for Dept, which is the third row. ```{r} head(dd) dd <- dd[-3,] ``` Here, explicitly set `na_by_show` to `TRUE`, the default value. The group for male sales employees is shown with the values of the computed variables *n_* and *na_* set to 0 and the values of the computed statistics necessarily missing for this row of the output data frame. ```{r} pivot(dd, c(mean,median), Salary, c(Dept, Gender), na_by_show=TRUE) ``` Drop the groups from the output with missing data for a `by` variable. To do so, set `na_by_show` to `FALSE`. Now the group for the non-existent male sales employee does not display. ```{r} pivot(dd, c(mean,median), Salary, c(Dept, Gender), na_by_show=FALSE) ``` ### Missing Data for a Level of the `by` Variable Itself The parameter `na_group_show` specifies if a value of the grouping variable itself should be displayed as level ``. By default, the `` levels are displayed but the `na_group_show` is explicitly included in the following. ```{r} pivot(d, c(mean,median), Salary, c(Dept, Gender), na_group_show=TRUE) ``` Set `na_group_show` to `FALSE` to not display as levels when a value of the `by` variable is missing. ```{r} pivot(d, c(mean,median), Salary, c(Dept, Gender), na_group_show=FALSE) ``` ## Build Your Own Function Any function that processes a single column of data and returns a single value, usually a statistic, can be accessed with `pivot()`. In this example, define a function named _mnmd()_ that computes the difference between the `mean` and `median` of a distribution of data values, here represented by `x`. For technical reasons, need to include the `NA` remove parameter, `na.rm`, in the function definition if there is any missing data. If missing data values are to be dropped in internal function calls, then set the base R parameter `na.rm` to `TRUE` for each component function as well. ```{r} mnmd <- function(x, na.rm=TRUE) mean(x, na.rm=na.rm) - median(x, na.rm=na.rm) ``` Invoke `pivot()` to compute the mean, the median, and their difference for each group in the aggregation of the variable _Years_. ```{r} pivot(d, c(mean, median, mnmd), Years, by=Dept) ``` ## __lessR__ `pivot()` vs __Base R__ `aggregate()` The __lessR__ `pivot()` function relies upon the __base R__ function `aggregate()` for aggregation. By default, except for the `table` computation, `pivot()` generates a long-form data frame pivot table (Excel terminology), which can then be directly input into analysis and visualization functions as a standard data frame. The levels across all the `by` grouping variables are listed in the rows. If there are specified column grouping variables according to `by_cols`, `pivot()` relies upon base R `reshape()` to form a 2-d table for direct viewing instead of a data table to input into further analysis functions. `pivot()` provides additional features than `aggregate()` provides. 1. For each `value` over which to aggregate, the sample size and number of missing values for each group is provided. 1. Multiple statistical functions can be selected for which to `compute` the aggregated value for each group. 1. Extends beyond aggregation to `compute` statistics over the entire data set instead of groups of data. 1. Missing data analysis by cell or by the `value` aggregated. 1. Aggregation not necessary, so can `compute` the specified statistic(s) for each `variable` across the entire data set. 1. The aggregated computations can be displayed as a 2-d table, not just a long-form data frame. 1. `by` variables of type `Date` retain the same variable type in the summary table instead of each converted to a `factor` (set `factors=FALSE` to not convert `integer` and `character` variables to factors). 1. The list of parameters lists the `data` parameter first, which facilitates the use of the pipe operator, such as from base R as of Version 4.1.0 or the __magrittr__ package. 1. Any non-numeric variables with unique values in the submitted `data` are listed with their corresponding data value, which identifies when drilling down into the data to study relevant rows. Although the `pivot()` function considerably extends the functionality of base R `aggregate()`, `pivot()` does rely upon this base R function for most of its computations.