--- title: "2. Visualize: Histogram and Related" author: "David Gerbing" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{2. Visualize: Histogram and Related} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r include=FALSE} suppressPackageStartupMessages(library("lessR")) ``` ```{r, include=FALSE} knitr::opts_chunk$set(fig.width=4.5, fig.height=3.5) ``` ## Data Most of the following examples are an analysis of data in the _Employee_ data set, included with __lessR__. First read the _Employee_ data into the data frame _d_. See the `Read and Write` vignette for more details. ```{r read} d <- Read("Employee") ``` As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need be entered into the table. The table can be a `csv` file or an Excel file. Read the label file into the _l_ data frame, currently the only permitted name. The labels will be displayed on both the text and visualization output. Each displayed label is the variable name juxtaposed with the corresponding label, as shown in the following output. ```{r labels} l <- rd("Employee_lbl") l ``` ## Histogram One of the most frequently encountered visualizations for continuous variables is the histogram, which outlines the general shape of the underlying distribution. >_Histogram_: Bin similar values into a group, then plot the frequency of occurrence of the data values in each bin proportional to the height of the corresponding bar. A call to a function to create a histogram contains the name of the continuous variable that contains the plotted values. With the `X()` function, that variable name is the first parameter value passed to the function. In this example, the _only_ parameter value passed to the function is the variable name. The data frame is named _d_, the default value. The following illustrates the call to `X()` with a continuous variable named $x$. ```{r hsform, dataTable, echo=FALSE, out.width='28%', fig.asp=.7, fig.align='center', out.extra='style="border-style: none"'} knitr::include_graphics(system.file("img", "hsExplain.png", package="lessR")) ``` To illustrate, consider the continuous variable _Salary_ in the Employee data table. Use `X()` to tabulate and display the number of employees in each department, here relying upon the default data frame (table) named _d_, so the `data=` parameter is not needed. ```{r hs, fig.width=4, fig.height=3.5, fig.align='center', fig.cap="Histogram of tablulated counts for the bins of Salary."} X(Salary) ``` By default, the `X()` function provides a color theme according to the current, active theme. The function also provides the corresponding frequency distribution, summary statistics, the table that lists the count of each category, from which the histogram is constructed, as well as an outlier analysis based on Tukey's outlier detection rules for box plots. ## Customize the Histogram ```{r} style(quiet=TRUE) ``` Use the parameters `bin_start`, `bin_width`, and `bin_end` to customize the histogram. ```{r binwidth, fig.width=4, fig.height=3.5, fig.align='center', fig.cap="Customized histogram."} X(Salary, bin_start=35000, bin_width=14000) ``` Easy to change the color, either by changing the color theme with `style()`, or just change the fill color with `fill`. Can refer to standard R colors, as shown with __lessR__ function `showColors()`, or implicitly invoke the __lessR__ color palette generating function `getColors()`. Each 30 degrees of the color wheel is named, such as `"greens"`, `"rusts"`, etc, and implements a sequential color palette. Use the `color` parameter to set the border color, here turned off. ```{r colors, fig.width=4, fig.height=3.5, fig.align='center', fig.cap="Customized histogram."} X(Salary, fill="reds", color="transparent") ``` The default for formatting both axis labels is to round numeric values of thousands, such as 100000 to 100K. With parameter `axis_fmt`, this default of to `{"K"}` can be changed. Also can specify `{","}` to insert commas in large numbers with a decimal point or `{"."}` to insert periods, or `{""}` to turn off formatting. The value of `{"K"}` can also be combined with `{","}` or `{"."}` by forming a vector of values, such as `c("K", ",")`. Axis labels can also formatted by adding a prefix to a numeric value with the parameter `axis_pre`, such as `$` or `€`. The value of `axis_pre` can be multiple characters, such as for the Brazilian currency, `R$`. ```{r axis, fig.width=4, fig.height=3.5, fig.align='center', fig.cap="Formatted axis values."} X(Salary, axis_fmt=",", axis_x_pre="$") ``` ## Density Plot The histogram portrays a continuous distribution with discrete bins, with more modern visualizations available that directly display the estimated underlying smooth curve. >_Density plot_: A smooth curve that estimates the underlying continuous distribution. To create a density plot, specify the value of `type` as `"density"` parameter. The result is the filled density curve superimposed on the histogram. ```{r density, fig.width=4, fig.height=3.5, fig.align='center', fig.cap="Histogram with density plot."} X(Salary, type="density") ``` The `kind` parameter indicates the type of density curve. The default is `"general"`. Options are `"normal"` for a normal density curve and `"both"` for both. ## VBS Plot A more modern version of the density plot combines the violin plot, box plot, and scatter plot into a single visualization, here called the VBS plot. ```{r VBS, fig.width=4, fig.height=3.5, fig.align='center', fig.cap="VBS plot."} X(Salary, type="vbs") ``` ## Interactive Input to Create a Histogram An interactive visualization lets the user in real time change parameter values to change characteristics of the visualization. To create an interactive histogram of the variable _Salary_ that displays the corresponding parameters, run the function `interact()` with `"Histogram"` specified. ``` interact("Histogram") ``` The `interact()` function is not run here because interactivity requires to run directly from the R console. ## Full Manual Use the base R `help()` function to view the full manual for `X()`. Simply enter a question mark followed by the name of the function. ``` ?Histogram ``` ## More More on Histograms and other visualizations from __lessR__ and other packages such as __ggplot2__ at: Gerbing, D., _R Visualizations: Derive Meaning from Data_, CRC Press, May, 2020, ISBN 978-1138599635.