---
title: "2. Visualize Data Collected over Time"
author: "David Gerbing"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{2. Visualize Data Collected over Time}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include=FALSE}
knitr::opts_chunk$set(fig.width=6.5, fig.height=3)
```

```{r include=FALSE}
suppressPackageStartupMessages(library("lessR"))
style(suggest=FALSE)
```

Plot ordered data values collected over time in one of two ways that correspond to how the values are labeled.

- Run Chart: Label the data values from 1 to the number of values, ordered by their time of collection.
- Time Series Chart: Label the data values by the time and/or date each data value was collected.

## Run Chart

Create a run chart from two variables: the $x$-variable, the Index values, as the sequence of consecutive integers from 1 to the number of data values, and the $y$-variable that specifies the corresponding data values to be plotted. A run chart is meaningful for sequentially ordered numerical data values, such as values ordered by time. Plot a run chart of a single variable by generating the Index values, which is done by specifying the name of the $x$-variable, typically the first variable listed, as `.Index`. The name begins with a period, `.`, to avoid confusion with an existing variable. Analogous to a time series visualization, the run chart plots the data values sequentially, but without dates or times. An analysis of the runs can also be obtained.

```{r}
d <- Read("Employee")
```

The data values for the variable _Salary_ were not collected over time, but for illustration, here create a run chart of _Salary_ as if the data were collected over time. `XY()` creates the indices, the sequence of integers from 1 to the number of data values, when the $x$-variable is specified as `.Index`. Specifying `.Index` instructs `XY()` to plot the data in sequential order as a run chart.
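
To make the Index values and the idea of a run concrete, the following base R sketch builds the index for a short vector and counts the runs above and below the median. The vector `y_demo` is hypothetical, and the median-based definition of a run is an assumption made for illustration; `XY()` may define and report runs differently.

```{r}
# Hypothetical data values, assumed to be listed in order of collection
y_demo <- c(12, 15, 11, 18, 19, 17, 10, 9, 14, 16)

# Index values: consecutive integers from 1 to the number of data values
index_demo <- seq_along(y_demo)

# Classify each value as above or below the median, dropping values equal to it
center <- median(y_demo)
side <- ifelse(y_demo > center, "above", "below")[y_demo != center]

# A run is a maximal sequence of consecutive values on the same side of the center
runs_demo <- rle(side)
data.frame(side = runs_demo$values, length = runs_demo$lengths)
length(runs_demo$lengths)   # number of runs
```

The corresponding `XY()` call for _Salary_ follows.
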

```{r}
XY(.Index, Salary)
```

The default run chart displays the plotted points in a small size with connecting line segments. Change the size of the points with the parameter `size`, here set to zero to remove the points entirely. Fill the area under the line segments with the parameter `ts_area_fill`, here set to `"on"` for the default fill color, though any color can be specified. Remove the center line with the parameter `center_line` set to `"off"`. Display the analysis of the runs with the parameter `show_runs` set to `TRUE`.

```{r}
XY(.Index, Salary, size=0, ts_area_fill="on", center_line="off", show_runs=TRUE)
```

## Time Series Chart

### Dates

Create a time series from two variables: the $x$-variable as a date, and the $y$-variable that specifies the corresponding measured values to be plotted. Internally, the $x$-variable is stored as a variable of R type `Date`. Traditionally, the `Date` variable is created prior to calling `XY()`, such as with the R function `as.Date()`. However, `XY()` can also implicitly convert a date entered as a character string of digits, such as `"08/18/2024"`, to a formal `Date` data value, as explained below. Plotting a variable of type `Date` as the $x$-variable in a scatterplot automatically creates a time series visualization with each pair of adjacent points connected by a line segment.

R does not provide an automatic conversion of character string dates to a formal date variable, likely because the conversion is inherently ambiguous. A numerical date can be specified in multiple ways, so inferring the date format from the data values is not guaranteed to succeed, though it usually works. `XY()` will attempt the conversion for you. To facilitate verification of the correct date format, `XY()` displays its inferred format. `XY()` also allows an explicit date format specification with the parameter `ts_format`. View the list of all possible date formats by entering `?strptime` to display the corresponding help file.
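
The ambiguity of numerical dates can be seen directly with the base R function `as.Date()` and an explicit `format`. The following sketch parses one hypothetical date string under two of the conversion specifications documented in `?strptime`. It illustrates the ambiguity only and does not reproduce how `XY()` infers the format.

```{r}
# A hypothetical date string that is valid under two different conventions
date_string <- "08/03/2024"

# Read as month/day/year: August 3, 2024
as_mdy <- as.Date(date_string, format = "%m/%d/%Y")

# Read as day/month/year: March 8, 2024
as_dmy <- as.Date(date_string, format = "%d/%m/%Y")

format(as_mdy, "%Y-%m-%d")
format(as_dmy, "%Y-%m-%d")

# A string that fits only one convention yields NA under the other,
# because there is no month 18
as.Date("08/18/2024", format = "%d/%m/%Y")
```

Both conversions succeed without a warning, which is why checking the inferred format, or supplying one explicitly, is worthwhile.
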

Following are the five different forms of numerical date values, read as character strings, that `XY()` will convert to actual dates, an R variable of type `Date`. Expressing the year with all four digits is recommended though not usually necessary. The following examples use the hyphen, `-`, as the delimiter, but the forward slash, `/`, and period, `.`, can also be used.

- 2024-08-18: Four-digit year, one- or two-digit month, one- or two-digit day.
- 08-18-2024: One- or two-digit month, one- or two-digit day, four-digit year.
- 08-18-24: One- or two-digit month, one- or two-digit day, two-digit year.
- 18-08-2024: One- or two-digit day, one- or two-digit month, four-digit year.
- 18-08-24: One- or two-digit day, one- or two-digit month, two-digit year.

#### Daily Data

Enter the dates for daily data values in one of the above five numerical formats. Or, use the `ts_format` parameter to specify a format for non-numerical date values that can include the name of the corresponding month (as per `?strptime`).

#### Weekly Data

Enter the dates for weekly data values as with daily data values except that consecutive dates are one week apart. For example, each date represents the first day of the corresponding week, such as `"04/03/2024"` for the fourth day of March 2024, which begins the first full week in March 2024, followed by `"11/03/2024"` for the 11th day of the same month.

#### Monthly Data

Two possibilities exist for entering monthly data. Enter the dates for monthly data values as either:

- dates as with daily data values, except that consecutive dates are one month apart. For example, each date represents the first day of the corresponding month, such as `"01/03/2024"` for the first day of March 2024, followed by `"1/04/2024"` for the first day of April 2024.
- a four-digit year followed by the three-letter month abbreviation, all as a single data value. For example, `"2024 Jan"` followed by `"2024 Feb"`.

#### Quarterly Data

Two possibilities exist for entering quarterly data.
Enter the dates for quarterly data values as either:

- dates as with daily data values, except that consecutive dates are one quarter, or three months, apart. For example, represent a quarter with the first day of the first month of the corresponding quarter, such as `"01/01/2024"` for the first day of the first quarter followed by `"01/04/2024"` for the first day of the second quarter.
- a four-digit year followed by either Q1, Q2, Q3, or Q4, all as a single data value. For example, `"2024 Q1"` followed by `"2024 Q2"`.

#### Annual Data

Two possibilities exist for entering annual data. Enter the dates for annual data values as either:

- dates as with daily data values, except that consecutive dates are one year apart. For example, each date represents the first day of the corresponding year, such as `"01/01/2024"` for the first day of 2024, followed by `"01/01/2025"` for the first day of the following year.
- a four-digit year. For example, `"2024"` followed by `"2025"`.

### A Single Time Series

Read time series data of stock _Price_ for three companies: Apple, IBM, and Intel. The data table, called `StockPrice`, is part of __lessR__.

```{r}
d <- Read("StockPrice")
d[1:5,]
```

Activate a time series plot by setting the $x$-variable to a variable of R type `Date`, which is true of the variable _Month_ in this data set. A time series can also be plotted by passing a time series object, created with the base R function `ts()`, as the variable to plot. `XY()` will attempt to convert a four-digit integer year sequentially organized in increments of one year, or a date expressed as digits with `/` or `-` delimiters, such as `"08/18/2024"`, to a variable of type `Date`. However, this conversion is not without some ambiguity, so if it is incorrect, specify the correct date format with the parameter `ts_format`.

Here, plot the stock price over time just for _Apple_, with the two variables _Month_ and _Price_, the stock price.
The parameter `filter` specifies the rows of the input data frame retained for the analysis.

```{r}
XY(Month, Price, filter=(Company=="Apple"))
```

Add the default fill color by setting the `ts_area_fill` parameter to `"on"`. A custom color can also be specified.

```{r}
XY(Month, Price, filter=(Company=="Apple"), ts_area_fill="on")
```

### Multiple Time Series

#### On One Panel

With the `by` parameter, plot all three companies on the same panel.

```{r}
XY(Month, Price, by=Company)
```

Stack the plots by setting the parameter `ts_stack` to `TRUE`.

```{r}
XY(Month, Price, by=Company, ts_stack=TRUE)
```

#### Facets

With the `facet` parameter, plot all three companies on different panels, a Trellis plot.

```{r}
XY(Month, Price, facet=Company)
```

Do the Trellis plot with some color. Learn more about customizing visualizations in the vignette `utilities`.

```{r}
style(sub_theme="black", window_fill="gray10")
XY(Month, Price, facet=Company, n_col=1, fill="darkred", color="red", trans=.55)
```

Return to the default style and turn off text output for subsequent analyses.

```{r}
style()
style(quiet=TRUE)
```

Set a baseline of 25 with the `ts_area_split` parameter for a Trellis plot, with the default fill color.

```{r}
XY(Month, Price, facet=Company, xlab="", ts_area_fill="on", ts_area_split=25)
```

Change the aspect ratio with the `aspect` parameter, defined as height divided by width.

```{r}
XY(Month, Price, facet=Company, aspect=.5, ts_area_fill="slategray3")
```

Stack the three time series and fill under each curve with a version of the __lessR__ sequential range `"emeralds"`.

```{r, fig.width=6}
XY(Month, Price, by=Company, trans=0.4, ts_stack=TRUE, ts_area_fill="emeralds")
```

## Aggregation by Time

This example aggregates monthly stock price data by quarter. Available time units are `"years"`, `"quarters"`, `"months"`, `"weeks"`, and `"days"`. Also included is the special time unit `"days7"`, explained below in the Forecast section.
Aggregate with the parameter `ts_unit` (which relies upon functions from the `xts` package).

First, read the monthly data. The stock price for each company is reported monthly in the data table. To aggregate to quarters, set the `ts_unit` parameter to `"quarters"`. The default aggregation is the sum over the specified time period. The sum is appropriate when, for example, aggregating monthly sales over each quarter, but for stock _Price_ we want the mean stock price over the specified time period, so set the parameter `ts_agg` to `"mean"`. Focus just on the Apple stock price data with the `filter` parameter.

```{r}
d <- Read("StockPrice", quiet=TRUE)
XY(Month, Price, ts_unit="quarters", ts_agg="mean", filter=(Company=="Apple"))
```

Or, aggregate by years to smooth the curve further, with a dark red line.

```{r}
XY(Month, Price, ts_unit="years", ts_agg="mean", filter=(Company=="Apple"), color="darkred")
```

In the following example, aggregate by years for each of the three companies.

```{r}
XY(Month, Price, by=Company, ts_unit="years", ts_agg="mean")
```

## Forecast

`XY()` implements time series forecasting based on trend and seasonality with either exponential smoothing or regression analysis, including the accompanying visualization. Time series parameters include:

- `ts_method`: Set at `"es"` for exponential smoothing, the default, or `"lm"` for linear model regression.
- `ts_unit`: The time unit, either the naturally occurring interval between dates in the data, the default, or aggregated to a wider time interval.
- `ts_ahead`: The number of time units to forecast into the future.
- `ts_agg`: If aggregating the time unit, aggregate as the `"sum"`, the default, or as the `"mean"`.
- `ts_PIlevel`: The confidence level of the prediction intervals, with 0.95 the default.
- `ts_seasons`: Set to `FALSE` to turn off seasonality in the estimated model.
- `ts_trend`: Set to `FALSE` to turn off trend in the estimated model.
- `ts_error`: Type of error term.
- `ts_format`: Provides a specific format for the date variable if not detected correctly by default.
- `ts_source`: Default is time series forecasting from `"fable"` and related packages, or specify `"classic"`.

To forecast Apple's stock price, focus on the last several years of the data, from Row 400 through Row 473, the last row of data for Apple. In this example, forecast ahead 24 months. Here, rely upon the default exponential smoothing estimation procedure from the `fpp3` ecosystem package `fable`.

```{r}
d <- d[400:473,]
XY(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24)
```

Next, implement the classic Holt-Winters exponential smoothing method from the base R function `HoltWinters()`.

```{r}
XY(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24, ts_source="classic")
```

Or, do the regression with seasonality to forecast according to the parameter `ts_method`, here changed from its default value of `"es"`, exponential smoothing, to `"lm"` for linear model. The data are de-seasonalized, the regression analysis is performed, and then the seasonality is added back.

```{r}
XY(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24, ts_source="classic", ts_method="lm")
```

Here, do the linear regression forecast but without seasonality, according to the parameter `ts_seasons`.

```{r}
XY(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24, ts_source="classic", ts_method="lm", ts_seasons="N")
```

It is better to visually understand the characteristics of the time series before trying to forecast. To facilitate this understanding, here decompose the time series into its seasonal and trend components with the **lessR** function `STL()`, which relies upon the base R function `stl()` but provides more information and allows more flexible input.

```{r}
STL(Month, Price)
```

The traditional time units, such as `"days"` or `"quarters"`, evaluate seasonality over the entire year.
Quarterly and even monthly data can usually be meaningfully assessed for seasonality over the entire year. With daily data, however, seasonality is generally more meaningfully assessed over the days of the week. For example, sales may typically be higher on Monday than they are on Sunday.

Consider the following daily data for which we wish to evaluate seasonality over the days of the week. To indicate potential seasonality of daily data within a week, specify the time unit with the parameter `ts_unit` set to `"days7"`.

```{r}
#| echo: FALSE
set.seed(123)
days <- seq(as.Date("2024-01-01"), as.Date("2024-02-28"), by="day")
sales <- 100 + sin(2 * pi * (1:59) / 7) * 15 + rnorm(59, sd=8)  # Weekly seasonality
d <- data.frame(days, sales)
rm(days); rm(sales)
```

```{r}
XY(days, sales, ts_ahead=8, ts_unit="days7")
```

We now have a seasonality coefficient for each day of the week, and these coefficients are projected into the future for forecasting.

## Missing Data

#### Entire Record Missing

If the date value and the y-value are both missing, then the nearest adjacent points are connected by a line segment that runs over the missing data value, effectively linearly interpolating the missing value between the two adjacent present values.

For example, consider a daily time series related to the Tableau Superstore data such that "2021-01-07" and "2021-01-09" are both present with their corresponding y-values, but there is no date value or y-value for January 8, that is, "2021-01-08". To yield a single data value of Sales for each day, aggregate Sales by day.
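
Aggregating to one value per day, and then locating a day with no record at all, can be sketched in base R. The data frame `sales_demo` is hypothetical, and the sum is used because `"sum"` is the stated default for `ts_agg`. This is an illustration of the idea, not the code that `XY()` runs internally.

```{r}
# Hypothetical transactions, with more than one sale on some days
sales_demo <- data.frame(
  Date  = as.Date(c("2021-01-06", "2021-01-06", "2021-01-07", "2021-01-09")),
  Sales = c(10, 5, 20, 8)
)

# One row per day: sum the sales that share the same date
daily_demo <- aggregate(Sales ~ Date, data = sales_demo, FUN = sum)
daily_demo

# Days within the observed range that have no record at all
all_days <- seq(min(daily_demo$Date), max(daily_demo$Date), by = "day")
missing_days <- all_days[!all_days %in% daily_demo$Date]
missing_days
```

The data for this section's example follow.
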

```{r}
d <- read.table(text="
Order.Date Sales
2021-01-05 19.536
2021-01-06 473.820
2021-01-06 5.480
2021-01-06 12.780
2021-01-06 609.980
2021-01-06 31.120
2021-01-06 6.540
2021-01-06 19.440
2021-01-07 176.728
2021-01-07 10.430
2021-01-09 9.344
2021-01-09 31.200
2021-01-10 51.940
2021-01-10 2.890", header = TRUE)
```

Two sales are recorded on January 7 and two sales are recorded on January 9, but there is no record of any sales, or even a date, for January 8. The entire row of data for January 8 is missing.

Next, plot the aggregated Sales data by day for the dates from January 5 through January 10.

```{r}
XY(Order.Date, Sales, ts_unit="days")
```

The resulting visualization plots the y-value for January 7 and also for January 9, with a line segment connecting those two points. There is no corresponding label on the x-axis for the missing data value, nor is there a plotted point. The January 9 value is appropriately placed two days after the January 7 value on the visualization.

#### Only the y-value is Missing

If the date value exists and the corresponding y-value is missing, with value `NA`, then the visualization leaves the corresponding y-value blank. Here, insert the missing row for January 8 with missing data, `NA`, for that date.

```{r}
new_row <- data.frame(
  Order.Date = "2021-01-08",
  Sales = NA
)
d <- rbind(d, new_row)
d <- order_by(d, by=Order.Date)
d[9:12,]
```

Now, plot.

```{r}
XY(Order.Date, Sales, ts_unit="days")
```

There is now a blank space in the visualization for January 8. If it is instead better to treat the missing value as zero sales for that day, specify the value of 0 for the parameter `ts_NA`.

```{r}
XY(Order.Date, Sales, ts_unit="days", ts_NA=0)
```

## Data Structures

Data can be stored in different types of structures, different forms of organization.
`XY()` can plot a time series from three different data structures:

- long-form
- wide-form
- time-series object

### Long-Format Data

The previous examples of plotting time series data read data stored in long format. Long-format data organizes data with each row of the data table containing only a single measurement. If the entity provides multiple data values, then the data values are stored in multiple rows. For example, if observations of Apple's stock price are taken monthly, then each row of the data table contains only a single stock price. Or, from another perspective, the data values for each company are each stored on a separate row.

```{r}
d <- Read("StockPrice", quiet=TRUE)
head(d)
```

Many data analysis and visualization functions across a variety of statistical systems require long-format data. As such, this organization of data is the most common data structure, but other possibilities do exist.

### R Time-Series Object Data

`XY()` can also plot directly from an R time series object, created with the base R `ts()` function. We create the object from a wide-form data table. In the wide form, the three companies each have their own column of data, repeated for each date. Use the __lessR__ function `reshape_wide()` to do the conversion.

```{r}
dw <- reshape_wide(d, widen=Company, response=Price, ID=Month)
head(dw)
```

From the wide-form data table for Apple stock price, create the time series object.

```{r}
a1.ts <- ts(dw$Apple, frequency=12, start=c(1980, 12))
XY(a1.ts, data=NULL)
```

With the **lessR** `style()` function many themes can be selected, such as `"lightbronze"`, `"dodgerblue"`, `"darkred"`, and `"gray"` for gray scale. When no `theme` or any other parameter value is specified, `style()` returns to the default theme, `colors`.

```{r}
style()
```

## Annotation

The annotations in the following visualization consist of the text field "iPhone" with an arrowhead that points to the time that the first iPhone became available.
With __lessR__, list each component of the annotation as a vector for the `add` parameter. Any value listed that is not a keyword such as `"rect"` or `"arrow"` is interpreted as a text field. Then, in order of their occurrence in the vector for `add`, list the needed coordinates for the objects. Placing the text field "iPhone" requires one coordinate, `(x1, y1)`. Placing an `"arrow"` requires two coordinates, `(x1, y1)` and `(x2, y2)`. For example, the second element of the `y1` vector is the `y1` value for the "arrow". The text field does not require a second coordinate, so specify `x2` and `y2` as single elements instead of vectors.

```{r fig.width=4.5}
d <- Read("StockPrice")
x <- as.Date("2007-06-01")
XY(Month, Price, filter=(Company == "Apple"), ts_area_fill="on",
   add=c("iPhone", "arrow"), x1=c(x,x), y1=c(100,90), x2=x, y2=30)
```

## Full Manual

Use the base R `help()` function to view the full manual for `XY()`. Simply enter a question mark followed by the name of the function.

```
?XY
```

## More

More on Scatterplots, Time Series plots, and other visualizations from __lessR__ and other packages such as __ggplot2__ at:

Gerbing, D., _R Visualizations: Derive Meaning from Data_, CRC Press, May, 2020, ISBN 978-1138599635.