--- title: "3. Models: Regression, Least-Squares and Logistic" author: "David Gerbing" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{3. Models: Regression, Least-Squares and Logistic} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r} library("lessR") ``` ```{r, include=FALSE} knitr::opts_chunk$set(fig.width=3.5, fig.height=3) ``` The `Regression()` function performs multiple aspects of a complete regression analysis. Abbreviate with `reg()`. To illustrate, first read the Employee data included as part of __lessR__. Read into the default __lessR__ data frame _d_. ```{r read} d <- Read("Employee") ``` As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be a `csv` file or an Excel file. Read the label file into the _l_ data frame, currently the only permitted name. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label, as shown in the display of the label file. ```{r labels} l <- rd("Employee_lbl") l ``` ## Default Analysis ### Brief Output The brief version provides just the basic analysis, what Excel provides, plus a scatterplot with the regression line, which becomes a scatterplot matrix with multiple regression. Because _d_ is the default name of the data frame that contains the variables for analysis, the `data` parameter that names the input data frame need not be specified. Here, specify _Salary_ as the target or response variable with features, or predictor variables, _Years_ and _Pre_. ```{r fig.width=4.5, fig.height=4.5} reg_brief(Salary ~ Years + Pre) ``` ### Full Output The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis. ```{r} reg(Salary ~ Years + Pre) ``` ### Standardize the Variables Request a briefer output with the `reg_brief()` version of the function. Standardize the predictor variables in the model by setting the `new_scale` parameter to `"z"`. Plot the residuals as a line connecting each data point to the corresponding point on the regression line as specified with the `plot_errors` parameter. To also standardize the response variable, set parameter `scale_response` to `TRUE`. ```{r, fig.width=4.5, fig.height=4} reg_brief(Salary ~ Years, new_scale="z", plot_errors=TRUE) ``` ### k-fold Cross-validation Specify a cross-validation with the `kfold` parameter. Here, specify three folds. The function automatically creates the training and testing data sets. ```{r} reg(Salary ~ Years, kfold=3) ``` The standard output also includes $R^2_{press}$, the value of $R^2$ when applied to new, previously unseen data, a value comparable to the average $R^2$ on test data. ## Output as a Stored Object The output of `Regression()` can be stored into an R object, here named _r_. The output object consists of various components that together define the output of a comprehensive regression analysis. R refers to the resulting output structure as a _list_ object. ```{r} r <- reg(Salary ~ Years + Pre) ``` Entering the name of the object displays the full output, the default output when the output is directed to the R console instead of saving into an R object. ```{r} r ``` Or, work with the components individually. Use the base R `names()` function to identify all of the output components. Component names that begin with `out_` are part of the standard output. Other components include just data and statistics designed to be input in additional procedures, including R markdown documents. ```{r} names(r) ``` Here, only display the estimates and their inferential analysis as part of the standard text output. ```{r} r$out_estimates ``` Here, display the numeric values of the coefficients. ```{r} r$coefficients ``` An analysis of hundreds or thousands of rows of data can make it difficult to locate a specific prediction interval of interest. To initiate a search for a specific row, first do the regression and request all prediction intervals with parameter `pred_rows`. Then convert that output to a data frame named _dp_ with base R `read.table()`. As a data frame, do a standard search for an individual row for a specific prediction interval (see the Subset a Data Frame vignette for directions to subset). This particular conversion to a data frame requires one more step. One or more spaces in the `out_predict` output delimit adjacent columns, but the names in this data set are formatted with a comma followed by a space. Use base R `sub()` to remove the space after the comma before converting to a data frame. ```{r} r <- reg(Salary ~ Years, pred_rows="all", graphics=FALSE) r$out_predict = sub(", ", ",", r$out_predict, fixed=TRUE) dp <- read.table(text=r$out_predict) dp[.(row.names(dp) == "Pham,Scott"),] ``` ## Contrasts Because `reg()` accomplishes its computations with base R function `lm()`, `lm()` parameters can be passed to `reg()`, which then passes the values to `lm()` to define the corresponding indicator variables. Here, first use base R function `contr.sum()` to calculate an effect coding contrast matrix for a categorical variable with three levels, such as the variable _Plan_ in the _Employee_ data set. ```{r} cnt <- contr.sum(n=3) cnt ``` Now use the `lm()` parameter `contrasts` to define the effect coding for _JobSat_, passed to `reg_brief()`. Contrasts only apply to factors, so convert _JobSat_ to an R factor before the regression analysis, a task that should generally be done for all categorical variables in an analysis. Here, designate the order of the levels on output displays such as a bar graph. ```{r fig.width=4.5, fig.height=4} d$JobSat <- factor(d$JobSat, levels=c("low", "med", "high")) reg_brief(Salary ~ JobSat, contrasts=list(JobSat=cnt)) ``` ## Null Model The $R^2$ fit statistic compares the sum of the squared errors of the model with the X predictor variables to the sum of squared errors of the null model. The baseline of comparison, the null model, is a model with no X variables such that the fitted value for each set of X values is the mean of response variable $y$. The corresponding slope intercept is the mean of $y$, and the standard deviation of the residuals is the standard deviation of $y$. The following submits the null model for _Salary_, and plots the errors. Compare the variability of the residuals to a regression model of _Salary_ with one or more predictor variables. To the extent that the inclusion of one or more predictor variables in the model reduces the variability of the data about the regression line compared to the null model, the model fits the data. ```{r fig.width=5} reg_brief(Salary ~ 1, plot_errors=TRUE) ``` Can also get the null model plot from the __lessR__ function `Plot()` with the `fit` parameter set to `"null"`. ## Likert Type Data The scatterplot is displayed as a bubble plot when both variables consist of less than 10 unique integer values. With the bubble plot, there is no overprinting of the same point so that the number of values that represent a point is displayed. ```{r likert, fig.width=4.5, fig.height=4.5} dd <- Read("Mach4") reg_brief(m10 ~ m02, data=dd) ``` ## Analysis of Covariance Obtain an ANCOVA by entering categorical and continuous variables as predictor variables. For a single categorical variable and a single continuous variable, `Regression()` displays the regression line for each level of the categorical variable. The ANCOVA assumes that the slopes for the different levels of the categorical variable are the same for the pairing of the continuous predictor variable and continuous response variable. Visually evaluate this assumption by plotting each separate slope and scatterplot. ```{r, fig.width=5, fig.height=4} Plot(Salary, Years, by=Dept, fit="lm") ``` Then, if the slopes are not too dissimilar, run the ANCOVA. The categorical variable must be interpretable as a categorical variable, either as an R variable type `factor` or as a non-numerical type `character` string. If the categorical variable is coded numerically, convert to a `factor`, such as `d$CatVar <- factor(d$CatVar)` which retains the original numerical values as the value labels. The ANCOVA displays the appropriate Type II Sum of Squares in its ANOVA table for properly evaluating the group effect that corresponds to the entered categorical variable. Note that this SS is only displayed for an ANOVA with a single categorical variable and a single covariate ```{r, fig.width=5, fig.height=4} reg_brief(Salary ~ Dept + Years) ``` ## Moderation Analysis To do a moderation analysis, specify one of (only) two predictor variables with the parameter `mod`. ```{r mod, fig.width=4.5, fig.height=4.5} reg_brief(Salary ~ Years + Pre, mod=Pre) ``` In this analysis, Pre is _not_ a moderator of the impact of Years on Salary. There is a tendency expressed by the non-parallel lines in the visualization, and an almost significant interaction, but the interaction was not detected at the $\alpha=0.5$ level. ## Logistic Regression For a model with a binary response variable, $y$, specify multiple logistic regression with the usual R formula syntax applied to the __lessR__ function `Logit()`. The output includes the confusion matrix and various classification fit indices. ### Default Analysis ```{r, fig.width=5, fig.height=4} d <- Read("BodyMeas") Logit(Gender ~ Hand) ``` ### Change Classification Threshold Specify additional probability thresholds for classification beyond just the default 0.5 with the `prob_cut` parameter. ```{r, fig.width=5, fig.height=4} Logit(Gender ~ Hand, prob_cut=c(.3, .5, .7)) ``` ### Plot Conditional Means across Bins Categorize Hand size into six bins. Compute the conditional mean of Gender, scored as 0 and 1, at each level of Hand size. Both variables must be numeric. The visualization approximates the form of the sigmoid function from logistic regression. The point (bubble) size depends on the sample size for the corresponding bin. ```{r, fig.width=5, fig.height=4} d$Gender <- ifelse (d$Gender == "M", 1, 0) Plot(Hand, Gender, n_bins=6) ``` ## Interpreted Output The parameter `Rmd` creates an R markdown file that is automatically generated and then the corresponding html document from knitting the various output components together with full interpretation. A new, much more complete form of computer output. Not run here. ``` reg(Salary ~ Years + Pre, Rmd="eg") ``` ## Full Manual Use the base R `help()` function to view the full manual for `Regression()`. Simply enter a question mark followed by the name of the function, or its abbreviation. ``` ?reg ```