---
title: "0. Data: Read, Write"
author: "David Gerbing"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{0. Data: Read, Write}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---


```{r include=FALSE}
suppressPackageStartupMessages(library("lessR"))
suppressPackageStartupMessages(library("kableExtra"))
```


```{r, include=FALSE}
knitr::opts_chunk$set(fig.width=3.5, fig.height=3)
```


## Data Table

Analyze the data values for at least one variable, such as the annual salaries of employees at a company. Typically have available the data for multiple variables. Organize the data values into a specific kind of structure from which analysis proceeds.

>_Data Table_: Organize data values into a rectangular data table with the data values for each variable in a column, and the name of the variable at the top of the column.
  
Store the structured data values within a computer file, such as an Excel or OpenDocument Spreadsheet (ODS) formatted file or text file. This file can be stored on your computer, an accessible local network, or the world wide web.
The data table in the following figure, formatted as an Excel file, contains four variables: _Years_, _Gender_, _Dept_, and _Salary_ plus an ID field called _Name_ for a total of five columns.


```{r dataTable, echo=FALSE, out.width='65%', fig.asp=.7, fig.align='center', fig.cap="Structure of a data table."}
knitr::include_graphics(system.file("img", "DataTable.png", package="lessR"))
```

Describe the data table by its columns, rows, and cell entries.

>_Data value_: The contents of a single cell of a data table, a specific measurement.

For example, according to the data values for employee Darnell Ritchie, he has worked at the company for 7 years, identifies as a male, and works in administration with an annual salary of \$43,788.26. 

>_Missing data_: A cell for which there is no recorded data value.

Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.

>_Variable name_: A short, concise word or abbreviation that identifies a column of data values in a data table. 

>_Instance_ or Example or Sample or Case: Each row of the data table, the data for a specific instance of a single person, organization, place, event, or whatever is being studied.

Encode the data table in one of a variety of computer file formats. Common formats include Excel files (`.xlsx`), OpenDocument Spreadsheet files (`.ods`), and text files the form of comma-separated value files (`.csv`), tab-delimited text files (`.txt`), or space-delimited text files (`.prn`). 

Analysis of data can only proceed with the data table identified and the relevant variables identified by their name.

>R data analysis functions analyze the data values for one or more specified variables, identified by their names, such as _Salary_.

Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.


## Read the Data into R

Your data organized as a data table exists somewhere as a data file stored on a computer system. 
To analyze data in a data table stored in a computer file, first read the data table from the computer file into a corresponding data table within a running R session.

>_Data frame_: A data table stored _within_ an R session, referenced by its name.

Each variable in a data table has a name, and so does the data table itself. Initially reference the data table stored on your computer system by its file name and location. When read into R, name the data table, the R data frame, with a name of your choice. Regardless of the file name on your computer system, typically name the data table _within_ the active R session, the data frame, as _d_ for data. Not only is _d_ easy to type, but it is also the `lessR` default data frame name for the data processed by its various analysis functions. 

When analyzing data read into R, the same data exists in two different locations: a computer file on your computer system, and an R data frame within a running R app. Different locations, different names: _same_ data. On your computer system, identify the data table by its file name and location. Within a running R app, identify the same data by the name of the data frame, such as _d_, within which the data from the computer file was read.

To read the data from a file into a data frame of a running R application, as with every other task in R, accomplish the task with a function. The R ecosystem, base R and its many packages, presents many such functions. We use the `lessR` function `Read()` for its simplicity and for its useful output that helps understanding the data that was read.

<hr>

By integrating many different read functions from R and contributed R packages, `lessR` functions `Read()` and `Write()` can seamlessly read and write data files in any one of the many file formats shown in the following table. The same `Read()` function syntax or `Write()` function syntax is identical regardless of file format, though some formats may present unique advanced options such as the `Write()` parameter `ExcelTable` for writing to an Excel table Instead of an ordinary Excel worksheet.


```{css, echo=FALSE}
.center-table table {
  width: 68%;
  margin-left: auto;
  margin-right: auto;
}
.center-table table tr:nth-child(even) {
    background-color: #f2f2f2 !important;
}
.center-table table td, 
.center-table table th {
  padding: 0.5px 4px;
  font-size: 0.98em;
}
```


::: {.center-table}
```{r}
#| label: wfmt
#| echo: false
library(knitr)
library(kableExtra)

# Create the data frame with multiple extensions in the same row
data <- data.frame(
  Extension = c(".csv<br>.tsv<br>.prn<br>.txt", 
                ".xls<br>.xlsx", 
                ".ods", 
                ".feather<br>.parquet", 
                ".rda", 
                ".sav, .zsav<br>.dta<br>.sas7bdat"
               ),
  Format = c("text, custom separator and decimal<br>
             text, tab separated values<br> 
             text, space separated values<br> 
             text, comma or tab separated",
             "Excel", 
             "ODS", 
             "feather<br>parquet", 
             "R", 
             "SPSS<br>Stata<br>SAS"
            ),
  Package = c("R utils", "openxlsx", "readODS", "arrow", "R base", "haven"
             ),
  Source = c("read.csv( )<br>read.delim( )<br>read.table( )<br>read.table( )<br>",
             "read.xlsx( )<br>", "read_ODS( )",
             "read_feather( )<br>read_parquet( )",
             "load( )",
             "read_spss( )<br>read_sas( )<br>read_stata( )"
            )
)

knitr::kable(data, format="html", escape=FALSE,
             caption="`lessR` `Read()` and `Write()` available data file formats.") |>
  kableExtra::kable_styling("striped") |>
  column_spec(1, background = rgb(236,243,249, maxColorValue=255)) # Col 1
```
:::

Notes.

- Data stored in all of these data file formats can be read and written by both R and Python functions, so the resulting data files are available to both systems.

- If the data are in the form of a text file that follows the European and South American convention of using a comma `,` for the decimal point, and if adjacent values are delimited with a semicolon, `;`, then either use `Read2()`, or specify a `format` of `txt` and then  manually add the parameters `dec=","` for the decimal indicator and `sep=";"` to `Read()` for the data value separator.

- Data file formats `.feather` and `.parque` are modern, recently developed cross-application formats, particularly applicable for data reading speed and large file size, respectively. The corresponding write functions from the `arrow` package strip away any row names. `Write()` does extra processing to save the row names as an additional variable, which is then recognized by `Read()` and converted back to row names.

- Packages `R utils` and `R base` are two of the six core R packages downloaded with R.

- `text` files consist only of text and a few control characters such as for a new line. They can be created with any text editor or word processor such as  MS Word or spreadsheet such as MS Excel. These generally recognizable listed four file types are also provided from MS Excel `File -> Save As...` .

- `.prn` is a file extension for "print" file, with columns aligned separated by blank spaces, such as saved by MS Excel. Of course, in this format missing data must be designated with codes instead of empty cells, such as part of a function call to `Read()` with parameter and values `missing=c(-99, "xxxx")` to define numeric and character string missing data codes if these are present in the data.

- SPSS files read by `Read()` undergo additional processing. Each (usually) integer variable with value labels is converted into two R variables: the original numeric code with the original variable name, and also the corresponding factor with the variable labels named with the original name plus the suffix _f.

- The `Read()` defaults are generally applicable but more options are usually avaialble for each core read function. To learn more, download the respective package and access the corresponding function help manual with `?function_name`, such as `?read_feather`.

<hr>


### Browse for the data file

To read the data, direct R to the location of the data file. R cannot read the data file until it knows where the data is stored. 
One option has you locate the data file on your computer system by browsing for it, navigating your file system until you locate the file. 

> To locate your data file by browsing through your file system, call the `Read()` function with an empty file reference `("")`, literally nothing between the quotes.

The following `Read()` statement reads the data stored as a rectangular data table from an external file stored on your computer system into an R data frame called _d_. The empty quotes indicate to R to open your file browser for you to locate the data file that already exists somewhere on your computer system).

```
d <- Read("")
```

As with all R (and Excel) functions, the call to invoke the function includes a matching set of parentheses. Any information within the parentheses specifies the information provided to the function for analysis.

The `<-` indicates to assign what is on the right of the expression, here the data read from an external file, to the object on the left, here the R data frame stored within R, named _d_ in this example. You can also use an ordinary equals sign, `=`, to indicate the assignment, but the `<-` is more descriptive, and more widely used by R practitioners.


### Specify location of the data file

Specify either the full path name of a file on your computer system, or specify a web address that locates the data table on the web. Also can explicitly select the location of the data file to be read within the quotes and parentheses. Specify either the full path name of a file on your computer system, or specify a web address that locates the data table on the web. Again, read the data into the _d_ data frame.\

```
d <- Read("path name" or "web address")
```

With Excel, R, or any other computer apps that processes data, enclose values that are character strings, such as a file name, in quotes. For example, to read the data stored on the web in the data file called _employee.xlsx_ into the data frame _d_, invoke the following `Read()` function call.


```
d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")
```
This example reads a data file on the web. To specify a location on your computer, provide the full path name of your data file, its name and location. To obtain this path name, first browse for the file with `Read("")`. The resulting output displays the path name of the identified file. Copy this path name and insert between the quotes of `Read("")`, save this and other R function calls in a text file, and then run the code in the future to directly read the data file for future analyses without needing to browse for its location.

With the `Read()` function, put nothing between the quotes to browse for a data file, or specify the location of the data file on your computer system or the web. Direct the data read from a file into an R data frame, usually named _d_, but can choose any valid name.


### Output of `Read()`

The `Read()` function displays useful output. Because R organizes analyses by variable name, it is crucial to know the exact variable names, including the pattern of capitalization.

In addition to the variable names, `Read()` displays each variable's type as stored in the computer, as numbers with or without decimal digits, or as character strings.  Also listed are the number of complete and missing values for each variable, the number of unique values for each variable, and sample data values.  

The following lists the output from reading from a data file downloaded with `lessR`. All that is needed to read these data files is the name. A file on the web cannot be specified here because you may not have web access when this file is generated.


```{r rline, fig.align='center', fig.cap="Output of Read()."}
d <- Read("Employee")
```

To allow for many variables, `Read()` lists the information for each variable in a row. Note that the data file organizes the variables by column.

Get more information from `Read()`, especially regarding missing data, by setting `brief` to `FALSE`.

```{r brief, fig.align='center', fig.cap="Output of Read()."}
d <- Read("Employee", brief=FALSE)
```

Obtain this same information at any time during an R analysis by calling the function `details()`. `Read()` calls this same function after reading the data, with `brief` set to `TRUE` by default.


### Two types of variables

Always distinguish _continuous variables_ from _categorical variables_. This distinction between these two types of variables is fundamental in data analysis. 

>_Continuous (quantitative) variable_: A numerical variable with many possible values.

>_Categorical (qualitative) variable_: A variable with relatively few unique labels as data values.

Examples of continuous variables are _Salary_ or _Time_. Examples of categorical variables are _Gender_ or _State_ of Residence, each with just a relatively few number of possible values compared to numerical values. This distinction of continuous and categorical variables is common to all data analytics.

Sometimes that distinction gets a little confusing because variables with integer values, which are numeric, could be either quantitative or qualitative. For example, sometimes Male, Female, and Other are encoded as 0, 1, and 2, respectively, for three levels of the categorical variable _Gender_. However, these integer values are just labels for different non-numeric categories. Best to avoid this confusion. Instead, encode categorical variables with non-numeric values, such as _Gender_, for example, with M F, and O for Other.


## Read Variable Labels

A variable label is a longer description of the corresponding variable than that of the variable name. The variable label displays in conjunction with the variable name on the text and visualization output to further clarify the interpretability of the output.

The variable label file has two columns. The the first column lists the variable names and the second column the corresponding labels. The file can be of type `.csv`, or `.xlsx`, or contained within `lessR`, as with the following example.

The variable labels must be read into the data frame _l_. Specify the `var_labels` parameter as `TRUE` to instruct the `Read()` function to read variable labels instead of data.

```{r}
l <- Read("Employee_lbl", var_labels=TRUE)
```


## Write the Data Table to a File

Relying upon the __have__ package, `lessR` can write data with function `Write()` in formats: text, Excel, ODS, and native R and native SPSS formats. 


::: {.center-table}
```{r}
#| label: fmt
#| echo: false
library(knitr)
library(kableExtra)

# Create the data frame with multiple extensions in the same row
data <- data.frame(
  Extension = c(".txt<br>.csv<br>.tsv<br>.prn", 
                ".xlsx", 
                ".ods", 
                ".feather", 
                ".parquet", 
                ".rda", 
                ".sav"),
  Format = c("text, custom separator and decimal<br>
             text, comma separated values<br> 
             text, tab separated values<br> 
             text, space separated values", 
             "Excel", 
             "ODS", 
             "feather", 
             "parquet", 
             "R", 
             "SPSS"), 
  Package = c("R utils", "openxlsx", "readODS", "arrow", "arrow", "R base",
              "haven"),
  Function = c("write.table( )<br>write.csv( )<br>write.delim( )<br>&nbsp;",
               "write.xlsx( )<br>", "write_ods( )",
               "write_feather( )", "write_parquet( )",
               "load( )", "write_sav( )")
)

# Create the formatted table with a light blue first column
knitr::kable(data, format="html", escape=FALSE,
             caption="`lessR` `Read()` and `Write()` data file formats.") |>
  kableExtra::kable_styling("striped") |>
  column_spec(1, background = rgb(236,243,249, maxColorValue=255))  # Light blue background for first column
```
:::

The R format preserves a binary copy of the data frame as it is stored within R, as does the SPSS format for native SPSS files. The Excel and ODS formats yield worksheets. The csv format yields a comma separated value text file. The `feather` and `parquet` modern cross-analysis application formats, optimized respectively for speed of input and output and for file size.

One procedure begins the analysis of data in `.csv`, `.xlsx`, or `.ods` format. Then proceed with data cleaning and preparation, including needed transformations and re-coding. When the data is ready for analysis, save the cleaned, prepared data as a native R file of format `.rda`, or `feather` or `parquet`. These formats are the most efficient for size and for speed of reading back into R, with all data preparations already completed. Can also write the data in a format such as Excel so that those on the analysis team not using R (or Python) can also access.

With the `Write()` function, 

- specify the name of the file for the output, or rely upon the default of the file name as the input data frame name
- specify the file type with the `format` parameter, with the default of `.csv`, or indicate the full file name with the file type and the format will be inferred.
- if variable labels are present, defined 

The following R statements that call `Write()` are not run here as the intent is not to create additional files. 

The following just indicates the data frame name, which results in writing a `.csv` file to `d.csv`.

```
Write(d)
```

The following two calls to `Write()` are equivalent.

```
Write(d, "GoodData.xlsx")
Write(d, "GoodData", format="Excel")
```

Or, if you prefer to write the data file in Excel format, you can also choose to write the data as an Excel table.

```
Write(d, "GoodData", format="Excel", ExcelTable=TRUE)
```

You can also use the abbreviation for an Excel file, `wrt_x()`.

```
wrt_x(d, "GoodData")
```

Write the data as a R data table.

```
Write(d, "GoodData", format="R")
```

Can also use the abbreviation for an R file, `wrt_r()`.

```
wrt_r(d, "GoodData")
```

Use `format="ODS"` to specify to write to the OpenDocument Spreadsheet format with file type `.ods`.
Use `format="SPSS"` to specify to write to the SPSS format with file type `.sav`.

The output of `Write()` indicates the full path name of the written file.


## Read SPSS (and SAS and Stata) Files

The __haven__ package has an excellent function for reading SPSS files, `read_spss()`. `Read()` adds to the given functionality by preserving the SPSS variable labels and value labels. `Read()` invokes this function to read SPSS files, with the default file types of `.sav` and `zsav`. The functionality of __haven__ also allows for reading SAS and Stata data files.


### Value labels

Within SPSS, an integer scored variable can have value labels. An example is Likert scaled data with 1 representing a Strongly Disagree to 5 a Strongly Agree. The corresponding SPSS variable has integer values but displays the more informative value labels such as in bar charts. This type of categorical variable corresponds to a `factor` in the R system. 

The `read_spss()` function preserves the value labels with a special variable type called `haven_labels`. These variables can be converted to an R factor for processing in the R system with the __haven__ function `as_factor()`. `Read()` performs this conversion automatically for each relevant variable.

One problem is that the factor conversion preserves the value labels listed in the correct order, but looses the original integer scoring information. To preserve both the labels as a standard R factor, and to preserve the original scoring, `Read()` converts the original read variable with the labels (type `haven_labels`) into two variables. The first variable, an integer variable with the original integer scoring, has the name of the read variable. The corresponding factor variable has the same name as the read variable with the suffix `\_f`.

For example, the use of `read_spss()` results in the following, here showing just the first four lines of data. The variable _region_ contains both the integer scoring and the value label that are part of the SPSS data file that was read.

```
# A tibble: 4 x 4
  city                         region growth income
  <chr>                     <dbl+lbl>  <dbl>  <dbl>
1 ALBANY-SCHNTADY-TROY,N.Y.    1 [NE]    -71   3313
2 ATLANTA,GA.                  2 [SE]    264   3153
3 BALTIMORE,MD.                1 [NE]     38   3540
4 BIRMINGHAM,ALA.              2 [SE]   -178   2528
```

With `Read()`, obtain the following standard R data frame. The variable _region_ is now a standard R integer variable, and *region_f* is the corrsponding factor.

```
                       city region region_f growth income
1 ALBANY-SCHNTADY-TROY,N.Y.      1       NE    -71   3313
2               ATLANTA,GA.      2       SE    264   3153
3             BALTIMORE,MD.      1       NE     38   3540
4           BIRMINGHAM,ALA.      2       SE   -178   2528
```

With these data, the analyst may accomplish a numerical analysis with the integer variable, and for analyses such as a bar chart, instead, display the corresponding value labels.


### Variable labels

 `lessR` automatically accesses variable labels stored in a data frame named _l_, and then displays with the variable name in text and visualization output. Usually, the variable labels are stored in an Excel or `.csv` file with two columns, the variable name, and the variable label. There are no column titles, just the names and labels. Then read into the _l_ data frame with the `var_labels` parameter set to `TRUE`.

```
l <- Read(file_reference, var_labels=TRUE)
```

`Read()` also processes the variable labels of these (usually) integer-scored variables with value labels in the SPSS data file. The overseers of the R system do not permit package authors to create stored data structures from their internal R code. Only the user can create these structures. As such, `Read()` lists each variable name, a comma, and then the corresponding label. This example only has one such relevant variable, _region_ and its factor equivalent.

```
Variable and Variable Label  --> See vignette("Read"), SPSS section
---------------------------
region,  region of US 
region_f,  region of US
``` 

To access these labels, copy the names and labels, paste into a text file, and then save as a file. Then read the file of names/labels into R with the preceding `Read()` statement.


## Full Manual

Use the base R `help()` function to view the full manual for `Read()` or `Write()`. Simply enter a question mark followed by the name of the function.

```
?Read
?Write
```