---
title: "Diagnostic Functions Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Diagnostic Functions Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE}
library(tidyaudit)
library(dplyr)
```

Before you build a pipeline, you need to understand your data. Before you trust a join, you need to verify the keys. Before you filter, you want to know what you'll lose. tidyaudit's diagnostic functions help with the detective work that comes before, during, and after every analysis.

These functions are designed for **interactive use** — the questions you ask in the console while exploring a dataset or debugging a transformation. They complement the [audit trail system](tidyaudit.html), which instruments production pipelines.

## Before you join

You're about to left-join orders to customers. Will every order match? Will the join create duplicates? How many rows will end up with NAs?

`validate_join()` analyzes a potential join **without performing it**, reporting match rates, relationship type, duplicate keys, and unmatched rows:

```{r validate-join}
orders <- data.frame(
  id = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
  id = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

validate_join(orders, customers, by = "id")
```

### Different key names

When the key columns have different names, use a named vector:

```{r validate-join-named}
products <- data.frame(prod_id = 1:3, price = c(10, 20, 30))
sales <- data.frame(item_id = c(1L, 1L, 2L), qty = c(5, 3, 7))

validate_join(products, sales, by = c("prod_id" = "item_id"))
```

### Tracking the impact on a metric

Want to know the revenue at risk from unmatched keys?
Use `stat` (same column name in both tables) or `stat_x`/`stat_y` (different column names):

```{r validate-join-stat}
x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

validate_join(x, y, by = "id", stat_x = "revenue", stat_y = "cost")
```

## Are your keys unique?

You assume `id` uniquely identifies every row. Does it? `validate_primary_keys()` tests whether a set of columns forms a valid primary key:

```{r validate-pk}
df <- data.frame(
  id = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# Single column -- not unique
validate_primary_keys(df, "id")

# Composite key -- unique
validate_primary_keys(df, c("id", "group"))
```

### Beyond uniqueness: what's the relationship?

`validate_var_relationship()` determines the functional relationship between two columns — one-to-one, one-to-many, many-to-one, or many-to-many:

```{r validate-var-rel}
df2 <- data.frame(
  dept = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)

validate_var_relationship(df2, "dept", "manager")
```

## What changed between versions?

You have yesterday's extract and today's extract. Or the data before your transformation and after. What's different?

`compare_tables()` compares two data frames by examining columns, row counts, key overlap, and numeric discrepancies:

```{r compare-tables}
before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

compare_tables(before, after)
```

## Filter with visibility

`dplyr::filter()` silently removes rows. You apply the condition, the rows vanish, and you move on. `filter_keep()` and `filter_drop()` do the same filtering but tell you exactly what was removed — and optionally warn you when too much is gone.
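To see the kind of bookkeeping these wrappers automate, here is a minimal base-R sketch of the questions `filter_keep()` answers for you (using a made-up `txns` data frame; the package's actual report is richer than this):

```r
# Hypothetical data: what would filtering on amount > 100 remove?
txns <- data.frame(id = 1:5, amount = c(500, 25, 1200, 80, 3000))

keep <- txns$amount > 100
dropped <- txns[!keep, ]

n_dropped <- nrow(dropped)              # rows removed
pct_dropped <- n_dropped / nrow(txns)   # share of rows removed
amt_dropped <- sum(dropped$amount)      # metric lost (what .stat tracks)

c(rows = n_dropped, pct = pct_dropped, amount = amt_dropped)
```

Doing this by hand for every filter is tedious, which is exactly why the diagnostic output is built into the verbs below.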
### filter_keep

Keeps rows where the condition is `TRUE` (same as `dplyr::filter()`), with diagnostic output:

```{r filter-keep}
sales <- data.frame(
  id = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

result <- filter_keep(sales, amount > 100, .stat = amount)
```

### filter_drop

Drops rows where the condition is `TRUE` (the inverse):

```{r filter-drop}
result2 <- filter_drop(sales, status == "suspect", .stat = amount)
```

### Warning thresholds

Set `.warn_threshold` to get a warning when too many rows are dropped — a safety net for production pipelines:

```{r filter-warn, warning=TRUE}
filter_keep(sales, amount > 1000, .stat = amount, .warn_threshold = 0.5)
```

## Diagnose data quality

You just received a dataset. Where are the gaps?

### Missing values

`diagnose_nas()` reports NA counts and percentages for every column:

```{r diagnose-nas}
messy <- data.frame(
  id = 1:6,
  name = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

diagnose_nas(messy)
```

### Column summaries

`summarize_column()` gives type-appropriate statistics for a single vector — quantiles for numeric, value counts for character, balance for logical:

```{r summarize-column}
summarize_column(c(1, 2, 3, NA, 5, 10, 100))
summarize_column(c("apple", "banana", "apple", "cherry", NA))
```

`get_summary_table()` applies this across all columns (or a selection):

```{r get-summary-table}
get_summary_table(messy)
```

## Frequency tables

Base R's `table()` gives you raw counts. `tab()` gives you percentages, cumulative totals, sorting, cutoffs, weighting, and two-way crosstabulations.
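For context, the pieces `tab()` bundles can be assembled by hand from base R (a sketch of the idea, not the package's implementation):

```r
counts <- table(mtcars$cyl)        # raw counts per value of cyl
pct <- prop.table(counts) * 100    # convert counts to percentages
cum_pct <- cumsum(pct)             # running (cumulative) percentage

data.frame(
  cyl = names(counts),
  n = as.integer(counts),
  pct = as.numeric(pct),
  cum_pct = as.numeric(cum_pct)
)
```

`tab()` wraps this pattern in one call and adds the sorting, cutoff, weighting, and crosstab options shown below.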
### One-way tables

```{r tab-oneway}
tab(mtcars, cyl)
```

### Sorting and cutoffs

```{r tab-sort}
# Sort by frequency
tab(mtcars, carb, .sort = "freq_desc")

# Keep only the top 2 values, collapse the rest into (Other)
tab(mtcars, carb, .cutoff = 2)
```

### Two-way crosstabulations

```{r tab-twoway}
tab(mtcars, cyl, gear)

# Show row percentages instead of counts
tab(mtcars, cyl, gear, .display = "row_pct")
```

### Weighted tabulation

```{r tab-weighted}
tab(mtcars, cyl, .wt = mpg)
```

## String quality

Duplicates hiding behind case differences, leading spaces, and encoding issues are among the most common causes of failed joins and inflated group counts. `diagnose_strings()` audits a character vector for these problems:

```{r diagnose-strings}
firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "")

diagnose_strings(firms)
```

## Audit your transformations

You're about to apply `trimws()` to a column, or `round()` to a price vector. How many values will actually change? What will they look like afterward?

`audit_transform()` shows exactly what a transformation does.
It automatically detects the vector type and computes type-appropriate diagnostics:

- **Numeric**: mean/min/max shift, proportion changed
- **Date/POSIXct**: range change, proportion shifted
- **Factor**: level remapping (gained/lost levels)
- **Logical**: TRUE/FALSE/NA balance shift
- **Character**: string-level before/after comparison

### Character vectors

```{r audit-transform}
audit_transform(firms, trimws)
audit_transform(firms, tolower)
```

### Numeric vectors

```{r audit-transform-numeric}
prices <- c(10.456, 20.789, 30.123, NA, 50.999)

audit_transform(prices, round)
```

### Date vectors

```{r audit-transform-date}
dates <- as.Date(c("2024-01-15", "2024-06-30", "2024-12-01", NA))

audit_transform(dates, function(d) d + 30)
```

### Factor vectors

```{r audit-transform-factor}
sizes <- factor(c("S", "M", "L", "XL", "XXL", "S", "M"))

audit_transform(sizes, function(f) {
  levels(f)[levels(f) %in% c("XL", "XXL")] <- "XL+"
  f
})
```

### Logical vectors

```{r audit-transform-logical}
flags <- c(TRUE, FALSE, TRUE, NA, FALSE)

audit_transform(flags, function(x) !x)
```
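Conceptually, the numeric diagnostics are a before/after comparison of the input and output vectors. A rough base-R sketch (not the package's actual code) of the kind of quantities computed for the `round()` example:

```r
# Same price vector as in the numeric example above
before <- c(10.456, 20.789, 30.123, NA, 50.999)
after <- round(before)

changed <- before != after                    # NA where the input is NA
prop_changed <- mean(changed, na.rm = TRUE)   # proportion of values altered
mean_shift <- mean(after, na.rm = TRUE) - mean(before, na.rm = TRUE)

c(prop_changed = prop_changed, mean_shift = mean_shift)
```

Seeing these numbers before committing a transformation is the point: a large mean shift or an unexpectedly high proportion changed is a signal to inspect the function before applying it to the real column.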