--- title: "Fast topic modeling with real books" author: | Dan Hicks output: html_document: toc: true toc_float: true vignette: > %\VignetteIndexEntry{Fast topic modeling with real books} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- In this vignette, we analyze a corpus of works from the long nineteenth century, attempting to recover the author of each one. The corpus is provided in the [`tmfast.realbooks` data package](https://github.com/dhicks/tmfast.realbooks). To install this data package use `remotes`: ```{r} #| eval: false remotes::install_github('dhicks/tmfast.realbooks') ``` or specify the drat repository: ```{r} #| eval: false install.packages('tmfast.realbooks', repos = 'https://dhicks.github.io/drat/') ``` ## Setup ```{r} #| echo: false knitr::opts_chunk$set( eval = requireNamespace('tmfast.realbooks', quietly = TRUE) ) if (!requireNamespace('tmfast.realbooks', quietly = TRUE)) { warning('Data package not available; skipping execution') } ``` ```{r} library(dplyr) library(tidyr) library(tibble) library(ggplot2) theme_set(theme_minimal()) library(ggbeeswarm) library(tictoc) library(tidytext) library(tmfast) library(tmfast.realbooks) ``` ## Corpus assembly We analyze works by 10 authors of the long nineteenth century: Jane Austen, Anne, Charlotte, and Emily Brontë, Louisa May Alcott, George Eliot, Mary Shelley, Charles Dickens, HG Wells, and HP Lovecraft. We load the corpus and use `tidytext::unnest_tokens()` to convert it into a long-format document-term matrix. ```{r} data(corpus_raw) ## ~17 sec tic() dataf = corpus_raw |> unnest_tokens(term, text, token = 'words') |> count(gutenberg_id, author, title, term) toc() meta_df = distinct(dataf, author, title) dataf ``` To reproduce the corpus download from scratch (requires network access and several minutes), see `data-raw/download.R` in the [`tmfast.realbooks` package](https://github.com/dhicks/tmfast.realbooks). The number of works by each author varies widely, as does the total token count. ```{r} distinct(dataf, author, title) |> count(author) with(dataf, n_distinct(author, title)) ``` ```{r} dataf |> group_by(author, title) |> summarize(n = sum(n)) |> summarize( min = min(n), median = median(n), max = max(n), total = sum(n) ) |> arrange(desc(total)) dataf |> group_by(author, title) |> summarize(n = sum(n)) |> ggplot(aes(author, n, color = author)) + geom_boxplot() + geom_beeswarm() + scale_color_discrete(guide = 'none') + coord_flip() ``` ## Vocabulary selection In line with a common rule of thumb in topic modeling, we aim for a vocabulary of about 10 times as many terms as documents in the corpus. ```{r} vocab_size = n_distinct(dataf$author, dataf$title) * 10 vocab_size ``` `tmfast` provides two information-theoretic methods for vocabulary selection. Both are based on the idea of a two-player guessing game. I pick one of the documents from the corpus, then one of the terms from the document. I tell you the term, and you have to guess which document I picked. More informative terms have greater information gain (calculated as the Kullback-Leibler divergence) relative to a "baseline" distribution based purely on the process used to pick the document. The difference between the two methods is in the document-picking process. The `ndH` method assumes the document was picked uniformly at random from the corpus, so that no document is more likely to be picked than any other. The `ndR` method assumes document probability is proportional to the document length, so that shorter documents are less likely to be picked. This method implies that terms that are distinctive of shorter documents have high information gain, since they indicate "surprising" short documents. On either method, the most informative terms are often typographical or OCR errors, since these only occur in a single document. To balance this, we multiply the information gain ($\Delta H$ for the uniform process, $\Delta R$ for the length-weighted process) by the log frequency of the term across the entire corpus ($\log n$). So `ndH` is shorthand for $\log(n) \Delta H$ while `ndR` is shorthand for $\log(n) \Delta R$. ```{r} tic() H_df = ndH(dataf, title, term, n) R_df = ndR(dataf, title, term, n) |> mutate(in_vocab = rank(desc(ndR)) <= vocab_size) toc() H_df R_df ``` The resulting term ranking of the two methods tend to be similar, but `ndR` is preferable in the current case because of the additional weight it gives to distinctive terms from shorter documents. ```{r} inner_join(H_df, R_df, by = 'term') |> ggplot(aes(ndH, ndR, color = in_vocab)) + geom_point(aes(alpha = rank(desc(ndH)) <= vocab_size)) inner_join(H_df, R_df, by = 'term') |> mutate(ndH_rank = rank(desc(ndH)), ndR_rank = rank(desc(ndR))) |> ggplot(aes(ndH_rank, ndR_rank, color = in_vocab)) + geom_point(aes(alpha = ndH_rank <= vocab_size)) + scale_x_log10() + scale_y_log10() ``` ```{r} vocab = R_df |> filter(in_vocab) |> pull(term) head(vocab, 50) ``` ```{r} dataf |> filter(term %in% vocab) |> group_by(author, title) |> summarize(n = sum(n)) |> ggplot(aes(author, n, color = author)) + geom_boxplot() + geom_beeswarm() + scale_color_discrete(guide = 'none') + coord_flip() ``` ## Fit topic models ```{r} dtm = dataf |> filter(term %in% vocab) |> mutate(n = log1p(n)) n_authors = n_distinct(dataf$author) tic() fitted_tmf = tmfast( dtm, n = c(5, n_authors, n_authors + 5), row = title, column = term, value = n ) toc() screeplot(fitted_tmf, npcs = n_authors + 5) ``` ## Topic exploration Without renormalization, most of the works are spread across a few topics, and the topics don't clearly correspond to authors. ```{r} tidy(fitted_tmf, n_authors, 'gamma') |> left_join(meta_df, by = c('document' = 'title')) |> ggplot(aes(document, gamma, fill = topic)) + geom_col() + facet_wrap(vars(author), scales = 'free_x') + scale_x_discrete(guide = 'none') + scale_fill_viridis_d() ``` To renormalize, we need to choose a theoretical Dirichlet distribution. ```{r} alpha = peak_alpha(n_authors, 1, peak = .8, scale = 10) target_entropy = expected_entropy(alpha) target_entropy exponent = tidy(fitted_tmf, n_authors, 'gamma') |> target_power(document, gamma, target_entropy) exponent tidy(fitted_tmf, n_authors, 'gamma', exponent = exponent) |> left_join(meta_df, by = c('document' = 'title')) |> ggplot(aes(document, gamma, fill = topic)) + geom_col() + facet_wrap(vars(author), scales = 'free_x') + scale_x_discrete(guide = 'none') + scale_fill_viridis_d() tidy(fitted_tmf, n_authors, 'gamma', exponent = exponent) |> left_join(meta_df, by = c('document' = 'title')) |> ggplot(aes(topic, document, fill = gamma)) + geom_raster() + facet_grid(rows = vars(author), scales = 'free_y', switch = 'y') + scale_y_discrete(guide = 'none') + theme(strip.text.y.left = element_text(angle = 0)) ``` After renormalization, there are distinctive topics for Alcott (5), Austen (4), and Wells (6 and 10). The Brontë sisters appear in topic 9, along with Eliot and Shelley; Charlotte and Shelley share topic 1. Dickens, Eliot, and Lovecraft share topic 7. Dickens, Lovecraft, and Wells are all spread across multiple topics. To aid interpretation, we create a crosswalk dataframe connecting topics to authors. ```{r} topic_author = tribble( ~topic , ~authors , 'V01' , 'C. Brontë, Shelley' , 'V02' , 'Dickens and Lovecraft' , 'V03' , 'Dickens' , 'V04' , 'Austen and Shelley' , 'V05' , 'Alcott' , 'V06' , 'Wells' , 'V07' , 'Dickens, Eliot, Lovecraft' , 'V08' , 'Dickens' , 'V09' , 'Brontës, Eliot, Shelley' , 'V10' , 'Dickens, Lovecraft, Wells' ) ``` To explore these topics further, we turn to the word-topic distribution. We renormalize these distributions, as with the topic-doc distributions. ```{r} target_entropy_term = expected_entropy(5, k = vocab_size) target_entropy_term exponent_term = tidy(fitted_tmf, n_authors, 'beta') |> target_power(topic, beta, target_entropy_term) exponent_term beta_df = tidy(fitted_tmf, n_authors, 'beta', exponent = exponent_term) ``` After renormalization we construct a Silge plot, showing the top 10 terms for each topic. `tidytext::reorder_within()` and `tidytext::scale_x_reordered()` are useful for constructing this plot. ```{r} top_terms = beta_df |> group_by(topic) |> arrange(topic, desc(beta)) |> top_n(15, beta) |> left_join(topic_author, by = 'topic') top_terms top_terms |> mutate(token = reorder_within(token, by = beta, within = topic)) |> ggplot(aes(token, beta)) + geom_point() + geom_segment(aes(xend = token), yend = 0) + facet_wrap(vars(topic, authors), scales = 'free_y') + coord_flip() + scale_x_reordered() ``` Most topics focus on character names, with three of the four Dickens topics corresponding to *The Pickwick Papers*, *Oliver Twist*, and *David Copperfield*. Wells' topics appear to distinguish non-fiction essays (topic 6) from fiction (10). Topic 1 groups together Charlotte Brontë and Shelley based on the use of French.