---
title: "Fast topic modeling with real books"
author: |
    Dan Hicks
    <hicks.daniel.j@gmail.com>
output:
    html_document:
        toc: true
        toc_float: true
vignette: >
  %\VignetteIndexEntry{Fast topic modeling with real books}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

In this vignette, we analyze a corpus of works from the long nineteenth century, attempting to recover the author of each one.  

The corpus is provided in the [`tmfast.realbooks` data package](https://github.com/dhicks/tmfast.realbooks). To install this data package use `remotes`: 
```{r}
#| eval: false
remotes::install_github('dhicks/tmfast.realbooks')
```
or specify the drat repository:
```{r}
#| eval: false
install.packages('tmfast.realbooks', repos = 'https://dhicks.github.io/drat/')
```

## Setup

```{r}
#| echo: false
knitr::opts_chunk$set(
      eval = requireNamespace('tmfast.realbooks', quietly = TRUE)
)
if (!requireNamespace('tmfast.realbooks', quietly = TRUE)) {
      warning('Data package not available; skipping execution')
}
```
```{r}
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
theme_set(theme_minimal())
library(ggbeeswarm)
library(tictoc)

library(tidytext)
library(tmfast)

library(tmfast.realbooks)
```

## Corpus assembly

We analyze works by 10 authors of the long nineteenth century: Jane Austen, Anne, Charlotte, and Emily Brontë, Louisa May Alcott, George Eliot, Mary Shelley, Charles Dickens, HG Wells, and HP Lovecraft. We load the corpus and use `tidytext::unnest_tokens()` to convert it into a long-format document-term matrix.

```{r}
data(corpus_raw)

## ~17 sec
tic()
dataf = corpus_raw |>
      unnest_tokens(term, text, token = 'words') |>
      count(gutenberg_id, author, title, term)
toc()

meta_df = distinct(dataf, author, title)
dataf
```

To reproduce the corpus download from scratch (requires network access and several minutes), see `data-raw/download.R` in the [`tmfast.realbooks` package](https://github.com/dhicks/tmfast.realbooks).

The number of works by each author varies widely, as does the total token count. 

```{r}
distinct(dataf, author, title) |>
      count(author)

with(dataf, n_distinct(author, title))
```


```{r}
dataf |>
      group_by(author, title) |>
      summarize(n = sum(n)) |>
      summarize(
            min = min(n),
            median = median(n),
            max = max(n),
            total = sum(n)
      ) |>
      arrange(desc(total))

dataf |>
      group_by(author, title) |>
      summarize(n = sum(n)) |>
      ggplot(aes(author, n, color = author)) +
      geom_boxplot() +
      geom_beeswarm() +
      scale_color_discrete(guide = 'none') +
      coord_flip()
```


## Vocabulary selection

In line with a common rule of thumb in topic modeling, we aim for a vocabulary of about 10 times as many terms as documents in the corpus.  

```{r}
vocab_size = n_distinct(dataf$author, dataf$title) * 10
vocab_size
```

`tmfast` provides two information-theoretic methods for vocabulary selection.  Both are based on the idea of a two-player guessing game.  I pick one of the documents from the corpus, then one of the terms from the document.  I tell you the term, and you have to guess which document I picked.  More informative terms have greater information gain (calculated as the Kullback-Leibler divergence) relative to a "baseline" distribution based purely on the process used to pick the document.  The difference between the two methods is in the document-picking process.  The `ndH` method assumes the document was picked uniformly at random from the corpus, so that no document is more likely to be picked than any other.  The `ndR` method assumes document probability is proportional to the document length, so that shorter documents are less likely to be picked.  This method implies that terms that are distinctive of shorter documents have high information gain, since they indicate "surprising" short documents.  

On either method, the most informative terms are often typographical or OCR errors, since these only occur in a single document.  To balance this, we multiply the information gain ($\Delta H$ for the uniform process, $\Delta R$ for the length-weighted process) by the log frequency of the term across the entire corpus ($\log n$).  So `ndH` is shorthand for $\log(n) \Delta H$ while `ndR` is shorthand for $\log(n) \Delta R$. 

```{r}
tic()
H_df = ndH(dataf, title, term, n)
R_df = ndR(dataf, title, term, n) |>
      mutate(in_vocab = rank(desc(ndR)) <= vocab_size)
toc()
H_df
R_df
``` 

The resulting term ranking of the two methods tend to be similar, but `ndR` is preferable in the current case because of the additional weight it gives to distinctive terms from shorter documents.  

```{r}
inner_join(H_df, R_df, by = 'term') |>
      ggplot(aes(ndH, ndR, color = in_vocab)) +
      geom_point(aes(alpha = rank(desc(ndH)) <= vocab_size))

inner_join(H_df, R_df, by = 'term') |>
      mutate(ndH_rank = rank(desc(ndH)), ndR_rank = rank(desc(ndR))) |>
      ggplot(aes(ndH_rank, ndR_rank, color = in_vocab)) +
      geom_point(aes(alpha = ndH_rank <= vocab_size)) +
      scale_x_log10() +
      scale_y_log10()
```

```{r}
vocab = R_df |>
      filter(in_vocab) |>
      pull(term)
head(vocab, 50)
```

```{r}
dataf |>
      filter(term %in% vocab) |>
      group_by(author, title) |>
      summarize(n = sum(n)) |>
      ggplot(aes(author, n, color = author)) +
      geom_boxplot() +
      geom_beeswarm() +
      scale_color_discrete(guide = 'none') +
      coord_flip()
```


## Fit topic models

```{r}
dtm = dataf |>
      filter(term %in% vocab) |>
      mutate(n = log1p(n))

n_authors = n_distinct(dataf$author)

tic()
fitted_tmf = tmfast(
      dtm,
      n = c(5, n_authors, n_authors + 5),
      row = title,
      column = term,
      value = n
)
toc()

screeplot(fitted_tmf, npcs = n_authors + 5)
```


## Topic exploration

Without renormalization, most of the works are spread across a few topics, and the topics don't clearly correspond to authors.  

```{r}
tidy(fitted_tmf, n_authors, 'gamma') |>
      left_join(meta_df, by = c('document' = 'title')) |>
      ggplot(aes(document, gamma, fill = topic)) +
      geom_col() +
      facet_wrap(vars(author), scales = 'free_x') +
      scale_x_discrete(guide = 'none') +
      scale_fill_viridis_d()
```

To renormalize, we need to choose a theoretical Dirichlet distribution. 

```{r}
alpha = peak_alpha(n_authors, 1, peak = .8, scale = 10)
target_entropy = expected_entropy(alpha)
target_entropy

exponent = tidy(fitted_tmf, n_authors, 'gamma') |>
      target_power(document, gamma, target_entropy)
exponent

tidy(fitted_tmf, n_authors, 'gamma', exponent = exponent) |>
      left_join(meta_df, by = c('document' = 'title')) |>
      ggplot(aes(document, gamma, fill = topic)) +
      geom_col() +
      facet_wrap(vars(author), scales = 'free_x') +
      scale_x_discrete(guide = 'none') +
      scale_fill_viridis_d()

tidy(fitted_tmf, n_authors, 'gamma', exponent = exponent) |>
      left_join(meta_df, by = c('document' = 'title')) |>
      ggplot(aes(topic, document, fill = gamma)) +
      geom_raster() +
      facet_grid(rows = vars(author), scales = 'free_y', switch = 'y') +
      scale_y_discrete(guide = 'none') +
      theme(strip.text.y.left = element_text(angle = 0))
```

After renormalization, there are distinctive topics for Alcott (5), Austen (4), and Wells (6 and 10). The Brontë sisters appear in topic 9, along with Eliot and Shelley; Charlotte and Shelley share topic 1. Dickens, Eliot, and Lovecraft share topic 7. Dickens, Lovecraft, and Wells are all spread across multiple topics. 

To aid interpretation, we create a crosswalk dataframe connecting topics to authors. 
```{r}
topic_author = tribble(
      ~topic , ~authors                    ,
      'V01'  , 'C. Brontë, Shelley'        ,
      'V02'  , 'Dickens and Lovecraft'     ,
      'V03'  , 'Dickens'                   ,
      'V04'  , 'Austen and Shelley'        ,
      'V05'  , 'Alcott'                    ,
      'V06'  , 'Wells'                     ,
      'V07'  , 'Dickens, Eliot, Lovecraft' ,
      'V08'  , 'Dickens'                   ,
      'V09'  , 'Brontës, Eliot, Shelley'   ,
      'V10'  , 'Dickens, Lovecraft, Wells'
)
```

To explore these topics further, we turn to the word-topic distribution.  We renormalize these distributions, as with the topic-doc distributions. 

```{r}
target_entropy_term = expected_entropy(5, k = vocab_size)
target_entropy_term

exponent_term = tidy(fitted_tmf, n_authors, 'beta') |>
      target_power(topic, beta, target_entropy_term)
exponent_term

beta_df = tidy(fitted_tmf, n_authors, 'beta', exponent = exponent_term)
```

After renormalization we construct a Silge plot, showing the top 10 terms for each topic.  `tidytext::reorder_within()` and `tidytext::scale_x_reordered()` are useful for constructing this plot. 

```{r}
top_terms = beta_df |>
      group_by(topic) |>
      arrange(topic, desc(beta)) |>
      top_n(15, beta) |>
      left_join(topic_author, by = 'topic')
top_terms

top_terms |>
      mutate(token = reorder_within(token, by = beta, within = topic)) |>
      ggplot(aes(token, beta)) +
      geom_point() +
      geom_segment(aes(xend = token), yend = 0) +
      facet_wrap(vars(topic, authors), scales = 'free_y') +
      coord_flip() +
      scale_x_reordered()
```

Most topics focus on character names, with three of the four Dickens topics corresponding to *The Pickwick Papers*, *Oliver Twist*, and *David Copperfield*.  Wells' topics appear to distinguish non-fiction essays (topic 6) from fiction (10).  Topic 1 groups together Charlotte Brontë and Shelley based on the use of French.