---
title: "Co-occurrence Networks with IMDB Movie Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Co-occurrence Networks with IMDB Movie Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  message = FALSE,
  warning = FALSE
)
```

This tutorial demonstrates every feature of the `cooccure` package using 1,000 highly-rated IMDB movies (rating $\geq$ 7.0, $\geq$ 1,000 votes, 1970--2024). Starting from raw tabular data, we build genre co-occurrence networks, construct actor collaboration networks, compare co-occurrence patterns across decades and rating bands, apply different similarity measures and counting methods, and export results to `Gephi`, `igraph`, and `cograph` for visualization and downstream analysis.

## Data

```{r load-data}
library(cooccure)
```

The dataset contains the title of the movie, the year and decade, the `genres` (comma-separated), the average rating, the number of votes, and the rating band.

```{r}
head(movies)
```

## 1. Genre co-occurrence (delimited field)

Each movie has a comma-delimited `genres` column, and two genres are connected whenever they appear in the same movie. To analyze their co-occurrence, call `cooccurrence()` with `field` naming the column and `sep` giving the delimiter that separates values within that column.

```{r genre-basic}
cooccurrence(movies, field = "genres", sep = ",")
```

The result shows 22 genre nodes and 129 edges. *Drama* dominates the top pairs simply because it is the most frequent genre, inflating co-occurrence with nearly every other genre regardless of actual association strength. Jaccard similarity normalizes for this, so the strongest edges reflect true affinity rather than frequency.
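
To see why normalization changes the ranking, a Jaccard score can be computed by hand as the number of movies containing both genres divided by the number containing either. The sketch below uses a small hypothetical set of transactions (not drawn from `movies`), and the helper `pair_stats()` is illustrative rather than part of `cooccure`.

```{r jaccard-by-hand}
# Hypothetical toy transactions, one character vector per movie
toy <- list(
  c("Drama", "Comedy"),
  c("Drama", "Comedy"),
  c("Drama", "Comedy"),
  c("Drama", "Romance"),
  c("Drama", "Crime"),
  c("Adventure", "Animation"),
  c("Adventure", "Animation")
)

# Number of transactions that contain each genre
occ <- table(unlist(toy))

# Raw count and Jaccard score for one pair (illustrative helper)
pair_stats <- function(a, b) {
  both <- sum(vapply(toy, function(x) all(c(a, b) %in% x), logical(1)))
  either <- occ[[a]] + occ[[b]] - both
  c(count = both, jaccard = both / either)
}

rbind(
  "Drama-Comedy"        = pair_stats("Drama", "Comedy"),
  "Adventure-Animation" = pair_stats("Adventure", "Animation")
)
```

In this toy data *Drama*–*Comedy* has the higher raw count (3 versus 2), but *Adventure*–*Animation* has the higher Jaccard score (1 versus 0.6) because neither of its genres appears without the other.
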
The chunk below builds a Jaccard-weighted genre network, converts it to a directed `cograph` graph, and plots its degree distribution as a bar chart. The chart shows how many nodes share each degree value and whether connectivity is concentrated in a few genres or spread evenly across them.

```{r genre-jaccard}
library(cograph)
Net <- co(movies, field = "genres", sep = ",", similarity = "jaccard")
Gr <- as_cograph(Net, directed = TRUE)
degree_distribution(Gr)
```

### Comparing similarity measures

Each similarity measure emphasizes a different aspect of co-occurrence and therefore surfaces different structure in the data. The examples below show the top 3 genre pairs in `movies` under each measure.

- `similarity = "none"`: Raw counts favor the most frequent genre pairs. *Drama* dominates because it is the most common genre, so its co-occurrences with *Comedy* and *Romance* rank highest regardless of how strongly the genres are actually associated.

```{r sim-none}
co(movies, field = "genres", sep = ",", similarity = "none", top_n = 3)
```

- `similarity = "jaccard"`: Normalizing by the union of occurrences brings less frequent but more tightly associated pairs to the top. *Adventure*–*Animation* emerges as the strongest pair, suggesting these genres appear together more consistently relative to how often either appears alone.

```{r sim-jaccard}
co(movies, field = "genres", sep = ",", similarity = "jaccard", top_n = 3)
```

- `similarity = "cosine"`: Similar to Jaccard but less strict, cosine also elevates *Adventure*–*Animation* while keeping *Drama*–*Romance* in the top 3, reflecting its more lenient treatment of frequency differences between genres.

```{r sim-cosine}
co(movies, field = "genres", sep = ",", similarity = "cosine", top_n = 3)
```

- `similarity = "inclusion"`: Dividing by the less common genre's frequency surfaces subset relationships.
*Documentary*–*News* and *History*–*News* both score 1, meaning every *News* movie is also tagged *Documentary* and *History*: *News* is a niche genre that only appears alongside broader ones.

```{r sim-inclusion}
co(movies, field = "genres", sep = ",", similarity = "inclusion", top_n = 3)
```

- `similarity = "association"`: Discounting by the product of individual frequencies reveals pairs that co-occur far more than chance would predict. Rare genre combinations like *History*–*News* and *Documentary*–*News* top the list because their co-occurrence is disproportionately high relative to their individual frequencies.

```{r sim-association}
co(movies, field = "genres", sep = ",", similarity = "association", top_n = 3)
```

- `similarity = "dice"`: Results closely mirror Jaccard, with the same top pairs in the same order but slightly higher weights, reflecting Dice's more lenient arithmetic-mean normalization. *Adventure*–*Animation*, *Action*–*Crime*, and *Comedy*–*Drama* remain the strongest genre affinities. *Comedy*–*Drama* has the highest raw co-occurrence count (159) despite its comparatively lower weight, indicating that both genres are individually common but not exclusively paired.

```{r sim-dice}
co(movies, field = "genres", sep = ",", similarity = "dice", top_n = 3)
```

- `similarity = "equivalence"`: Squaring the cosine amplifies differences between strong and weak pairs, pushing weaker associations further down the rankings. *Adventure*–*Animation* retains the top spot, but *Drama*–*Romance* moves ahead of *Action*–*Crime*, pushing it into third place. This suggests that while *Action*–*Crime* has a higher raw count, *Drama*–*Romance* is the tighter and more exclusive pairing once the penalty for common genres is compounded.

```{r sim-equivalence}
co(movies, field = "genres", sep = ",", similarity = "equivalence", top_n = 3)
```

### Which similarity to use?
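
The measures compared above differ only in how they normalize a pair's count by the frequencies of its two items. The sketch below writes out the standard definitions (following van Eck & Waltman, 2009) for two hypothetical pairs. The function name `similarities()` and the counts are illustrative, and it is an assumption that `cooccure` uses exactly these forms, including any constant factors.

```{r similarity-formulas}
# cij = number of transactions containing both items
# si, sj = number of transactions containing each item
similarities <- function(cij, si, sj) {
  cosine <- cij / sqrt(si * sj)
  c(
    none        = cij,
    jaccard     = cij / (si + sj - cij),
    dice        = 2 * cij / (si + sj),
    cosine      = cosine,
    equivalence = cosine^2,
    inclusion   = cij / min(si, sj),
    association = cij / (si * sj)
  )
}

# Hypothetical counts: a frequent pair and a rare but nearly exclusive pair
round(rbind(
  frequent = similarities(cij = 150, si = 600, sj = 300),
  niche    = similarities(cij = 20,  si = 25,  sj = 30)
), 4)
```

The frequent pair wins on raw count while the niche pair wins on every normalized measure, which is the pattern behind the guidance that follows.
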

- **Exploratory work**: Start with `"none"` to see raw counts and understand the data, then try `"jaccard"` or `"cosine"` for a balanced view.
- **Bibliometric and scientometric networks**: `"association"` is recommended by van Eck & Waltman (2009) because it correctly accounts for the expected number of co-occurrences under independence. Two items that are both very frequent will naturally co-occur often; association strength discounts this, revealing which pairs co-occur *more than chance alone would predict*.
- **Detecting hierarchical or subset structure**: `"inclusion"` (the Simpson coefficient) reveals when one item almost always appears with another. It is useful for identifying items that are subsets of broader categories, or for detecting dependency relationships.
- **Binary presence/absence networks**: `"jaccard"` or `"dice"` when you only care *whether* items co-occur, not *how often*. Jaccard is stricter (it penalizes unbalanced pairs more); Dice is more lenient.
- **Scale-invariant comparison**: `"cosine"` is invariant to absolute frequency, which is useful when comparing co-occurrence patterns across datasets of different sizes.
- **Strict filtering**: `"equivalence"` (cosine squared) amplifies differences: pairs with weak overlap are pushed closer to zero, so only the strongest associations stand out.

## 2. Counting methods

By default, every pair of items in a transaction adds 1 to that pair's count: if a movie has genres A, B, and C, each of the three pairs adds 1. This means movies with many genres inflate the network: a 5-genre movie creates 10 pairs, while a 2-genre movie creates only 1, giving multi-genre movies disproportionate influence. **Fractional counting** addresses this by weighting each pair by $1/(n-1)$, where $n$ is the number of items in the transaction, so that each genre's links within a movie sum to 1 and a movie's total contribution grows linearly with its number of genres rather than quadratically.
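
A quick way to see the effect of the $1/(n-1)$ weight is to total what a single transaction adds under each scheme. The sketch below is plain arithmetic on the formula stated above, not output from `cooccure`, and the column names are illustrative.

```{r fractional-by-hand}
# Total weight one transaction adds to the network, by number of items n
n <- 2:5
n_pairs <- choose(n, 2)                  # pairs generated by n items

totals <- data.frame(
  n_items          = n,
  n_pairs          = n_pairs,
  total_full       = n_pairs,            # each pair adds 1
  total_fractional = n_pairs / (n - 1)   # each pair adds 1 / (n - 1)
)
totals
```

Under full counting a 5-genre movie adds 10 units of weight against 1 for a 2-genre movie. With the $1/(n-1)$ weight the same movies add 2.5 and 1, so the total grows as $n/2$ instead of $n(n-1)/2$.
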

```{r counting-full}
co(movies, field = "genres", sep = ",", top_n = 5)
```

```{r counting-fractional}
co(movies, field = "genres", sep = ",", counting = "fractional", top_n = 5)
```

The top pairs remain the same under both methods, but the weights are lower under fractional counting. For example, *Comedy*–*Drama* drops from 159 to 107.5, reflecting the downweighting of multi-genre movies that contributed to this pair. Fractional counting is particularly important when some transactions contain many items while others contain few, as it prevents high-cardinality transactions from dominating the network.

## 3. Scaling

Scaling compresses or transforms the weight distribution after similarity normalization, making it easier to visualize or use in downstream analysis. Scaling can be applied on its own or combined with any similarity measure.

- `scale = "log"`: Applies a natural log transformation, compressing the heavy tail of the distribution. The ranking of pairs is preserved but the gap between frequent and infrequent pairs is reduced. *Comedy*–*Drama* and *Drama*–*Romance* remain at the top, but their weights are now much closer together than raw counts would suggest.

```{r scale-log}
co(movies, field = "genres", sep = ",", scale = "log", top_n = 5)
```

- `scale = "minmax"`: Rescales all weights to the range [0, 1], where the strongest pair scores 1 and all others are expressed relative to it. This is useful for comparing networks of different sizes or when absolute counts are not meaningful. *Adventure*–*Animation* scores a perfect 1 as the strongest Jaccard pair in the network. *Biography*–*Documentary* reaches the top 5 with a raw count of only 44, indicating that when these genres appear together they do so with high exclusivity relative to how often either appears alone.

```{r scale-minmax}
co(movies, field = "genres", sep = ",", similarity = "jaccard",
   scale = "minmax", top_n = 5)
```

- `scale = "binary"`: Converts all positive weights to 1, producing a presence/absence network. The top pairs are no longer ranked by strength but simply by whether they co-occur at all. *Action*–*Adventure* appears at the top simply because the two genres appear together across a wide range of combinations.

```{r scale-binary}
co(movies, field = "genres", sep = ",", scale = "binary", top_n = 5)
```

- `scale = "sqrt"`: Applies a square root transformation, providing a milder compression than log. The ranking is preserved and the distribution is slightly less skewed than the raw counts. *Comedy*–*Drama* leads with a weight of 12.6 compared to *Comedy*–*Romance* at 7.9, a smaller proportional gap than the raw count difference of 159 versus 63 would imply.

```{r scale-sqrt}
co(movies, field = "genres", sep = ",", scale = "sqrt", top_n = 5)
```

Scaling can be combined with any similarity measure and with filtering arguments. The example below applies association strength followed by log scaling, retaining only genres appearing in at least 20 movies. *Drama*-heavy pairs drop out entirely once popularity is penalized, and niche but tightly linked pairs like *Adventure*–*Animation*, *History*–*War*, and *Mystery*–*Thriller* rise to the top, reflecting genres that genuinely cluster together rather than simply co-occurring by volume.

```{r scale-combined}
co(movies, field = "genres", sep = ",", similarity = "association",
   scale = "log", min_occur = 20, top_n = 5)
```

## 4. Filtering

Three filtering arguments control which edges appear in the result. They can be used independently or combined.

- `min_occur`: Drops any genre appearing in fewer than the specified number of transactions before co-occurrences are computed, removing rare items that would otherwise inflate the edge count. Here only genres that appear in at least 20 movies are kept.

```{r min-occur}
co(movies, field = "genres", sep = ",", similarity = "jaccard", min_occur = 20)
```

- `threshold`: Retains only edges with a weight at or above the specified value, applied after similarity normalization and scaling. Here only pairs with a Jaccard similarity of at least 0.15 are kept.

```{r threshold}
co(movies, field = "genres", sep = ",", similarity = "jaccard", threshold = 0.15)
```

- `top_n`: Keeps only the n strongest edges by weight, regardless of their absolute value.

```{r top-n}
co(movies, field = "genres", sep = ",", similarity = "jaccard", top_n = 10)
```

All three filters can be combined for fine-grained control over network size and density:

```{r combined-filter}
co(movies, field = "genres", sep = ",", similarity = "association",
   counting = "fractional", min_occur = 15, threshold = 0.001, top_n = 20)
```

## 5. Actor co-occurrence (long/bipartite format)

Actors who appear in the same movie are connected. The data has one row per actor–movie pair, making it a natural fit for the long/bipartite format. The `field` argument specifies the entity column and `by` specifies the grouping column.

```{r actor-net}
co(actors, field = "actor", by = "tconst", similarity = "jaccard",
   min_occur = 3, threshold = 0.1)
```

Julia Bache-Wiig and Robin Ottersen score 1.0, a perfect Jaccard score, meaning they appear together in every movie either of them appears in. Applying fractional counting downweights pairs from movies with large casts, though in this case the results are unchanged because the dataset is small and cast sizes are similar across movies.

```{r actor-frac}
co(actors, field = "actor", by = "tconst", similarity = "jaccard",
   counting = "fractional", min_occur = 3, threshold = 0.05)
```

## 6. Splitting by groups

The `split_by` argument computes a separate co-occurrence network for each level of a grouping variable and returns all results in a single data frame with a `group` column.

- `split_by = "decade"`: Each decade gets its own Jaccard-weighted genre network. The dominant pairs shift across decades, reflecting how genre combinations have changed over time.

```{r split-decade}
co(movies, field = "genres", sep = ",", split_by = "decade",
   similarity = "jaccard", min_occur = 5, top_n = 5)
```

Individual groups can be extracted by filtering the `group` column:

```{r filter-decade}
decades <- co(movies, field = "genres", sep = ",", split_by = "decade",
              similarity = "jaccard", min_occur = 5, top_n = 5)
decades[decades$group == "2010s", ]
```

- `split_by = "rating_band"`: Splitting by rating band reveals whether highly rated movies have different genre co-occurrence patterns than average-rated ones. *Documentary*–*Music* is the strongest pair among top-rated movies, while *Adventure*–*Animation* leads in the 7–7.9 band.

```{r split-rating}
movies$rating_band <- ifelse(movies$averageRating >= 8, "8+", "7-7.9")
co(movies, field = "genres", sep = ",", split_by = "rating_band",
   similarity = "jaccard", min_occur = 10, top_n = 5)
```

## 7. Output formats

The default output format returns a tidy data frame with `from`, `to`, `weight`, and `count` columns, ready for further analysis or visualization.

```{r out-default}
co(movies, field = "genres", sep = ",", top_n = 5)
```

### Gephi

The Gephi output format returns a data frame formatted for direct import into Gephi, with `Source`, `Target`, `Weight`, `Type`, and `Count` columns. The result can be written straight to CSV.

```{r out-gephi}
co(movies, field = "genres", sep = ",", similarity = "jaccard",
   output = "gephi", top_n = 10)
```

```{r gephi-export, eval=FALSE}
write.csv(
  co(movies, field = "genres", sep = ",", similarity = "jaccard", output = "gephi"),
  "genre_network.csv",
  row.names = FALSE
)
```

### cograph (with Gephi layout)

The cograph output format returns a `cograph_network` object that can be passed directly to `splot()` for visualization.
The `layout` argument controls the node placement algorithm (`"fr"` uses Fruchterman-Reingold), and `scale_nodes_by = "degree"` sizes nodes by their degree centrality.

```{r out-cograph, fig.width=8, fig.height=8}
library(cograph)
net <- co(movies, field = "genres", sep = ",", similarity = "jaccard",
          min_occur = 20, output = "cograph")
splot(net, layout = "fr", scale_nodes_by = "degree")
```

Additional styling arguments control edge width (`edge_width_range`), label size (`label_size`), node fill (`node_fill`), and node border width (`node_border_width`):

```{r cograph-styled, fig.width=8, fig.height=8}
library(cograph)
splot(net, layout = "gephi",
      label_size = .8, label_fontface = "bold",
      node_fill = "#F9C22E", node_border_width = 0.0001,
      edge_color = "black", scale_nodes_by = "degree",
      edge_width_range = c(0.1, 4))  # minimum and maximum edge width
```

### igraph

The igraph output format returns an `igraph` object. All standard igraph functions work on the result without any conversion.

```{r out-igraph}
g <- co(movies, field = "genres", sep = ",", similarity = "jaccard",
        min_occur = 20, output = "igraph")
g
```

Centrality measures can be computed directly on the `igraph` object:

```{r igraph-metrics}
igraph::degree(g)
igraph::betweenness(g)
```

### Matrix

The matrix output format returns a square co-occurrence matrix where rows and columns are items and each cell contains the similarity weight between the corresponding pair. This is useful for downstream analysis that expects a matrix input, such as clustering or heatmap visualization.

```{r out-matrix}
mat <- co(movies, field = "genres", sep = ",", similarity = "jaccard",
          min_occur = 20, output = "matrix")
round(mat[1:6, 1:6], 3)
```

## 8. Converters

Converters transform a `cooccurrence` result into other formats after the fact, without re-running the computation. This is useful when you want to start with the default tidy data frame and convert to a specific format only when needed.

`as_matrix()` converts the result to a square similarity matrix, where each cell contains the Jaccard weight between the corresponding pair of genres:

```{r conv-matrix}
result <- co(movies, field = "genres", sep = ",", similarity = "jaccard",
             min_occur = 20)
as_matrix(result)
```

`as_matrix(result, type = "raw")` returns the raw co-occurrence count matrix instead, with no similarity normalization applied:

```{r conv-raw}
as_matrix(result, type = "raw")
```

`as_igraph()` converts the result to an igraph object, giving access to the full igraph ecosystem for further network analysis:

```{r conv-igraph}
as_igraph(result)
```

## 9. Six input formats, one result

The same data can be provided in different formats; `cooccure` auto-detects the format and produces identical results regardless of which representation is used. The four examples below, covering four of these formats, all compute the same genre co-occurrence network from the same underlying data.

*Delimited field* is the most common format. A single column contains all genres as a comma-separated string, one row per movie.

```{r fmt-delimited}
res1 <- co(movies, field = "genres", sep = ",")
```

*Long/bipartite* format uses one row per genre–movie pair. The data is first reshaped from wide to long, then passed to `co()` with `field` specifying the genre column and `by` specifying the movie identifier.

```{r fmt-long}
genre_long <- do.call(rbind, lapply(seq_len(nrow(movies)), function(i) {
  gs <- trimws(strsplit(movies$genres[i], ",")[[1]])
  data.frame(movie_id = movies$tconst[i], genre = gs, stringsAsFactors = FALSE)
}))
res2 <- co(genre_long, field = "genre", by = "movie_id")
```

*Binary matrix* uses a document-term matrix where rows are movies, columns are genres, and values are 0 or 1. This format is auto-detected when all values are binary and no `field`, `by`, or `sep` arguments are provided.

```{r fmt-binary}
all_genres <- sort(unique(genre_long$genre))
bin <- matrix(0L, nrow = nrow(movies), ncol = length(all_genres),
              dimnames = list(movies$tconst, all_genres))
for (i in seq_len(nrow(genre_long))) {
  row <- match(genre_long$movie_id[i], movies$tconst)
  bin[row, genre_long$genre[i]] <- 1L
}
res3 <- co(bin)
```

*List of character vectors* is the most direct format, where each list element is a character vector of genres for one movie.

```{r fmt-list}
res4 <- co(lapply(strsplit(movies$genres, ","), trimws))
```

All four produce identical weights, confirming that the format choice is purely a matter of convenience:

```{r fmt-verify}
all.equal(res1$weight, res2$weight)
all.equal(res1$weight, res3$weight)
all.equal(res1$weight, res4$weight)
```

## 10. Complete pipeline

All steps can be combined in a single call and piped directly to `splot()` for visualization. The example below applies Jaccard similarity, fractional counting, min-max scaling, and filtering in one expression, converting the result to a `cograph_network` and rendering it without any intermediate objects.

```{r pipeline, fig.width=8, fig.height=8}
co(movies, field = "genres", sep = ",",
   similarity = "jaccard", counting = "fractional", scale = "minmax",
   min_occur = 15, threshold = 0.05, output = "cograph") |>
  splot(layout = "gephi", edge_width = 3, label_size = 0.9,
        title = "IMDB Genre Co-occurrence (Jaccard, fractional, min 15 movies)")
```

## References

van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well‐known similarity measures. *Journal of the American Society for Information Science and Technology*, *60*(8), 1635–1651.