---
title: "Working with Arrow and Parquet"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Arrow and Parquet}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## The uint64 problem

A5 cell IDs are 64-bit unsigned integers. R has no native `uint64` type, and its `double` can only represent integers exactly up to 2^53. Nearly half of all A5 cell IDs exceed this threshold, so converting them to `double` silently corrupts the data.

This matters when reading Parquet files that store A5 cell IDs as `uint64` columns, the standard format used by DuckDB, Python, and [geoparquet.io](https://geoparquet.io/). By default, `arrow::read_parquet()` converts `uint64` to R's `double`, losing precision:

```{r naive, eval = requireNamespace("arrow", quietly = TRUE), message = FALSE}
library(arrow)
library(tibble)
library(a5R)

# A real A5 cell: Edinburgh at resolution 20
cell <- a5_lonlat_to_cell(-3.19, 55.95, resolution = 20)
a5_u64_to_hex(cell)

# Write to Parquet as uint64 (the standard interchange format)
tf <- tempfile(fileext = ".parquet")
arrow::write_parquet(
  arrow::arrow_table(cell_id = a5_cell_to_arrow(cell)),
  tf
)

# Read it back naively: arrow silently converts uint64 to double
(naive <- tibble(arrow::read_parquet(tf)))

cell_as_dbl <- naive$cell_id

# The double can't distinguish this cell from nearby IDs
cell_as_dbl == cell_as_dbl + 1   # TRUE: silent corruption
cell_as_dbl == cell_as_dbl + 100 # still TRUE
```

## The solution: `a5_cell_from_arrow()` and `a5_cell_to_arrow()`

a5R provides two functions that bypass the lossy `double` conversion entirely, using Arrow's zero-copy `View()` to reinterpret the raw bytes:

```{r bridge, eval = requireNamespace("arrow", quietly = TRUE)}
library(a5R)
library(tibble)

# Six cities across the globe; some will have bit 63 set (origin >= 6)
cities <- tibble(
  name = c("Edinburgh", "Tokyo", "São Paulo",
           "Nairobi", "Anchorage", "Sydney"),
  lon = c(  -3.19, 139.69, -46.63,  36.82, -149.90, 151.21),
  lat = c(  55.95,  35.69, -23.55,  -1.29,   61.22, -33.87)
)

cities$cell <- a5_lonlat_to_cell(cities$lon, cities$lat, resolution = 10)
cities
```

These cells work seamlessly in tibbles. Now let's enrich the data with two A5 operations, cell resolution and distance from Edinburgh:

```{r enrich, eval = requireNamespace("arrow", quietly = TRUE)}
edinburgh <- cities$cell[1]

cities$resolution <- a5_get_resolution(cities$cell)
cities$dist_from_edinburgh_km <- as.numeric(
  a5_cell_distance(cities$cell, rep(edinburgh, nrow(cities)), units = "km")
)
cities
```

## Writing and reading Parquet

Convert to an Arrow table and write to Parquet. The cell column is stored as native `uint64`, the same binary format used by DuckDB, Python, and geoparquet.io:

```{r parquet_write, eval = requireNamespace("arrow", quietly = TRUE)}
tf <- tempfile(fileext = ".parquet")

arrow_tbl <- arrow::arrow_table(
  name = cities$name,
  cell_id = a5_cell_to_arrow(cities$cell),
  cell_res = cities$resolution,
  dist_from_edinburgh_km = cities$dist_from_edinburgh_km
)
arrow_tbl$schema

arrow::write_parquet(arrow_tbl, tf)
```

Read it back as an Arrow table (`as_data_frame = FALSE`) so the `uint64` column is never converted to `double`. `a5_cell_from_arrow()` then recovers the exact cell IDs without any precision loss:

```{r parquet_read, eval = requireNamespace("arrow", quietly = TRUE)}
pq <- arrow::read_parquet(tf, as_data_frame = FALSE)

# Recover cells from the uint64 column, bind with the rest of the data.
# Table$column() is 0-based, so index 1 is the second column, cell_id.
recovered_cells <- a5_cell_from_arrow(pq$column(1))
result <- as.data.frame(pq)
result$cell <- recovered_cells
result <- tibble::as_tibble(result[c("name", "cell", "cell_res", "dist_from_edinburgh_km")])
result
```

Verify the round-trip is lossless:

```{r verify, eval = requireNamespace("arrow", quietly = TRUE)}
identical(format(cities$cell), format(result$cell))
```

## How it works under the hood

1. **`a5_cell_to_arrow()`**: packs the eight raw-byte fields into 8-byte little-endian blobs (one per cell), creates an Arrow `fixed_size_binary(8)` array, then uses `View(uint64)` to reinterpret the bytes as unsigned 64-bit integers without copying them.

2. **`a5_cell_from_arrow()`**: does the reverse. It calls `View(fixed_size_binary(8))` on the `uint64` array to get the raw bytes, then unpacks each 8-byte blob into the eight raw-byte fields used by `a5_cell`.

The raw bytes never pass through `double`, so there is no precision loss at any step. See `vignette("internal-cell-representation")` for details on the raw-byte representation.
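
To see why keeping the IDs as raw bytes matters, here is a minimal base-R sketch. It is not a5R's implementation and does not use Arrow. The helper names `hex_to_le_bytes()`, `le_bytes_to_hex()`, and `le_bytes_to_double()` are hypothetical, introduced only for this illustration. The sketch compares two adjacent 64-bit IDs, once as 8-byte little-endian blobs and once after a lossy conversion to `double`:

```r
# Doubles stop distinguishing consecutive integers at 2^53
2^53 == 2^53 + 1
#> [1] TRUE

# Hypothetical helper: 16-digit hex string -> 8 raw bytes, little-endian
hex_to_le_bytes <- function(hex) {
  stopifnot(nchar(hex) == 16L)
  starts <- seq(1L, 15L, by = 2L)
  big_endian <- as.raw(strtoi(substring(hex, starts, starts + 1L), base = 16L))
  rev(big_endian)
}

# Hypothetical helper: 8 little-endian raw bytes -> 16-digit hex string
le_bytes_to_hex <- function(bytes) {
  paste(sprintf("%02x", as.integer(rev(bytes))), collapse = "")
}

# Hypothetical helper: the lossy path, collapsing the bytes into one double
le_bytes_to_double <- function(bytes) {
  sum(as.integer(bytes) * 256^(0:7))
}

# Two adjacent 64-bit values with bit 63 set (illustrative, not real cell IDs)
bytes_a <- hex_to_le_bytes("8000000000000000")
bytes_b <- hex_to_le_bytes("8000000000000001")

# As raw bytes the two values stay distinct and round-trip exactly
identical(bytes_a, bytes_b)
#> [1] FALSE
le_bytes_to_hex(bytes_b)
#> [1] "8000000000000001"

# As doubles they collapse to the same number
le_bytes_to_double(bytes_a) == le_bytes_to_double(bytes_b)
#> [1] TRUE
```

The byte vectors play the role of the `fixed_size_binary(8)` values described above: as long as each ID stays in that form, adjacent IDs remain distinguishable.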