---
title: "YAML Tags, Anchors, and Advanced Features with yaml12"
output: rmarkdown::html_vignette
editor_options:
  markdown:
    wrap: 72
    canonical: true
vignette: >
  %\VignetteIndexEntry{YAML Tags, Anchors, and Advanced Features with yaml12}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(yaml12)
```

This vignette picks up where the “YAML in 2 Minutes” intro leaves off.
It shows what YAML tags are and how to work with them in yaml12 with
tag handlers. Along the way we also cover complex-valued keys and node
anchors, so you can work with any advanced YAML (version 1.2) you might
see in the wild.

## Tags in YAML and how yaml12 handles them

Tags annotate any YAML node with extra meaning. In YAML syntax a tag
always starts with `!`, and it appears before the node’s value; it is
not part of the scalar text itself.

*yaml12* attaches tags as a `yaml_tag` attribute (always a string). The
most common form is a simple local short tag that starts with `!`:

```{r}
dput(parse_yaml("!some_tag some_value"))
```

The presence of a custom tag bypasses normal scalar node type inference;
the scalar is always returned as a string even when the content looks
like another type.

```{r}
parse_yaml("! true")
parse_yaml("true")
```

## Using handlers to transform tagged nodes while parsing

`parse_yaml()` and `read_yaml()` accept `handlers`: a named list of
functions whose names are YAML tag strings. Handlers run on any matching
tagged node. For tagged scalars, the handler always receives a length-1
string; for tagged sequences or mappings, it receives an R vector
representing that node.

Here is an example of using a handler to evaluate `!expr` nodes.

```{r}
handlers <- list(
  "!expr" = function(x) eval(str2lang(x), globalenv())
)
parse_yaml("!expr 1+1", handlers = handlers)
```

Any errors from a handler stop parsing:

```{r, error = TRUE}
parse_yaml("!expr stop('boom')", handlers = handlers)
```

Any tag without a matching handler is preserved as a `yaml_tag`
attribute, and handlers without matching tags are left unused.

```{r}
handlers <- list(
  "!expr" = function(x) eval(str2lang(x), globalenv()),
  "!upper" = toupper,
  "!lower" = tolower # unused
)
str(parse_yaml(handlers = handlers, "
- !expr 1+1
- !upper r is awesome
- !note this tag has no handler
"))
```

With a tagged sequence, the handler is called with an unnamed R list, or
an atomic vector if `simplify = TRUE` and all the sequence elements are
a common type. With tagged mappings the handler is called with a named R
list, potentially with a `yaml_keys` attribute (more on this in the next
section).

```{r}
handlers <- list(
  "!some_seq_tag" = function(x) {
    stopifnot(identical(x, c("a", "b")))
    "some handled value"
  },
  "!some_map_tag" = function(x) {
    stopifnot(identical(x, list(key1 = 1L, key2 = 2L)))
    "some other handled value"
  }
)

yaml_tagged_containers <- "
- !some_seq_tag [a, b]
- !some_map_tag {key1: 1, key2: 2}
"

str(parse_yaml(yaml_tagged_containers, handlers = handlers))
```

Handlers make it easy to opt into powerful behaviors (like evaluating
`!expr` nodes) while keeping the default parser strict and safe.

### Post-process tags yourself

If you want more control, you can parse first without `handlers` and
then walk the result yourself.
For example, you can process `!expr`-tagged scalars yourself with a
small recursive walker like this:

```{r}
eval_yaml_expr_nodes <- function(x) {
  if (is.list(x)) {
    # `x[] <-` recurses into the elements while keeping the attributes
    # of `x` (plain `x <- lapply(...)` would drop everything but names)
    x[] <- lapply(x, eval_yaml_expr_nodes)
  } else if (identical(attr(x, "yaml_tag", TRUE), "!expr")) {
    x <- eval(str2lang(x), globalenv())
  }
  x
}

safe_loaded <- parse_yaml("!expr 1 + 1")
dput(safe_loaded)

eval_yaml_expr_nodes(safe_loaded)
```

## Mappings revisited: non-string keys and `yaml_keys`

In YAML, mapping keys do not have to be plain scalar strings; any YAML
node can be a key, including other scalar types, sequences, and even
other mappings. For example, this is valid YAML even though the key is a
boolean:

``` yaml
true: true
```

When *yaml12* sees a mapping key that is not an untagged string scalar,
it keeps the original keys in a `yaml_keys` attribute next to the
values:

```{r}
dput(parse_yaml("true: true"))
```

For complex key values, YAML uses the explicit mapping-key indicator
`?`. A line starting with `?` introduces the key node (of any type) of a
mapping, and the following line that starts with `:` holds its value:

``` yaml
? [a, b]
: value with a sequence key
? {x: 1, y: 2}
: value with a mapping key
```

In yaml12 you can see those keys via the `yaml_keys` attribute:

```{r}
yaml <- "
true: true
? [a, b]
: tuple
? {x: 1, y: 2}
: map-key
"
str(parse_yaml(yaml))
```

### Tagged mapping keys

If you supply handlers, they run on keys as well, so a handler can turn
tagged keys into friendly R names before `yaml_keys` needs to be
attached. If all the mapping keys resolve to bare scalar strings, then a
`yaml_keys` attribute is not attached.

```{r}
handlers <- list(
  "!upper" = toupper,
  "!airport" = function(x) paste0("IATA:", toupper(x))
)

yaml_tagged_key <- "
!upper newyork: !airport jfk
!upper warsaw: !airport waw
"

str(parse_yaml(yaml_tagged_key, handlers = handlers))
```

If you anticipate tagged mapping keys that you want to process yourself,
you'll need a bit more bookkeeping.
The `yaml_keys` attribute is materialized whenever any key is not a
plain, untagged string scalar; you'll want to walk those keys alongside
the values and optionally collapse `yaml_keys` back to `NULL` if all
keys become plain strings after handling tagged nodes.

For example, here is the earlier `eval_yaml_expr_nodes` expanded to also
handle tagged mapping keys. (This expanded postprocessor is equivalent
to passing
`handlers = list("!expr" = \(x) eval(str2lang(x), globalenv()))`.)

```{r}
is_bare_string <- \(x) {
  is.character(x) && length(x) == 1L && is.null(attributes(x))
}

eval_yaml_expr_nodes <- function(x) {
  if (is.list(x)) {
    # `x[] <-` keeps attributes such as `yaml_keys`; a plain
    # `x <- lapply(...)` would drop them before they are read below
    x[] <- lapply(x, eval_yaml_expr_nodes)

    if (!is.null(keys <- attr(x, "yaml_keys", TRUE))) {
      keys <- lapply(keys, eval_yaml_expr_nodes)
      # walk names and keys in parallel; fill in empty names from keys
      # that have become bare strings
      names(x) <- mapply(
        \(name, key) if (name == "" && is_bare_string(key)) key else name,
        names(x),
        keys,
        USE.NAMES = FALSE
      )
      attr(x, "yaml_keys") <-
        if (all(sapply(keys, is_bare_string))) NULL else keys
    }
  }

  if (identical(attr(x, "yaml_tag", TRUE), "!expr")) {
    x <- eval(str2lang(x), globalenv())
  }

  x
}
```

Because you control the traversal, you can add extra checks (for
example, only allowing expressions under certain mapping keys).

## Document Streams and Markers

Most YAML files contain a single YAML *document*. YAML also supports
*document streams*, where a file or string holds multiple YAML
documents. Documents are separated by a start marker (`---`) and may
optionally include an end marker (`...`).

### Reading Multiple Documents

For the reading functions (`read_yaml()`, `parse_yaml()`), the `multi`
argument defaults to `FALSE`. In this mode, only the first YAML document
is read. If an end marker (`...`) or a new start marker (`---`) is
encountered, the parser stops and returns only the first document. When
`multi = TRUE`, all documents in the stream are returned.

```{r}
doc_stream <- "
---
doc 1
---
doc 2
"
parse_yaml(doc_stream)
parse_yaml(doc_stream, multi = TRUE)
```

### Writing Multiple Documents

For the writing functions (`write_yaml()`, `format_yaml()`), `multi`
also defaults to `FALSE`, producing a single YAML document. When
`multi = TRUE`, the provided R object is treated as a list of documents
and written as a YAML document stream, with documents separated by the
start marker `---`. Regardless of `multi`, `write_yaml()` always
includes an initial start marker and a final end marker.

```{r}
write_yaml(list("foo", "bar"))
write_yaml(list("foo", "bar"), multi = TRUE)
```

When `multi = FALSE`, parsing stops after the first document, even if
later content is not valid YAML. That makes it easy to extract front
matter from files that mix YAML with other text (like R Markdown):

```{r}
rmd_lines <- c(
  "---",
  "title: Front matter only",
  "params:",
  "  answer: 42",
  "---",
  "# Body that is not YAML"
)
parse_yaml(rmd_lines)
```

Here the parser returns just the YAML frontmatter because the second
`---` technically ends the first *YAML document* in a *YAML document
stream*; with `multi = FALSE` the parser stops there and returns just
the first YAML document.

## Writing YAML with tags

To emit a tag, attach `yaml_tag` to an R value before calling
`format_yaml()` or `write_yaml()`.

```{r}
tagged <- structure("1 + x", yaml_tag = "!expr")
write_yaml(tagged)
```

## Anchors

Anchors (`&id`) name a node; aliases (`*id`) copy it. yaml12 resolves
aliases before returning R objects.

```{r}
str(parse_yaml("
recycle-me: &anchor-name
  a: b
  c: d
recycled:
  - *anchor-name
  - *anchor-name
"))
```

## Debugging

If you want to inspect how YAML nodes are parsed directly, you can reach
for the internal helper `yaml12:::dbg_yaml()` to print the raw (Rust)
`saphyr::Yaml` structures without converting to R objects.

## (Very) Advanced Tags

The following are some YAML features that are rarely used, but are
supported for 100% compliance with the YAML 1.2 spec.

### Tag directives (`%TAG`)

YAML lets you declare tag handles at the top of a document with `%TAG`
directives. The syntax is `%TAG !name! prefix`, and it applies to the
rest of the document. The `!name!` handle is then automatically expanded
in named tags:

``` yaml
%TAG !e! tag:example.com,2024:widgets/
---
item: !e!gizmo
```

Here the tag handle `!e!` is automatically expanded to the full prefix
upon parsing.

```{r}
dput(parse_yaml('
%TAG !e! tag:example.com,2024:widgets/
---
item: !e!gizmo foo
'))
```

You can also declare a global tag prefix, which expands a bare `!`:

```{r}
dput(parse_yaml('
%TAG ! tag:example.com,2024:widgets/
---
item: !gizmo foo
'))
```

### TAG URIs

The above two forms are actually shorthands for resolving a tag "URI".
You can bypass handle resolution by using the following tag syntax:
`!<...>`. Anything in `...` will not be expanded, but must be a valid
URI (e.g., spaces must be escaped, like in a URL).

```{r}
dput(parse_yaml('
%TAG ! tag:example.com,2024:widgets/
---
item: !<gizmo> foo
'))
```

### Core schema tags

You may also encounter tags that start with `!!`. This is a special case
of the `!name!suffix` tag syntax, where `name` is empty and the handle
implicitly resolves to the YAML Core schema prefix:
`tag:yaml.org,2002:`. The following three nodes all resolve to the same
internal representation and parse the same way:

```{r}
'
- foo
- !!str foo
- !<tag:yaml.org,2002:str> foo
' |> parse_yaml() |> dput()
```

Core schema tags are generally unnecessary since all nodes are resolved
using the core schema already. However, they can be an alternative way
to declare node types. The valid core schema tags are `map`, `seq`,
`str`, `int`, `float`, `bool`, and `null`.

Note that YAML 1.2
[removed](https://yaml.org/spec/1.2.2/ext/changes/) some built-in types
that were present in YAML 1.1.

> The !!pairs, !!omap, !!set, !!timestamp and !!binary types have been
> dropped.


Correspondingly, in *yaml12* these formerly core tags come into R like
any other unhandled tagged scalar: as strings with a `yaml_tag`
attribute.

Note that `pairs`, `omap`, and `set` are generally not meaningful in R,
since all the R objects returned are ordered (and *yaml12* automatically
preserves the order of mapping entries). The
[`!!timestamp`](https://yaml.org/type/timestamp.html) and
[`!!binary`](https://yaml.org/type/binary.html) tags are occasionally
useful, but the logic for handling richer types is encouraged to live at
the application level, and not in the core schema.

Note that the parser expands the `!!` handle to `tag:yaml.org,2002:`
(unless a tag directive changed the meaning of `!!`), so these tags come
in with a fully resolved core schema URI.

```{r}
yaml <- "
- !!timestamp 2025-01-01
- !!timestamp 2025-01-01 21:59:43.10 -5
- !!binary UiBpcyBBd2Vzb21l
"
str(parse_yaml(yaml))
```

You can supply a handler for them if you want to convert them from a
character string to some other R object:

```{r}
# Timestamp handler: convert date-only values into Date, otherwise try
# (some of) the timestamp formats valid in the YAML 1.1 spec as POSIX
# formats. Returns NA on failure.
timestamp_handler <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  if (grepl("^\\d{4}-\\d{2}-\\d{2}$", x)) {
    return(as.Date(x))
  }

  formats <- c(
    "%Y-%m-%dT%H:%M:%OS%z",
    "%Y-%m-%d %H:%M:%OS%z",
    "%Y-%m-%dT%H:%M:%OS",
    "%Y-%m-%d %H:%M:%OS",
    "%Y-%m-%d %H:%M"
  )
  as.POSIXct(x, tryFormats = formats, optional = TRUE)
}

# Binary handler: decode Base64 into a raw vector
binary_handler <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  jsonlite::base64_dec(gsub("[ \n]", "", x))
}
```

```{r}
str(parse_yaml(yaml, handlers = list(
  "tag:yaml.org,2002:timestamp" = timestamp_handler,
  "tag:yaml.org,2002:binary" = binary_handler
)))
```
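
Because unhandled core tags arrive as strings carrying a fully resolved
`yaml_tag`, the same tag-keyed lookup can also be applied after parsing.
The chunk below is only a minimal sketch in base R: `apply_tag_handler()`
and the `demo_*` objects are hypothetical names introduced for
illustration (they are not part of yaml12), and the nodes are built by
hand with `structure()` on the assumption that parsed values look the
way this vignette describes.

```{r}
# Hand-built stand-ins for parsed nodes (assumption: strings carrying a
# `yaml_tag` attribute, as described above for unhandled tags).
demo_nodes <- list(
  structure("2025-01-01", yaml_tag = "tag:yaml.org,2002:timestamp"),
  structure("keep me", yaml_tag = "!note"),
  "untagged"
)

# Hypothetical helper, not part of yaml12: apply the handler registered
# for a node's tag, or return the node unchanged when none matches.
apply_tag_handler <- function(node, handlers) {
  tag <- attr(node, "yaml_tag", exact = TRUE)
  if (is.null(tag)) return(node)
  handler <- handlers[[tag]]
  if (is.null(handler)) node else handler(node)
}

# Handlers are keyed by the fully resolved tag URI, not by `!!timestamp`.
demo_handlers <- list(
  "tag:yaml.org,2002:timestamp" = function(x) as.Date(as.character(x))
)

demo_result <- lapply(demo_nodes, apply_tag_handler, handlers = demo_handlers)
str(demo_result)
```

Keying the list by the resolved URI mirrors how the `handlers` argument
is used above, so a list written this way could be passed to
`parse_yaml()` unchanged.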