What is rsonpath?

The rsonpath project comprises two parts:

  • the CLI tool rq for blazingly fast querying of JSON files from the command line; and
  • the underlying rsonpath-lib Rust crate allowing one to run structure-aware JSONPath queries on JSONs as easily as one would query a string with the regex crate.

It is both a production-ready app and crate, and a research project, aiming to be the fastest possible JSONPath implementation in a streaming setting.

It is perfectly suited for command line use, able to handle files that would not fit into the main memory. Its minimal memory footprint makes the crate a great choice where one does not want to pay in allocations to extract data from a JSON payload.

Why choose rsonpath?

If you work with JSONs a lot, a CLI tool for extracting data from your files or API call responses can be invaluable.

The most popular tool for working with JSONs is jq, but it has its shortfalls:

  • it is extremely slow 1;
  • it has a massive memory overhead 2;
  • its query language is non-standard.

To be clear, jq is a great and well-tested tool, and rq does not directly compete with it. If one could describe jq as a “sed or awk for JSON”, then rq would be a “grep for JSON”. It does not allow you to slice and reorganize JSON data like jq, but instead outclasses it in filtering and querying applications.

rq

The rq CLI app can process JSON documents streamed into stdin or from a file, outputting query matches to stdout. It has a minimal memory footprint and processes the input as a stream, maximizing performance.

When to choose rq?

  • when you need a general-purpose JSONPath CLI tool; OR
  • when working with really big JSON files (gigabytes of size), where other tools take too long or run out of memory; OR
  • when the input is a stream with possibly long delays between chunks, for example a network socket.

When does rq fall short?

  • when Unicode escape sequences are used (issue #117);
  • when advanced JSONPath selectors are required (area: selector issues);
  • when targeting a no-std environment 3.

rsonpath-lib

The rsonpath-lib crate is a JSONPath library serving as the backend of rq. It is a separate product, providing a wider API surface and extensibility options beyond those of the CLI.

When to choose rsonpath-lib?

  • when an application spends a lot of time querying JSON documents (or parsing those for querying); OR
  • when the application only needs to parse and create a DOM for parts of the JSON that are extracted by a query; OR
  • when a minimal memory footprint of JSON processing is desired; OR
  • when JSON data comes in a stream (a Read impl) and can be queried in-flight; OR
  • when a tested JSONPath parser is needed for custom JSONPath processing.

When does rsonpath-lib fall short?

  • when the entire JSON document needs to be parsed into an in-memory model anyway for further processing 4;
  • when Unicode escape sequences are used (issue #117);
  • when advanced JSONPath selectors are required (area: selector issues);
  • when targeting a no-std environment 3.

1

Even on queries adversarial to rq it can be up to faster than jq, which takes over a second to process a MB file.

2

jq can consume upwards of the size of the JSON document in heap memory. On a MB file it reaches a peak of MB. On the same file and query rq uses KB – a minuscule fraction of the file size.

3

As far as we are aware there are no Rust JSON query engines that would target no-std. It would be possible for rsonpath to require only alloc and no std – if this is a feature you would like to see, please let us know.

4

Performance gains of rsonpath are nullified then, since there is no benefit of a rapid, low memory query processor if the full document gets parsed later anyway. In such a case, serde_json_path or a different crate could suit one better. Note that restricting parsing to fragments of a document returned by a filtering query can still yield important gains.

Who is this book for?

The book is divided into three parts, each targeted at a different audience.

  1. Part I – CLI User Guide is aimed at users of rq. It covers installation and basic usage, JSONPath query language reference, and advanced tips and tricks on juicing every last bit of performance out of rq.

  2. Part II – Library User Guide is aimed at developers looking to utilize rsonpath-lib in their projects. It contains an overview of the API surface and a breakdown of all configuration knobs that can be tuned for performance.

  3. Part III – Developer Guide is aimed at developers looking to contribute to rsonpath. It goes into the nitty-gritty details of the codebase, but should still be relatively broad and approachable.

Regardless of which category of users you fall into, you need to understand JSONPath queries. We describe the semantics in JSONPath Reference.

Authors

See Acknowledgements for references, citations, and special thanks.

This book is maintained as part of the rsonpath project, and thus is a collective work of the contributors. The reader should note, however, that I, Mateusz Gienieczko, am the primary editor, and I take responsibility for the contents within.

Unless otherwise specifically stated, all the contents of the book are licensed under the MIT license, excluding any external content to which hyperlinks appear in the text.

Introduction

This part of the book will describe everything you need to know to utilize rq in your workload, starting from installation and basic usage, all the way to fine-tuning its performance using advanced configuration.

Installation

Currently, the easiest way to get rq is from the latest GitHub release. We have a binary for each Tier 1 Rust target.

Verifying provenance

All of our binary distributions implement SLSA level 3. What that means is that any official rq binary can be verified to have been built from a specific version of rsonpath source with our official GitHub Release CI. This is called provenance.

To verify provenance you should investigate the multiple.intoto.jsonl file available in the GitHub release (in the standard in-toto format), using the slsa-verifier tool.

For example, to verify the rq-x86_64-unknown-linux-gnu binary for version v0.8.0, run:

# --provenance-path: path to the released provenance file.
# --source-uri: our repository URL. This is case sensitive!
# --source-versioned-tag: version tag of our release, in the format v#.#.#
# Final argument: path to the binary to verify.
$ slsa-verifier verify-artifact \
    --provenance-path ./multiple.intoto.jsonl \
    --source-uri github.com/V0ldek/rsonpath \
    --source-versioned-tag v0.8.0 \
    ./rq-x86_64-unknown-linux-gnu
Verified signature against tlog entry index 34193532 at URL: https://rekor.sigstore.dev/api/v1/log/entries/24296fb24b8ad77a576a14ffb58e0477203bcd311b396b9a4c8c3cc66484053a451b67faf87c1542
Verified build using builder "https://github.com/slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@refs/tags/v1.9.0" at commit 5e6d505182213df857c2b1cb026abf79cf3b54df
Verifying artifact ./rq-x86_64-unknown-linux-gnu: PASSED

PASSED: Verified SLSA provenance

PASSED guarantees that this is a properly signed, untampered-with binary generated from our repository at a given version tag. It can be safely run on your system. To verify it works, check if rq is available from your command line:

$ rq -V
rq 0.9.4

Package managers

When released, rq will be available as a package in more distributions, but currently you can install it via cargo.

Install with cargo

The rq binary is contained in the rsonpath crate.

cargo install rsonpath

Manual build for maximum performance

The packaged installation methods are portable and the same executable can be safely shared between different machines with the same basic architecture (x86, ARM).

Building rq for a specific CPU makes it not portable, but creates code explicitly optimized for the machine it is built on, enabling better performance.

Building from source

Building from source requires your machine to have the Rust tooling available. We default to linking with lld, so you need that as well.

First, clone the rsonpath repository:

git clone https://github.com/V0ldek/rsonpath.git

Building and installing is done most easily with just:

just install-native

Without just one can use:

RUSTFLAGS="-C target-cpu=native" cargo install --path ./crates/rsonpath

Building from crates.io

You can enable native CPU codegen when installing from crates.io as well, by overriding rustc flags.

RUSTFLAGS="-C target-cpu=native" cargo install rsonpath

Verifying native optimizations are enabled

To verify that your rq installation has native CPU support, consult rq --version and look for target-cpu=native in the “Codegen flags” field.

$ rq --version
rq 0.9.1

Commit SHA:      05ced6146b2dcc4e474f2dbc17c2e6d0986a7181
Features:        default,simd
Opt level:       3
Target triple:   x86_64-unknown-linux-gnu
Codegen flags:   target-cpu=native,link-arg=-fuse-ld=lld
SIMD support:    avx2;fast_quotes;fast_popcnt

Usage

Running rq requires a JSONPath query, and a valid JSON input. The query is always provided inline, while the input can come from a file, standard input, or an argument.

Input mode

The rq app supports three different input sources.

Input from file

The primary input mode is from a JSON file specified as the second positional argument. For example, if there’s a file in the current directory called ex.json with the contents:

{
    "values": [
        {
            "key": "key1",
            "value": "value1"
        },
        {
            "key": "key2",
            "value": "value2"
        }
    ]
}

then we can run the query by specifying ./ex.json as the file path:

$ rq '$..[*].key' ./ex.json
"key1"
"key2"

Inline input

JSON can be passed directly with the --json argument:

$ rq '$..*' --json '{ "a": 42, "b": "val" }'
42
"val"

This is sometimes more ergonomic when the document is very small.

Input from stdin

If an input is not provided by other means, rq reads from standard input.

Note: if the input is a file, it is always more efficient to provide it as a path than to pipe it to rq’s standard input. Doing cat $file | rq $query is an antipattern.

Output mode

By default rq outputs all matched values, in the order they occur in the document. Of note is that the original formatting is preserved. For example, if pretty.json contains:

{
    "key": {
        "contents": 0
    }
}

then extracting the nested object will result in:

$ rq '$.key' ./pretty.json
{
        "contents": 0
    }

You can see all the original whitespace preserved 1.

Count result mode

Sometimes the concrete matches are not interesting, and we only want to count how many matches there are. This can be done much more efficiently than full matches, and can be enabled by passing count to the --result flag (or its -r shorthand).

$ rq '$[*]' --json '[0,1,2,3]' -r count
4

Indices result mode

There is also a result mode that outputs the byte offset of each match in the input document. This is sometimes useful when you have access to the file and want to perform post-query custom parsing on the values by correlating the indices with the original file.

$ rq '$[*]' --json '[0,1,2,3]' -r indices
1
3
5
7
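
Correlating the reported indices with the original input can be sketched in a few lines of Rust. This is an illustration only: number_at is a hypothetical helper, not part of rq or rsonpath-lib, and it assumes that each offset points at the first byte of a matched value (as in the example above) and that the matched values are JSON numbers.

```rust
/// Returns the JSON number literal starting at `offset`, scanning forward
/// until the first character that cannot be part of a number.
/// Hypothetical helper for illustration; it panics if `offset` is out of range.
fn number_at(document: &str, offset: usize) -> &str {
    let rest = &document[offset..];
    let end = rest
        .find(|c: char| !(c.is_ascii_digit() || matches!(c, '-' | '+' | '.' | 'e' | 'E')))
        .unwrap_or(rest.len());
    &rest[..end]
}

fn main() {
    let document = "[0,1,2,3]";
    // Offsets as printed by the `-r indices` run shown above.
    let offsets = [1, 3, 5, 7];
    for offset in offsets {
        println!("{}: {}", offset, number_at(document, offset));
    }
}
```

For other kinds of values (strings, objects, lists) the post-query parsing would need a proper JSON parser started at the given offset.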

Advanced input options

There are many different ways in which rq could read the provided input. By default it tries its best to decide on the best method. For example, in file mode it uses memory maps when the files are large.

This might be problematic if memory maps are not available on your machine, or are very slow for some reason. In that case you can manually override the input mode with the --force-input argument.

The three modes available are:

  • mmap – always use memory maps;
  • eager – read entire contents of the file or stdin to memory, run the query after; this makes sense for input documents that are not excessively large;
  • buffered – read the contents in a buffered manner; this is good for inputs that are very large or have low write throughput.
1

Reformatting the output would decrease performance, and doing it quickly (by rsonpath standards) would take a lot of effort. It is not impossible, however; if this is a serious issue for your use case, please raise an issue.

JSONPath reference

Regardless of whether you want to use rq, the rsonpath-lib library, or contribute to the project, you should be familiar with JSONPath, the core query language we use to process JSONs.

The JSONPath language is defined by an IETF specification, currently in draft. The rsonpath project implements a subset of the language according to the spec with two major differences outlined in rsonpath-specific behavior.

The below reference uses terminology from the spec, but tries to use less dry language. If you already know the spec, you can probably skip this chapter.

JSONs as trees

A JSON document is a tree structure, defined in the intuitive way. A node is either an atomic value, i.e. a number, string, true, false, or null, or a complex value, i.e. an object or a list.

An object is a collection of members identified by member names or keys. Each member name has a single child node associated. A list is an ordered collection of child nodes identified by a zero-based index.
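
The tree model described above can be written down as a small Rust enum. This is only a sketch to make the terminology concrete: the Json type and its helper methods are hypothetical names and are not part of rsonpath-lib, which processes the input as a stream and does not need such an in-memory model.

```rust
/// A JSON node: either an atomic value or a complex value.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    /// Ordered child nodes, identified by a zero-based index.
    List(Vec<Json>),
    /// Members: each member name has a single child node associated.
    Object(Vec<(String, Json)>),
}

impl Json {
    /// Atomic values have no children.
    fn is_atomic(&self) -> bool {
        !matches!(self, Json::List(_) | Json::Object(_))
    }

    /// Number of immediate children of this node.
    fn child_count(&self) -> usize {
        match self {
            Json::List(items) => items.len(),
            Json::Object(members) => members.len(),
            _ => 0,
        }
    }
}

/// Builds the tree for { "a": 42, "b": [ 1, 2 ] }.
fn example() -> Json {
    Json::Object(vec![
        ("a".to_string(), Json::Number(42.0)),
        ("b".to_string(), Json::List(vec![Json::Number(1.0), Json::Number(2.0)])),
    ])
}

fn main() {
    let doc = example();
    println!("root is atomic: {}", doc.is_atomic());
    println!("root has {} children", doc.child_count());
}
```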

Anatomy of a query

A JSONPath query, in essence, defines a pattern that a path in a JSON must match for the node at that path to be selected. The simplest query is a sequence of keys.

$.a.b.c.d

It will access the value of the "a" key in the root, then the value under the "b" key in that object, then the value under "c", and finally the value under "d". For example, in the JSON:

{
    "a": { "b": { "c": { "d": 42 } } }
}

it will access the value 42 by digging into the structure key by key.

$ rq '$.a.b.c.d' --json '{ "a": { "b": { "c": { "d": 42 } } } }'
42
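
To make "digging into the structure key by key" concrete, here is a minimal Rust sketch that evaluates a sequence of keys over a tiny in-memory tree. The Node type and select_path function are hypothetical and exist only for this illustration; this is not how rsonpath executes queries.

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Num(i64),
    Obj(Vec<(&'static str, Node)>),
}

/// Follows a sequence of member names from the root, one level per name.
/// Returns None as soon as a name is missing or the current node is not an object.
fn select_path<'a>(root: &'a Node, names: &[&str]) -> Option<&'a Node> {
    let mut current = root;
    for name in names {
        match current {
            Node::Obj(members) => {
                current = members
                    .iter()
                    .find(|(key, _)| key == name)
                    .map(|(_, child)| child)?;
            }
            Node::Num(_) => return None,
        }
    }
    Some(current)
}

/// Builds the tree for { "a": { "b": { "c": { "d": 42 } } } }.
fn example() -> Node {
    Node::Obj(vec![(
        "a",
        Node::Obj(vec![("b", Node::Obj(vec![("c", Node::Obj(vec![("d", Node::Num(42))]))]))]),
    )])
}

fn main() {
    let doc = example();
    // Corresponds to the query $.a.b.c.d
    println!("{:?}", select_path(&doc, &["a", "b", "c", "d"]));
}
```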

In general, a JSONPath query is a sequence of segments. Each segment contains one or more selectors. Canonically, selectors are delimited within square brackets, but some selectors have a shorthand dot-notation. For example, the query above is equivalent to:

$['a']['b']['c']['d']

$ rq "$['a']['b']['c']['d']" --json '{ "a": { "b": { "c": { "d": 42 } } } }'
42

A valid query starts with the $ character, which represents the root of the JSON. In particular, the query $ simply selects the entire document.

Segments

There are two types of segments:

  • child segment selects immediate children, or, in other words, digs into the structure of the document one level deeper. A child segment is either a bracketed sequence of selectors [<sel1>, ..., <selN>], or a shorthand dot notation like .a or .*.

  • descendant segment selects any subdocument, or, in other words, digs into the structure of the document at any level deeper. A descendant segment is either a bracketed sequence of selectors preceded by two dots ..[<sel1>, ..., <selN>], or a shorthand double-dot notation like ..a or ..*.

Selectors

Note that we only cover selectors that are currently supported by rsonpath. Issues to support more selectors can be found under the area: selector label.

Name selector

The name selector selects the child node under a given member name. It’s most commonly found under its shorthand form, .key or ..key, which works with simple alphanumeric member names.

In the canonical form, the name has to be enclosed between single or double quotes, and enables escape sequences. For example:

  • .a, ['a'], ["a"] all select a child under the key a.
  • ['"'] selects a child under the key ".
  • ["'"] selects a child under the key '.
  • ['complex name'] selects a child under the key containing a space:

$ rq "$['complex name']" --json '{ "complex name": 42 }'
42

Wildcard selector

The wildcard selector selects any child node, be it under a member name in an object, or a value in a list. It also has a common shorthand form, .* or ..*, whereas the canonical form is [*]. For example, running on:

{
    "a": 42,
    "b": [ 1, 2 ]
}

the query $[*] selects 42, and [ 1, 2 ].

$ rq '$[*]' --json '{ "a": 42, "b": [ 1, 2 ] }'
42
[ 1, 2 ]

Using the descendant segment we can recursively extract elements from the list:

$ rq '$..[*]' --json '{ "a": 42, "b": [ 1, 2 ] }'
42
[ 1, 2 ]
1
2
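
The difference between the child wildcard and the descendant wildcard can be sketched over a tiny in-memory tree. The Node type and both functions are hypothetical names used only for this illustration, and the sketch does not reflect how rsonpath evaluates queries internally. Matches are emitted in the order they occur in the document.

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Num(i64),
    List(Vec<Node>),
    Obj(Vec<(&'static str, Node)>),
}

/// Child wildcard `[*]`: every immediate child of `node`.
fn child_wildcard(node: &Node) -> Vec<&Node> {
    match node {
        Node::Num(_) => Vec::new(),
        Node::List(items) => items.iter().collect(),
        Node::Obj(members) => members.iter().map(|(_, child)| child).collect(),
    }
}

/// Descendant wildcard `..[*]`: every node below `node`, at any depth,
/// listed in the order the nodes occur in the document.
fn descendant_wildcard(node: &Node) -> Vec<&Node> {
    let mut result = Vec::new();
    for child in child_wildcard(node) {
        result.push(child);
        result.extend(descendant_wildcard(child));
    }
    result
}

/// Builds the tree for { "a": 42, "b": [ 1, 2 ] }.
fn example() -> Node {
    Node::Obj(vec![
        ("a", Node::Num(42)),
        ("b", Node::List(vec![Node::Num(1), Node::Num(2)])),
    ])
}

fn main() {
    let doc = example();
    println!("$[*] selects {} nodes", child_wildcard(&doc).len());
    println!("$..[*] selects {} nodes", descendant_wildcard(&doc).len());
}
```

On the example document the first function returns two nodes and the second returns four, matching the two rq runs above.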

In general, the query ..* selects all subdocuments of the JSON. It’s not a smart query, as it can create outputs much longer than the source document itself, consuming a lot of resources.

Index selector

The index selector selects a value from a list at a given zero-based index. It only has a bracketed form, [index]. For example, running on:

[ 1, 2, 3 ]

  • the query $[0] selects 1;
  • the query $[1] selects 2;
  • the query $[2] selects 3; and
  • the query $[3] selects nothing, since the list has only 3 elements.

$ rq '$[0]' --json "[ 1, 2, 3 ]"
1

$ rq '$[1]' --json "[ 1, 2, 3 ]"
2

$ rq '$[2]' --json "[ 1, 2, 3 ]"
3

$ rq '$[3]' --json "[ 1, 2, 3 ]"

Combining segments

Segments can be chained arbitrarily to create complex queries. For example, if we have a file ex.json

{
    "firstName": "John",
    "lastName": "Doe",
    "number": "078-05-1120",
    "phoneNumbers": [
        {
            "type": "work",
            "number": "0123-4567-8888"
        },
        {
            "type": "home",
            "number": "0123-4567-8910"
        }
    ],
    "spouse": {
        "firstName": "Jane",
        "lastName": "Doe",
        "number": "078-05-1121",
        "phoneNumbers": [
            {
                "type": "work",
                "number": "0123-4567-9999"
            },
            {
                "type": "home",
                "number": "0123-4567-8910"
            }
        ]
    }
}

we can extract all phone numbers with:

$ rq '$..phoneNumbers[*].number' ./ex.json
"0123-4567-8888"
"0123-4567-8910"
"0123-4567-9999"
"0123-4567-8910"

Note that each part of the query is needed here:

  • the first segment is descendant, so that we pick up both the root’s “phoneNumbers” array and the one under “spouse”;
  • without specifying the “phoneNumbers” key (for example running $..number) we wouldn’t be able to filter out the two irrelevant “number” keys;
  • the wildcard selector [*] makes sure we select all the numbers, regardless of how long the list may be.

Selector availability

Not all of JSONPath’s functionality is supported by rsonpath as of right now.

Supported segments

Segment                        Syntax                           Supported  Since   Tracking Issue
Child segment (single)         [<selector>]                     ✔️         v0.1.0  -
Child segment (multiple)       [<selector1>,...,<selectorN>]    -          -       -
Descendant segment (single)    ..[<selector>]                   ✔️         v0.1.0  -
Descendant segment (multiple)  ..[<selector1>,...,<selectorN>]  -          -       -

Supported selectors

Selector                                  Syntax                    Supported  Since   Tracking Issue
Root                                      $                         ✔️         v0.1.0  -
Name                                      .<member>, [<member>]     ✔️         v0.1.0  -
Wildcard                                  .*, ..*, [*]              ✔️         v0.4.0  -
Index (array index)                       [<index>]                 ✔️         v0.5.0  -
Index (array index from end)              [-<index>]                -          -       -
Array slice (forward, positive bounds)    [<start>:<end>:<step>]    -          -       #152
Array slice (forward, arbitrary bounds)   [<start>:<end>:<step>]    -          -       -
Array slice (backward, arbitrary bounds)  [<start>:<end>:-<step>]   -          -       -
Filters – existential tests               [?<path>]                 -          -       #154
Filters – const atom comparisons          [?<path> <binop> <atom>]  -          -       #156
Filters – logical expressions             &&, ||, !                 -          -       -
Filters – nesting                         [?<expr>[?<expr>]...]     -          -       -
Filters – arbitrary comparisons           [?<path> <binop> <path>]  -          -       -
Filters – function extensions             [?func(<path>)]           -          -       -

rsonpath-specific behavior

We try to implement the JSONPath spec as closely as possible. There are currently two major differences between rsonpath’s JSONPath and the standard.

Nested descendant segments

The standard semantics of the descendant segment lead to duplicated results, and a potentially exponential blowup in execution time and output size. In rsonpath we diverge from the spec to guarantee unduplicated results:

$ rq '$..a..a' --json '{ "a": { "a": { "a": 42 } } }'
{ "a": 42 }
42

In standard semantics the value 42 would be matched twice1.
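
The duplication can be reproduced with a small Rust model of the standard semantics. This is a sketch under stated assumptions: Node, descendant_name and dedup_by_identity are hypothetical names, the model handles only objects and numbers, and removing duplicates after the fact is just a way to show the difference in results, not a description of how rsonpath works internally.

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Num(i64),
    Obj(Vec<(&'static str, Node)>),
}

/// Collects `node` and all of its descendants, parents before children.
fn descendants_or_self<'a>(node: &'a Node, out: &mut Vec<&'a Node>) {
    out.push(node);
    if let Node::Obj(members) = node {
        for (_, child) in members {
            descendants_or_self(child, out);
        }
    }
}

/// Standard semantics of `..name`: for every input node, visit the node and all
/// of its descendants, select the child under `name`, and concatenate the results.
fn descendant_name<'a>(inputs: &[&'a Node], name: &str) -> Vec<&'a Node> {
    let mut result = Vec::new();
    for &input in inputs {
        let mut visited = Vec::new();
        descendants_or_self(input, &mut visited);
        for node in visited {
            if let Node::Obj(members) = node {
                for (key, child) in members {
                    if *key == name {
                        result.push(child);
                    }
                }
            }
        }
    }
    result
}

/// Keeps only the first occurrence of each node, compared by identity.
fn dedup_by_identity<'a>(nodes: Vec<&'a Node>) -> Vec<&'a Node> {
    let mut unique: Vec<&'a Node> = Vec::new();
    for node in nodes {
        if !unique.iter().any(|seen| std::ptr::eq(*seen, node)) {
            unique.push(node);
        }
    }
    unique
}

/// Builds the tree for { "a": { "a": { "a": 42 } } }.
fn example() -> Node {
    Node::Obj(vec![("a", Node::Obj(vec![("a", Node::Obj(vec![("a", Node::Num(42))]))]))])
}

fn main() {
    let doc = example();
    // Evaluate $..a..a one segment at a time.
    let first = descendant_name(&[&doc], "a");
    let standard = descendant_name(&first, "a");
    println!("standard semantics: {} results", standard.len());
    println!("without duplicates: {} results", dedup_by_identity(standard).len());
}
```

In this model the standard semantics produce three results, with the value 42 appearing twice, while the list without duplicates has the same two entries as the rq output above.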

Unicode

Currently rsonpath compares JSON keys bytewise, meaning that labels using Unicode escape sequences will be handled incorrectly.

For example, the key "a" can be equivalently represented by "\u0061".

$ rq '$["a"]' --json '{ "a": 42 }'
42

The above results should be the same if either or both of the "a" characters were replaced with the "\u0061" Unicode escape sequence, but rsonpath does not support it at this time. This limitation is known and tracked at #117.
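
Why a bytewise comparison misses such keys can be seen in a short Rust sketch. The decode_unicode_escapes function is a hypothetical, deliberately simplified helper: it handles only \uXXXX escapes in the Basic Multilingual Plane, ignores surrogate pairs and other escape sequences, and is not part of rsonpath.

```rust
/// Decodes `\uXXXX` escapes (Basic Multilingual Plane only, no surrogate pairs)
/// in the contents of a JSON string. Other characters are copied unchanged.
/// Returns None if an escape is truncated or not valid hexadecimal.
fn decode_unicode_escapes(raw: &str) -> Option<String> {
    let mut decoded = String::new();
    let mut rest = raw;
    while let Some(pos) = rest.find("\\u") {
        decoded.push_str(&rest[..pos]);
        let hex = rest.get(pos + 2..pos + 6)?;
        let code = u32::from_str_radix(hex, 16).ok()?;
        decoded.push(char::from_u32(code)?);
        rest = &rest[pos + 6..];
    }
    decoded.push_str(rest);
    Some(decoded)
}

fn main() {
    let key_in_query = "a";
    let key_in_document = r"\u0061";
    // The raw bytes differ: one byte versus six bytes.
    println!(
        "bytewise equal: {}",
        key_in_query.as_bytes() == key_in_document.as_bytes()
    );
    // After decoding the escape sequence both keys are the same string.
    println!(
        "equal after decoding: {}",
        decode_unicode_escapes(key_in_document).as_deref() == Some(key_in_query)
    );
}
```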


1

The reason behind this is a bit subtle. The standard defines the result as a concatenation of lists of results of executing the rest of the query after a descendant segment, and a recursive execution of the entire query. So the inner "a" key in the example is matched first when evaluating the outermost one, and then again when evaluating the middle "a". We consider this to be counter-intuitive and undesirable.

Query optimization

Our tool’s ambition is to be the fastest JSONPath engine of all. This is done in a similar vein to regex engines, where we try to find the computationally simplest way of performing a given query. This is highly dependent on the query itself, and thus it’s possible to compromise performance by making the query less friendly to the engine.

It’s not always obvious if a query is going to be “nice” to rsonpath. In this chapter we try to outline some common ways of making a query faster by rewriting it to a different, yet equivalent, form.

The operation that rsonpath performs the fastest is looking for the first key when the query starts with a descendant name selector.

There is no way to make use of this automatically, but as a user you might have insight into the schema of the documents that are being queried. Imagine you have a query $.products[*].videoChapters that selects all video chapters from a list of products. It just so happens that in the input document the only occurrences of “videoChapters” are within the “products” list. Therefore, the query $..videoChapters would be equivalent and select exactly the same nodes.

The above example is an actual real-life case. The rewritten query is over ten times faster than the original, a full order of magnitude.

Note that this applies only when the first segment of the query is a descendant segment.
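
The equivalence of the two queries under this schema assumption can be checked on a toy document. The sketch below hand-codes both queries over a hypothetical Node type; the sample data is made up for illustration, and the code says nothing about performance, only about which nodes are selected.

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Num(i64),
    List(Vec<Node>),
    Obj(Vec<(&'static str, Node)>),
}

/// Immediate children of a node, with the member name for object members.
fn children(node: &Node) -> Vec<(Option<&str>, &Node)> {
    match node {
        Node::Num(_) => Vec::new(),
        Node::List(items) => items.iter().map(|item| (None, item)).collect(),
        Node::Obj(members) => members.iter().map(|(key, child)| (Some(*key), child)).collect(),
    }
}

/// Hand-coded equivalent of $.products[*].videoChapters
fn specific_query(root: &Node) -> Vec<&Node> {
    let mut result = Vec::new();
    for (name, products) in children(root) {
        if name != Some("products") {
            continue;
        }
        for (_, product) in children(products) {
            for (name, value) in children(product) {
                if name == Some("videoChapters") {
                    result.push(value);
                }
            }
        }
    }
    result
}

/// Hand-coded equivalent of $..videoChapters
fn descendant_query(root: &Node) -> Vec<&Node> {
    let mut result = Vec::new();
    for (name, child) in children(root) {
        if name == Some("videoChapters") {
            result.push(child);
        }
        result.extend(descendant_query(child));
    }
    result
}

/// Made-up sample: "videoChapters" occurs only inside the "products" list.
fn example() -> Node {
    Node::Obj(vec![
        ("total", Node::Num(2)),
        ("products", Node::List(vec![
            Node::Obj(vec![
                ("id", Node::Num(1)),
                ("videoChapters", Node::List(vec![Node::Num(10), Node::Num(20)])),
            ]),
            Node::Obj(vec![
                ("id", Node::Num(2)),
                ("videoChapters", Node::List(vec![Node::Num(30)])),
            ]),
        ])),
    ])
}

fn main() {
    let doc = example();
    println!("same results: {}", specific_query(&doc) == descendant_query(&doc));
}
```

If “videoChapters” also occurred somewhere outside of “products”, the descendant query would select additional nodes and the rewrite would no longer be valid.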

Omitting wildcards

The wildcard selector is relatively expensive, as it forces the engine to closely look at every value it encounters. Using reasoning similar to the one in the previous section it’s sometimes possible to eliminate a wildcard selector by either using a specific name to match, or replacing it with a descendant selector.

Take an extended query from the above example that digs into the structure to select the “chapter” key of a video chapter: $..videoChapters[*].chapter. Again, it just so happens that the query $..videoChapters..chapter is equivalent, as all “chapter” keys always occur only once underneath a “videoChapters” entry. The rewritten query will be faster.

Just as well, it is always better to make the query more specific, if possible. The query $['key']<rest> will always be faster than $[*]<rest>.

Avoiding descendant wildcards

The absolute worst query to run is $..*. It requires the engine to look at every value in the document, nullifying most optimizations. When facing performance problems, try to express your query without a descendant wildcard, if possible, or at least to restrict it to a smaller portion of the document. For example, $.key..* will be faster than $..* by itself.

Reporting an issue

We consider performance a paramount feature of rsonpath. If you’re facing queries that are excessively slow for your taste, let us know by reporting an issue (see Reporting issues) so that we can benchmark against your use case.

Reporting issues

The rsonpath project is under active development. However, we have very limited resources that can be spent on it, so user feedback is crucial to make us aware of issues and decide how to prioritize them.

Bugs

If you find a bug, report it with our bug report form on GitHub.

The most important part of a bug report is an MRE – Minimal Reproducible Example. For rsonpath this is usually the query, and the JSON document that causes the bug. It might not be possible to publish the actual document due to licensing or non-disclosure agreements, but it is often possible to narrow down the issue and publish a small, anonymized example, where keys in the query and the document are replaced with different, meaningless values.

New features/enhancements

If there is a particular feature you really need, or maybe a query that takes too long and we could optimize against, you should create an issue and let us know.

If there is an issue already open related to your problem, comment or upvote that issue to let us know to prioritize that! It’s a small thing, but really helpful.

Security vulnerabilities

Security vulnerabilities follow a separate flow, using GitHub’s private reporting system. Consult Privately reporting a security vulnerability to learn how to open a report.

Introduction

This part of the book is a work in progress.

use rsonpath::engine::{Compiler, Engine, RsonpathEngine};
use rsonpath::input::BorrowedBytes;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Parse a JSONPath query from string.
    let query = rsonpath_syntax::parse("$..phoneNumbers[*].number")?;
    // Convert the contents to the Input type required by the Engines.
    let contents = r#"
{
  "person": {
    "name": "John",
    "surname": "Doe",
    "phoneNumbers": [
      {
        "type": "Home",
        "number": "111-222-333"
      },
      {
        "type": "Work",
        "number": "123-456-789"
      }
    ]
  }
}
"#;
    let input = BorrowedBytes::new(contents.as_bytes());
    // Compile the query. The engine can be reused to run the same query on different contents.
    let engine = RsonpathEngine::compile_query(&query)?;
    // Count the number of occurrences of elements satisfying the query.
    let count = engine.count(&input)?;

    // Both "number" members under "phoneNumbers" match.
    assert_eq!(2, count);
    Ok(())
}

Introduction

This part of the book is a work in progress.

Acknowledgements

The rsonpath project was inspired by theoretical work by Corentin Barloy, Filip Murlak, and Charles Paperman in Stackless Processing of Streamed Trees.

It would not be possible to create this without prior research into SIMD-accelerated JSON processing, first by Geoff Langdale and Daniel Lemire in Parsing gigabytes of JSON per second and the simdjson project, then by Lin Jiang and Zhijia Zhao in JSONSki: streaming semi-structured data with bit-parallel fast-forwarding and the JSONSki project.

All references and citations can be found in my master’s thesis, Fast execution of JSONPath queries, and in the subsequent paper, Supporting Descendants in SIMD-Accelerated JSONPath. Both are also hosted in this repository in /pdf.

Special thanks to Filip Murlak and Charles Paperman for advising me during my thesis, when most of the fundamentals of the project were born.