What is rsonpath?
The rsonpath project comprises two parts:
- the CLI tool rq for blazingly fast querying of JSON files from the command line; and
- the underlying rsonpath-lib Rust crate allowing one to run structure-aware JSONPath queries on JSONs as easily as one would query a string with the regex crate.
It is both a production-ready app and crate, and a research project, aiming to be the fastest possible JSONPath implementation in a streaming setting.
It is perfectly suited for command line use, able to handle files that would not fit into the main memory. Its minimal memory footprint makes the crate a great choice where one does not want to pay in allocations to extract data from a JSON payload.
Why choose rsonpath?
If you work with JSONs a lot, a CLI tool for extracting data from your files or API call responses can be invaluable.
The most popular tool for working with JSONs is jq, but it has its shortfalls:
To be clear, jq is a great and well-tested tool, and rq does not directly compete with it. If one could describe jq as a “sed or awk for JSON”, then rq would be a “grep for JSON”. It does not allow you to slice and reorganize JSON data like jq, but instead outclasses it in filtering and querying applications.
rq
The rq CLI app can process JSON documents streamed into stdin or from a file, outputting query matches to stdout. It has a minimal memory footprint and processes the input as a stream, maximizing performance.
When to choose rq?
- when you need a general-purpose JSONPath CLI tool; OR
- when working with really big JSON files (gigabytes in size), where other tools take too long or run out of memory; OR
- when the input is a stream with possibly long delays between chunks, for example a network socket.
When does rq fall short?
- when Unicode escape sequences are used (issue #117);
- when advanced JSONPath selectors are required (area: selector issues);
- when targeting a no-std environment 3.
rsonpath-lib
The rsonpath-lib crate is a JSONPath library serving as the backend of rq. It is a separate product, providing a wider API surface and extensibility options beyond those of the CLI.
When to choose rsonpath-lib?
- when an application spends a lot of time querying JSON documents (or parsing those for querying); OR
- when the application only needs to parse and create a DOM for parts of the JSON that are extracted by a query; OR
- when a minimal memory footprint of JSON processing is desired; OR
- when JSON data comes in a stream (a Read impl) and can be queried in-flight; OR
- when a tested JSONPath parser is needed for custom JSONPath processing.
When does rsonpath-lib fall short?
- when the entire JSON document needs to be parsed into an in-memory model anyway for further processing 4;
- when Unicode escape sequences are used (issue #117);
- when advanced JSONPath selectors are required (area: selector issues);
- when targeting a no-std environment 3.
Even on queries adversarial to rq
it can be up to
faster than jq
, which takes over a second to process
a MB file.
jq
can consume upwards of the size of the
JSON document in heap memory. On a MB file it reaches a peak of MB.
On the same file and query rq
uses KB – a minuscule fraction of the
file size.
As far as we are aware, there are no Rust JSON query engines that target no-std. It would be possible for rsonpath to require only alloc and no std – if this is a feature you would like to see, please let us know.
Performance gains of rsonpath are nullified then, since there is no benefit to a rapid, low-memory query processor if the full document gets parsed later anyway. In such a case, serde_json_path or a different crate could suit one better. Note that restricting parsing to fragments of a document returned by a filtering query can still yield important gains.
Who is this book for?
The book is divided into three parts, each targeted at a different audience.
- Part I – CLI User Guide is aimed at users of rq. It covers installation and basic usage, JSONPath query language reference, and advanced tips and tricks on juicing every last bit of performance out of rq.
- Part II – Library User Guide is aimed at developers looking to utilize rsonpath-lib in their projects. It contains an overview of the API surface and a breakdown of all configuration knobs that can be tuned for performance.
- Part III – Developer Guide is aimed at developers looking to contribute to rsonpath. It goes into the nitty-gritty details of the codebase, but should still be relatively broad and approachable.
Regardless of which category of users you fall into, you need to understand JSONPath queries. We describe the semantics in JSONPath Reference.
Authors
See Acknowledgements for references, citations, and special thanks.
This book is maintained as part of the rsonpath project, and thus is a collective work of the contributors. The reader should note, however, that I, Mateusz Gienieczko, am the primary editor, and I take responsibility for the contents within.
Unless otherwise specifically stated, all the contents of the book are licensed under the MIT license, excluding any external content to which hyperlinks appear in the text.
Introduction
This part of the book will describe everything you need to know to utilize rq in your workload, starting from installation and basic usage, all the way to fine-tuning its performance using advanced configuration.
Installation
Currently, the easiest way to get rq is from the latest GitHub release. We have a binary for each Tier 1 Rust target.
Verifying provenance
All of our binary distributions implement SLSA level 3.
What that means is that any official rq binary can be verified to have been built from a specific version of the rsonpath source with our official GitHub Release CI. This is called provenance.
To verify provenance you should investigate the multiple.intoto.jsonl file available in the GitHub release (in the standard in-toto format), using the slsa-verifier tool.
For example, to verify the rq-x86_64-unknown-linux-gnu binary for version v0.8.0, run:
# --provenance-path: path to the released provenance file.
# --source-uri: our repository URL. This is case sensitive!
# --source-versioned-tag: version tag of our release, in the format v#.#.#
# The final argument is the path to the binary to verify.
$ slsa-verifier verify-artifact \
    --provenance-path ./multiple.intoto.jsonl \
    --source-uri github.com/V0ldek/rsonpath \
    --source-versioned-tag v0.8.0 \
    ./rq-x86_64-unknown-linux-gnu
Verified signature against tlog entry index 34193532 at URL: https://rekor.sigstore.dev/api/v1/log/entries/24296fb24b8ad77a576a14ffb58e0477203bcd311b396b9a4c8c3cc66484053a451b67faf87c1542
Verified build using builder "https://github.com/slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@refs/tags/v1.9.0" at commit 5e6d505182213df857c2b1cb026abf79cf3b54df
Verifying artifact ./rq-x86_64-unknown-linux-gnu: PASSED
PASSED: Verified SLSA provenance
PASSED guarantees that this is a properly signed, untampered-with binary generated from our repository at a given version tag. It can be safely run on your system.
To verify it works, check if rq is available from your command line:
$ rq -V
rq 0.9.4
Package managers
When released, rq will be available as a package in more distributions, but currently you can install it via cargo.
Install with cargo
The rq binary is contained in the rsonpath crate.
cargo install rsonpath
Manual build for maximum performance
The packaged installation methods are portable and the same executable can be safely shared between different machines with the same basic architecture (x86, ARM).
Building rq for a specific CPU makes it not portable, but creates code explicitly optimized for the machine it is built on, enabling better performance.
Building from source
Building from source requires your machine to have the Rust tooling available. We default to linking with lld, so you need that as well.
First, clone the rsonpath repository:
git clone https://github.com/V0ldek/rsonpath.git
Building and installing is done most easily with just:
just install-native
Without just one can use:
RUSTFLAGS="-C target-cpu=native" cargo install --path ./crates/rsonpath
Building from crates.io
You can enable native CPU codegen when installing from crates.io as well, by overriding rustc flags.
RUSTFLAGS="-C target-cpu=native" cargo install rsonpath
Verifying native optimizations are enabled
To verify that your rq installation has native CPU support, consult rq --version and look for target-cpu=native in the “Codegen flags” field.
$ rq --version
rq 0.9.1
Commit SHA: 05ced6146b2dcc4e474f2dbc17c2e6d0986a7181
Features: default,simd
Opt level: 3
Target triple: x86_64-unknown-linux-gnu
Codegen flags: target-cpu=native,link-arg=-fuse-ld=lld
SIMD support: avx2;fast_quotes;fast_popcnt
Usage
Running rq requires a JSONPath query and a valid JSON input. The query is always provided inline, while the input can come from a file, standard input, or an argument.
Input mode
The rq app supports three different input sources.
Input from file
The primary input mode is from a JSON file specified as the second positional argument. For example, if there’s a file in the current directory called ex.json with the contents:
{
"values": [
{
"key": "key1",
"value": "value1"
},
{
"key": "key2",
"value": "value2"
}
]
}
then we can run the query by specifying ./ex.json as the file path:
$ rq '$..[*].key' ./ex.json
"key1"
"key2"
Inline input
JSON can be passed directly with the --json argument:
$ rq '$..*' --json '{ "a": 42, "b": "val" }'
42
"val"
This is sometimes more ergonomic when the document is very small.
Input from stdin
If an input is not provided by other means, rq reads from standard input.
Note: if the input is a file, it is always more efficient to provide it as a path than to pipe it to rq’s standard input. Doing cat $file | rq $query is an antipattern.
Output mode
By default rq outputs all matched values, in the order they occur in the document. Of note is that the original formatting is preserved. For example, if pretty.json contains:
{
"key": {
"contents": 0
}
}
then extracting the nested object will result in:
$ rq '$.key' ./pretty.json
{
"contents": 0
}
You can see all the original whitespace preserved.1
Count result mode
Sometimes the concrete matches are not interesting, and we only want to count how many matches there are. This can be done much more efficiently than reporting full matches, and can be enabled by passing count to the --result flag (or its -r shorthand).
$ rq '$[*]' --json '[0,1,2,3]' -r count
4
Indices result mode
There is also a result mode that outputs the byte offset of each match in the input document. This is sometimes useful when you have access to the file and want to perform post-query custom parsing on the values by correlating the indices with the original file.
$ rq '$[*]' --json '[0,1,2,3]' -r indices
1
3
5
7
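As a rough illustration of correlating indices with the original input, the sketch below reads a value back from a reported offset. It assumes the offset points at the first byte of a match, as in the output above, and it only handles simple scalars (numbers, true, false, null). The helper name scalar_at is hypothetical and not part of rq or rsonpath-lib.

```rust
/// Returns the scalar JSON value that starts at byte `offset` of `document`.
///
/// Hypothetical helper for illustration only: it stops at the first comma,
/// closing bracket, closing brace, or whitespace, so it does not handle
/// strings, lists, or objects.
fn scalar_at(document: &str, offset: usize) -> Option<&str> {
    // `get` returns None if the offset is out of bounds.
    let tail = document.get(offset..)?;
    let end = tail
        .find(|c: char| c == ',' || c == ']' || c == '}' || c.is_whitespace())
        .unwrap_or(tail.len());
    if end == 0 {
        None
    } else {
        Some(&tail[..end])
    }
}
```

For the document [0,1,2,3] and the offsets 1, 3, 5, 7 reported above, scalar_at returns the slices 0, 1, 2, and 3 respectively. Matches that are strings or nested values would need a real JSON parser started at the offset.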
Advanced input options
There are many different ways in which rq could read the provided input. By default it tries its best to decide on the best method. For example, in file mode it uses memory maps when the files are large.
This might be problematic if memory maps are not available on your machine, or are very slow for some reason. In that case you can manually override the input mode with the --force-input argument.
The three modes available are:
- mmap – always use memory maps;
- eager – read entire contents of the file or stdin to memory, run the query after; this makes sense for input documents that are not excessively large;
- buffered – read the contents in a buffered manner; this is good for inputs that are very large or have low write throughput.
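To make the difference between the eager and buffered strategies concrete, here is a minimal sketch using only the Rust standard library. It is not how rq implements these modes; the function names are hypothetical, and counting a single byte stands in for running a query.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Eager strategy: read the entire input into memory, then process it.
fn count_byte_eager<R: Read>(mut input: R, needle: u8) -> io::Result<usize> {
    let mut contents = Vec::new();
    input.read_to_end(&mut contents)?;
    Ok(contents.iter().filter(|&&b| b == needle).count())
}

/// Buffered strategy: process one buffer at a time, so the whole input
/// is never held in memory at once.
fn count_byte_buffered<R: Read>(input: R, needle: u8) -> io::Result<usize> {
    let mut reader = BufReader::new(input);
    let mut total = 0;
    loop {
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break; // End of input.
        }
        total += chunk.iter().filter(|&&b| b == needle).count();
        let consumed = chunk.len();
        reader.consume(consumed);
    }
    Ok(total)
}
```

Both functions accept anything implementing Read, such as a file, standard input, or a byte slice. The eager variant needs memory proportional to the input size, while the buffered variant needs only one buffer, which is why buffering suits very large or slowly arriving inputs.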
Reformatting the output would decrease performance, and doing it quickly (by rsonpath standards) would take a lot of effort. It is not impossible, however; if this is a serious issue for your use case, please raise an issue.
JSONPath reference
Regardless of whether you want to use rq, the rsonpath-lib library, or contribute to the project, you should be familiar with JSONPath, the core query language we use to process JSONs.
The JSONPath language is defined by an IETF specification, currently in draft. The rsonpath project implements a subset of the language according to the spec with two major differences outlined in rsonpath-specific behavior.
The below reference uses terminology from the spec, but tries to use less dry language. If you already know the spec, you can probably skip this chapter.
JSONs as trees
A JSON document is a tree structure, defined in the intuitive way.
A node is either an atomic value, i.e. a number, string, true, false, or null, or a complex value, i.e. an object or a list.
An object is a collection of members identified by member names or keys. Each member name has a single child node associated. A list is an ordered collection of child nodes identified by a zero-based index.
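The tree model described above can be sketched as a Rust enum. This is an illustrative data structure only: rsonpath processes input as a stream and does not build such a tree, and the type and method names here are assumptions made for the example.

```rust
/// A JSON document modelled as a tree (illustrative, not an rsonpath type).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    // Atomic values.
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    // Complex values.
    /// Ordered child nodes identified by a zero-based index.
    List(Vec<Json>),
    /// Members in document order; each member name has one child node.
    Object(Vec<(String, Json)>),
}

impl Json {
    /// True for numbers, strings, true, false, and null.
    fn is_atomic(&self) -> bool {
        !matches!(self, Json::List(_) | Json::Object(_))
    }

    /// The child node under the given member name, if this is an object.
    fn member(&self, name: &str) -> Option<&Json> {
        match self {
            Json::Object(members) => members
                .iter()
                .find(|(key, _)| key.as_str() == name)
                .map(|(_, value)| value),
            _ => None,
        }
    }

    /// The child node at the given zero-based index, if this is a list.
    fn element(&self, index: usize) -> Option<&Json> {
        match self {
            Json::List(items) => items.get(index),
            _ => None,
        }
    }
}
```

For a tree built from { "a": 42, "b": [ 1, 2 ] }, calling member("b") and then element(1) on the result reaches the node 2, while member("missing") returns None.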
Anatomy of a query
A JSONPath query, in essence, defines a pattern that a path in a JSON must match for the node at that path to be selected. The simplest query is a sequence of keys.
$.a.b.c.d
It will access the value of the "a" key in the root, then the value under the "b" key in that object, then the value under "c", and finally the value under "d". For example, in the JSON:
{
"a": { "b": { "c": { "d": 42 } } }
}
it will access the value 42 by digging into the structure key by key.
$ rq '$.a.b.c.d' --json '{ "a": { "b": { "c": { "d": 42 } } } }'
42
In general, a JSONPath query is a sequence of segments. Each segment contains one or more selectors. Canonically, selectors are delimited within square brackets, but some selectors have a shorthand dot-notation. For example, the query above is equivalent to:
$['a']['b']['c']['d']
$ rq "$['a']['b']['c']['d']" --json '{ "a": { "b": { "c": { "d": 42 } } } }'
42
A valid query starts with the $ character, which represents the root of the JSON. In particular, the query $ simply selects the entire document.
Segments
There are two types of segments:
- child segment selects immediate children, or, in other words, digs into the structure of the document one level deeper. A child segment is either a bracketed sequence of selectors [<sel1>, ..., <selN>], or a shorthand dot notation like .a or .*.
- descendant segment selects any subdocument, or, in other words, digs into the structure of the document at any level deeper. A descendant segment is either a bracketed sequence of selectors preceded by two dots ..[<sel1>, ..., <selN>], or a shorthand double-dot notation like ..a or ..*.
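The difference in reach between the two segment types can be sketched on a small in-memory tree. This is a minimal illustration, assuming a toy Json type and hypothetical function names; it says nothing about how rsonpath evaluates segments internally.

```rust
/// A toy JSON tree with just the node kinds this example needs.
#[derive(Debug, PartialEq)]
enum Json {
    Number(i64),
    List(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Candidates of a child segment: nodes exactly one level deeper.
fn children(node: &Json) -> Vec<&Json> {
    match node {
        Json::Number(_) => Vec::new(),
        Json::List(items) => items.iter().collect(),
        Json::Object(members) => members.iter().map(|(_, value)| value).collect(),
    }
}

/// Candidates of a descendant segment: nodes at any level deeper,
/// listed in document order (each child followed by its own descendants).
fn descendants(node: &Json) -> Vec<&Json> {
    let mut found = Vec::new();
    for child in children(node) {
        found.push(child);
        found.extend(descendants(child));
    }
    found
}
```

For a tree built from { "a": 42, "b": [ 1, 2 ] }, children of the root yields two nodes, 42 and the list, while descendants yields four: 42, the list, 1, and 2.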
Selectors
Note that we only cover selectors that are currently supported by rsonpath. Issues to support more selectors can be found under the area: selector label.
Name selector
The name selector selects the child node under a given member name.
It’s most commonly found under its shorthand form, .key or ..key, which works with simple alphanumeric member names.
In the canonical form, the name has to be enclosed between single or double quotes, and enables escape sequences. For example:
- .a, ['a'], ["a"] all select a child under the key a.
- ['"'] selects a child under the key ".
- ["'"] selects a child under the key '.
- ['complex name'] selects a child under the key containing a space:
$ rq "$['complex name']" --json '{ "complex name": 42 }'
42
Wildcard selector
The wildcard selector selects any child node, be it under a member name in an object, or a value in a list. It also has a common shorthand form, .* or ..*, whereas the canonical form is [*]. For example, running on:
{
"a": 42,
"b": [ 1, 2 ]
}
the query $[*] selects 42 and [ 1, 2 ].
$ rq '$[*]' --json '{ "a": 42, "b": [ 1, 2 ] }'
42
[ 1, 2 ]
Using the descendant segment we can also recursively extract the elements of the list:
$ rq '$..[*]' --json '{ "a": 42, "b": [ 1, 2 ] }'
42
[ 1, 2 ]
1
2
In general, the query $..* selects all subdocuments of the JSON. It is rarely a good query to run, as it can create outputs much longer than the source document itself, consuming a lot of resources.
Index selector
The index selector selects a value from a list at a given zero-based index.
It only has a bracketed form, [index]. For example, running on:
[ 1, 2, 3 ]
- the query $[0] selects 1;
- the query $[1] selects 2;
- the query $[2] selects 3; and
- the query $[3] selects nothing, since the list has only 3 elements.
$ rq '$[0]' --json "[ 1, 2, 3 ]"
1
$ rq '$[1]' --json "[ 1, 2, 3 ]"
2
$ rq '$[2]' --json "[ 1, 2, 3 ]"
3
$ rq '$[3]' --json "[ 1, 2, 3 ]"
Combining segments
Segments can be chained arbitrarily to create complex queries.
For example, if we have a file ex.json with the contents:
{
"firstName": "John",
"lastName": "Doe",
"number": "078-05-1120",
"phoneNumbers": [
{
"type": "work",
"number": "0123-4567-8888"
},
{
"type": "home",
"number": "0123-4567-8910"
}
],
"spouse": {
"firstName": "Jane",
"lastName": "Doe",
"number": "078-05-1121",
"phoneNumbers": [
{
"type": "work",
"number": "0123-4567-9999"
},
{
"type": "home",
"number": "0123-4567-8910"
}
]
}
}
we can extract all phone numbers with:
$ rq '$..phoneNumbers[*].number' ./ex.json
"0123-4567-8888"
"0123-4567-8910"
"0123-4567-9999"
"0123-4567-8910"
Note that each part of the query is needed here:
- the first segment is descendant, so that we pick up both the root’s “phoneNumbers” array and the one under “spouse”;
- without specifying the “phoneNumbers” key (for example running $..number) we wouldn’t be able to filter out the two irrelevant “number” keys;
- the wildcard selector [*] makes sure we select all the numbers, regardless of how long the list may be.
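To show how chained segments narrow down the result step by step, below is a naive tree-based evaluator for the selectors covered in this chapter. It is a sketch under stated assumptions: the types and functions are hypothetical, it follows the standard node-list formulation, and it reproduces neither rsonpath's streaming execution nor its handling of nested descendant segments.

```rust
#[derive(Debug, PartialEq)]
enum Json {
    String(String),
    List(Vec<Json>),
    Object(Vec<(String, Json)>),
}

enum Selector {
    Name(&'static str),
    Wildcard,
    Index(usize),
}

enum Segment {
    Child(Selector),
    Descendant(Selector),
}

/// Applies one selector to the children of a single node.
fn select<'a>(node: &'a Json, selector: &Selector) -> Vec<&'a Json> {
    match (node, selector) {
        (Json::Object(members), Selector::Name(name)) => members
            .iter()
            .filter(|(key, _)| key.as_str() == *name)
            .map(|(_, value)| value)
            .collect(),
        (Json::Object(members), Selector::Wildcard) => {
            members.iter().map(|(_, value)| value).collect()
        }
        (Json::List(items), Selector::Wildcard) => items.iter().collect(),
        (Json::List(items), Selector::Index(index)) => items.get(*index).into_iter().collect(),
        _ => Vec::new(),
    }
}

/// Collects `node` and all of its descendants, parents before children.
fn self_and_descendants<'a>(node: &'a Json, out: &mut Vec<&'a Json>) {
    out.push(node);
    for child in select(node, &Selector::Wildcard) {
        self_and_descendants(child, out);
    }
}

/// Evaluates the segments left to right, starting from the root node.
fn evaluate<'a>(root: &'a Json, query: &[Segment]) -> Vec<&'a Json> {
    let mut current = vec![root];
    for segment in query {
        let mut next = Vec::new();
        for node in current {
            match segment {
                Segment::Child(selector) => next.extend(select(node, selector)),
                Segment::Descendant(selector) => {
                    let mut candidates = Vec::new();
                    self_and_descendants(node, &mut candidates);
                    for candidate in candidates {
                        next.extend(select(candidate, selector));
                    }
                }
            }
        }
        current = next;
    }
    current
}
```

The query $..phoneNumbers[*].number corresponds to the segments Descendant(Name("phoneNumbers")), Child(Wildcard), Child(Name("number")). After the first segment the node list holds the two phoneNumbers arrays, after the second it holds their entries, and after the third only the "number" values inside those entries, which is why the top-level "number" keys never appear in the result.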
Selector availability
Not all of JSONPath’s functionality is supported by rsonpath as of right now.
Supported segments
Segment | Syntax | Supported | Since | Tracking Issue |
---|---|---|---|---|
Child segment (single) | [<selector>] | ✔️ | v0.1.0 | |
Child segment (multiple) | [<selector1>,...,<selectorN>] | ❌ | | |
Descendant segment (single) | ..[<selector>] | ✔️ | v0.1.0 | |
Descendant segment (multiple) | ..[<selector1>,...,<selectorN>] | ❌ | | |
Supported selectors
Selector | Syntax | Supported | Since | Tracking Issue |
---|---|---|---|---|
Root | $ | ✔️ | v0.1.0 | |
Name | .<member>, [<member>] | ✔️ | v0.1.0 | |
Wildcard | .*, ..*, [*] | ✔️ | v0.4.0 | |
Index (array index) | [<index>] | ✔️ | v0.5.0 | |
Index (array index from end) | [-<index>] | ❌ | | |
Array slice (forward, positive bounds) | [<start>:<end>:<step>] | ❌ | | #152 |
Array slice (forward, arbitrary bounds) | [<start>:<end>:<step>] | ❌ | | |
Array slice (backward, arbitrary bounds) | [<start>:<end>:-<step>] | ❌ | | |
Filters – existential tests | [?<path>] | ❌ | | #154 |
Filters – const atom comparisons | [?<path> <binop> <atom>] | ❌ | | #156 |
Filters – logical expressions | &&, \|\|, ! | ❌ | | |
Filters – nesting | [?<expr>[?<expr>]...] | ❌ | | |
Filters – arbitrary comparisons | [?<path> <binop> <path>] | ❌ | | |
Filters – function extensions | [?func(<path>)] | ❌ | | |
rsonpath-specific behavior
We try to implement the JSONPath spec as closely as possible. There are currently two major differences between rsonpath’s JSONPath and the standard.
Nested descendant segments
The standard semantics of the descendant segment lead to duplicated results,
and a potentially exponential blowup in execution time and output size.
In rsonpath we diverge from the spec to guarantee duplicate-free results:
$ rq '$..a..a' --json '{ "a": { "a": { "a": 42 } } }'
{ "a": 42 }
42
In standard semantics the value 42 would be matched twice.1
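The duplication can be reproduced with a small tree-based sketch of the standard descendant semantics. The code below is an illustration under assumptions: the Json type and all function names are hypothetical, and removing duplicates by node identity is only one way to picture the duplicate-free result, not a description of rsonpath's engine.

```rust
/// A toy JSON tree with just the node kinds this example needs.
enum Json {
    Number(i64),
    Object(Vec<(String, Json)>),
}

/// Visits `node` and all of its descendants, collecting every member
/// value stored under `name`.
fn collect_named<'a>(node: &'a Json, name: &str, out: &mut Vec<&'a Json>) {
    if let Json::Object(members) = node {
        for (key, value) in members {
            if key.as_str() == name {
                out.push(value);
            }
        }
        for (_, value) in members {
            collect_named(value, name, out);
        }
    }
}

/// Standard-style `..name` segment: run the search from every input node
/// and concatenate the results, duplicates included.
fn descendant_name<'a>(inputs: &[&'a Json], name: &str) -> Vec<&'a Json> {
    let mut out = Vec::new();
    for &node in inputs {
        collect_named(node, name, &mut out);
    }
    out
}

/// Keeps only the first occurrence of each node, compared by identity.
fn dedup_by_identity<'a>(nodes: Vec<&'a Json>) -> Vec<&'a Json> {
    let mut unique: Vec<&'a Json> = Vec::new();
    for node in nodes {
        if !unique.iter().any(|seen| std::ptr::eq(*seen, node)) {
            unique.push(node);
        }
    }
    unique
}
```

On a tree built from { "a": { "a": { "a": 42 } } }, applying descendant_name for "a" to the root and then again to the result gives three nodes: { "a": 42 }, 42, and 42 once more. After dedup_by_identity two nodes remain, matching the rq output above.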
Unicode
Currently rsonpath compares JSON keys bytewise, meaning that keys using Unicode escape sequences will be handled incorrectly.
For example, the key "a" can be equivalently represented by "\u0061".
$ rq '$["a"]' --json '{ "a": 42 }'
42
The above results should be the same if either or both of the "a" characters were replaced with the "\u0061" Unicode escape sequence, but rsonpath does not support it at this time. This limitation is known and tracked at #117.
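The effect of bytewise comparison can be seen in a few lines of plain Rust. This is a simplified sketch, not rsonpath code: the function names are hypothetical, and the decoder only understands \uXXXX escapes in the Basic Multilingual Plane, ignoring surrogate pairs, escaped backslashes, and other escapes such as \n.

```rust
/// Decodes `\uXXXX` escapes in a raw JSON member name.
/// Anything that is not a valid four-digit escape is kept verbatim.
fn decode_unicode_escapes(raw: &str) -> String {
    let mut out = String::new();
    let mut rest = raw;
    while let Some(pos) = rest.find("\\u") {
        out.push_str(&rest[..pos]);
        let decoded = rest
            .get(pos + 2..pos + 6)
            .filter(|hex| hex.bytes().all(|b| b.is_ascii_hexdigit()))
            .and_then(|hex| u32::from_str_radix(hex, 16).ok())
            .and_then(char::from_u32);
        match decoded {
            Some(c) => {
                out.push(c);
                rest = &rest[pos + 6..];
            }
            None => {
                out.push_str("\\u");
                rest = &rest[pos + 2..];
            }
        }
    }
    out.push_str(rest);
    out
}

/// Bytewise comparison: escaped and unescaped spellings differ.
fn bytewise_eq(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

/// Comparison after decoding: both spellings denote the same key.
fn unescaped_eq(a: &str, b: &str) -> bool {
    decode_unicode_escapes(a) == decode_unicode_escapes(b)
}
```

Here bytewise_eq on the raw key a and the raw key \u0061 is false, because one is a single byte and the other is six, while unescaped_eq on the same pair is true. The first result mirrors the current limitation, the second is what the specification expects.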
The reason behind the duplicated match in standard semantics is a bit subtle. The standard defines the result as a concatenation of lists of results of executing the rest of the query after a descendant segment, and a recursive execution of the entire query. So the inner "a" key in the example is matched first when evaluating the outermost one, and then again when evaluating the middle "a". We consider this to be counter-intuitive and undesirable.
Query optimization
Our tool’s ambition is to be the fastest JSONPath engine of all. This is done in a similar vein to regex engines, where we try to find the computationally simplest way of performing a given query. This is highly dependent on the query itself, and thus it’s possible to compromise performance by making the query less friendly to the engine.
It’s not always obvious if a query is going to be “nice” to rsonpath. In this chapter we try to outline some common ways of making a query faster by rewriting it to a different, yet equivalent, form.
Starting with a descendant search
The operation that rsonpath performs the fastest is looking for the first key when the query starts with a descendant name selector.
There is no way to make use of this automatically, but as a user you might
have insight into the schema of the documents that are being queried.
Imagine you have a query $.products[*].videoChapters that selects all video chapters from a list of products. It just so happens that in the input document the only occurrences of “videoChapters” are within the “products” list. Therefore, the query $..videoChapters would be equivalent and select exactly the same nodes.
The above example is an actual real-life case. The rewritten query is over ten times faster than the original, a full order of magnitude.
Note that this specifically relates only to the first selector being a descendant selector.
Omitting wildcards
The wildcard selector is relatively expensive, as it forces the engine to closely look at every value it encounters. Using reasoning similar to the one in the previous section it’s sometimes possible to eliminate a wildcard selector by either using a specific name to match, or replacing it with a descendant selector.
Take an extended query from the above example that digs into the structure to select the “chapter” key of a video chapter: $..videoChapters[*].chapter. Again, it just so happens that the query $..videoChapters..chapter is equivalent, as all “chapter” keys always occur only once underneath a “videoChapters” entry. The rewritten query will be faster.
Just as well, it is always better to make the query more specific, if possible. The query $['key']<rest> will always be faster than $[*]<rest>.
Avoiding descendant wildcards
The absolute worst query to run is $..*. It requires the engine to look at every value in the document, nullifying most optimizations.
When facing performance problems, try to express your query without a descendant wildcard, if possible, or at least to restrict it to a smaller portion of the document. For example, $.key..* will be faster than $..* by itself.
Reporting an issue
We consider performance a paramount feature of rsonpath. If you’re facing queries that are excessively slow for your taste, let us know by reporting an issue so that we can benchmark against your use case.
Reporting issues
The rsonpath project is under active development. However, we have very limited resources that can be spent on it, so user feedback is crucial to make us aware of issues and decide how to prioritize them.
Bugs
If you find a bug, report it with our bug report form on GitHub.
The most important part of a bug report is an MRE – Minimal Reproducible Example.
For rsonpath this is usually the query and the JSON document that causes the bug. It might not be possible to publish the actual document due to licensing or non-disclosure agreements, but it is often possible to narrow down the issue and publish a small, anonymized example, where keys in the query and the document are replaced with different, meaningless values.
New features/enhancements
If there is a particular feature you really need, or maybe a query that takes too long and we could optimize against, you should create an issue and let us know.
If there is an issue already open related to your problem, comment or upvote that issue to let us know to prioritize that! It’s a small thing, but really helpful.
Security vulnerabilities
Security vulnerabilities follow a separate flow, using GitHub’s private reporting system. Consult Privately reporting a security vulnerability to learn how to open a report.
Introduction
This part of the book is a work in progress.
use rsonpath::engine::{Compiler, Engine, RsonpathEngine};
use rsonpath::input::BorrowedBytes;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Parse a JSONPath query from string.
    let query = rsonpath_syntax::parse("$..phoneNumbers[*].number")?;
    // Convert the contents to the Input type required by the Engines.
    let contents = r#"
{
  "person": {
    "name": "John",
    "surname": "Doe",
    "phoneNumbers": [
      {
        "type": "Home",
        "number": "111-222-333"
      },
      {
        "type": "Work",
        "number": "123-456-789"
      }
    ]
  }
}
"#;
    let input = BorrowedBytes::new(contents.as_bytes());
    // Compile the query. The engine can be reused to run the same query on different contents.
    let engine = RsonpathEngine::compile_query(&query)?;
    // Count the number of occurrences of elements satisfying the query.
    let count = engine.count(&input)?;
    assert_eq!(2, count);
    Ok(())
}
Introduction
This part of the book is a work in progress.
Acknowledgements
The rsonpath project was inspired by theoretical work by Corentin Barloy, Filip Murlak, and Charles Paperman in Stackless Processing of Streamed Trees.
It would not be possible to create this without prior research into SIMD-accelerated JSON processing, first by Geoff Langdale and Daniel Lemire in Parsing gigabytes of JSON per second and the simdjson project, then by Lin Jiang and Zhijia Zhao in JSONSki: streaming semi-structured data with bit-parallel fast-forwarding and the JSONSki project.
All references and citations can be found in my master’s thesis, Fast execution of JSONPath queries, and in the subsequent paper, Supporting Descendants in SIMD-Accelerated JSONPath. Both are also hosted in this repository in /pdf.
Special thanks to Filip Murlak and Charles Paperman for advising me during my thesis, when most of the fundamentals of the project were born.