Making Parsers Go Brr

Techniques used to optimise oxc, the JavaScript compiler written in Rust

Making Parsers Go Brr

I’ve picked up a few performance engineering tricks after watching this talk on oxc. This blog consists of the techniques used by the oxc team to make their parsers reallyy fast. But first, what is oxc?

Oxc is set of JavaScript tools written in Rust, including a linter, minifier, formatter, and many more. But, in order to do any of those transformations, we first have to transform the program into an Abstract Syntax Tree (AST), a data structure that represents programs.

Consider the line let x = 1 + 2, it could be first lexed into these tokens:

[
  { type: LET }
  { type: IDENTIFIER, value: "x" }
  { type: ASSIGN }
  { type: INT, value: 1 }
  { type: PLUS }
  { type: INT, value: 2 }
]

And then, parsed into the following AST:

              LET
             /   \
            /     \
           /       \
IDENTIFIER: "x"     \
                   PLUS
                   /   \
                  /     \
                 /       \
              INT: 1    INT: 2

These topics are out of scope for this post so, if you’re interested, I’d recommend reading Writing an Interpreter in Go.


Slimmer Token Types

When parsing, tokens are the most allocated objects. Tokens have types to identify what they represent, it could be represented as such:

pub enum TokenType {
  Number(f64),
  String(String)
}

TokenType is a sum type, meaning that it can represent one out of a set of values. In this case, a TokenType enum could be either a number or a string. Let’s take a look at the number of bytes needed to represent this data structure.

The f64 type represents 64bit floating point numbers, requiring 8 bytes.

The String type is an abstraction over a vector of bytes, consisting of three fields:

  • a pointer to the data: 8 bytes
  • the current length of the data: 8 bytes
  • the reserved capacity for the data: 8 bytes
  • total: 24 bytes

Since an enum can have multiple possible values, they need some form of an identifier for the program to know which variant they are dealing with. They’re also sized to be big enough to store all possible variants.

In this case, the TokenType enum would have the following fields:

  • tag: 4 bytes
  • data: 24 bytes (since String is larger than f64)
  • total: 28 bytes

If we were to store the token data elsewhre, we can shrink the layout of TokenType to:

pub enum TokenType {
  Number,
  String
}

Then, the size of the enum would be reduced to 4 bytes for the tag, simply by modifying the way we represent our data stucture.


Smaller Integer Representations

As for the representation of a Token, it could be something like:

pub struct Token {
  start: u64,
  end: u64
}

Tokens are parsed byte by byte, so Token { start: 0, end: 2 } would represent a token starting from byte 0 and ending at byte 2.

Since start and end are both unsigned 64bit numbers, they require 8 bytes each, which is 16 bytes per Token. But, do we actually need all 64bits?

The maximum value of u64 is about 1.84e19, meaning it can represent 1.84e19 bytes, which is about 18.4 billion GB. That’s wayyy too much memory for JavaScript files (or is it).

If we were to use u32 instead, with a max value of 4.29 billion, we can still represent tokens in a 4.29GB file, which is plenty.

The change of integer types allowed us to shrink Token from 16 bytes to 8 bytes. This may seem small, but in large programs, the bytes add up.


Small String Optimisation

Storing string data is expensive due to heap allocations due to switching between user and kernel mode for syscalls, and having to find a continuous block of memory big enough for the allocation. Pointer chasing also hurts locality as we cannot utilise the cache line, more on that later.

Fortunately, this crate allows strings less than 24 bytes to be allocated on the stack, reducing heap allocations and improving cache locality.


SIMD (Single Instruction Multiple Data)

Comments don’t affect code execution, since they are only meant for humans. Therefore lexers must recognise and skip them to prevent wasting memory on redundant tokens. Comments and JSDocs can get very long, and parsing them one byte at a time isn’t very efficient.

Modern CPUs offer SIMD instructions that can operate on groups of values in parallel. For example, instead of looping over two arrays to add the both of them together, we could use SIMD to multiple elements in one instruction.

For oxc, the same can be applied when lexing tokens. SIMD can be used when the next byte is the start of a comment, so that we can skip to the end a lot more quickly.


Cache Lines and Arena Allocators

Modern CPUs don’t read data byte by byte, they read them in groups of 64 bytes. This forms the cache line which means that whatever’s next to our current read would be loaded into the cache as well.

Imagine the following array let arr = [1, 2, 3, 4, 5]. If we were to read arr[0], the computer would load all the other elements into the cache as well. So, subsequent calls to read arr[1/2/3/4] would be much faster.

If we were to allocate memory for tokens one by one, this would cause random, sparse allocations on the heap. It prevents us from utilising the cache lines since whatever data prefetched would not contain what we need next.

An arena allocator can be used to allocate a large section of heap memory at a time. This allows token data to be allocated together, enabling us to better utilise our cache lines and even reducing the syscalls needed for allocation and deallocation.


Modern CPUs are really really fast, and if done right, they can be squeezed to perform tasks in the blink of an eye. For example, oxlint only took 3.4 seconds to lint 122,630 files with 24 threads! source

I hope this gave you a glimpse into performance engineering. Even if you don’t write Rust, the techniques mentioned here can be applied to other languages as well. Let me know if you spot any mistakes! (|) ._. (|/)