ponyfoo.com

JavaScript Performance Pitfalls in V8

Fix
A relevant ad will be displayed here soon. These ads help pay for my hosting.
Please consider disabling your ad blocker on Pony Foo. These ads help pay for my hosting.
You can support Pony Foo directly through Patreon or via PayPal.

In recent years, JavaScript engines have improved on all fronts. The performance of JavaScript has reached a level where it can easily compete with programming languages that have traditionally been considered more appropriate for high-performance computing. This is not only true for V8, the JavaScript engine inside of Chrome and Node.js, but for all major JavaScript engines, including ChakraCore, the engine inside of Edge, JavaScriptCore, the engine inside of Safari, and SpiderMonkey, the engine inside of Firefox.

Not only did the peak performance improve, but engines also managed to deliver more consistent and predictable levels of performance. Given that JavaScript is a highly dynamic language, all of this performance is based on choosing the right heuristics in the engine. JavaScript engines use techniques like speculative optimization and inline caching to speed up execution of the likely path.

But heuristics can also easily work against you, and it helps to be aware of them. So today I’m gonna share some background on two subtle performance pitfalls in the V8 JavaScript engine. Being aware of these potential pitfalls might help you to resolve issues if you get bitten by these more easily.

Optimization limit

The compilers built into V8 - both the TurboFan optimizing compiler and the Ignition bytecode generator - are so-called method JITs, meaning the unit of compilation is always a method, aka a function in JavaScript speak. The optimizing compiler is able to include the bodies of other methods when it finds hot call sites and sees potential for further optimizations via doing this, which is commonly referred to as inlining. Contrast this with other runtimes that use so-called tracing JITs - LuaJIT and PyPy are popular examples here - where the unit of optimization is an arbitrary sequence of consecutive bytecodes that have been executed repeatedly previously.

In both method JITs as well as tracing JITs, there’s always an upper limit on the size of the input that the optimizer can deal with. For method JITs this limit is naturally defined by the size of the function itself. TurboFan also has such a limit, which is currently 60 KiB of bytecodes (the kMaxBytecodeSizeForOpt constant in runtime-profiler.cc). That means if you have functions whose generated bytecode is bigger than 60 KiB, they will not ever be optimized by TurboFan, even if they are considered hot (i.e. invoked very often).

Let’s consider an example, using eval to dynamically generate functions. (We intentionally don’t use the Function constructor here since that doesn’t allow us to put a name onto the function that is syntactically recognizable by the JavaScript engine, and that would make the investigations really challenging).

function generate(n) {
  let s = "(function add" + n + "(x) { return 0";
  for (let i = 0; i < n; ++i) {
    s += "+x";
  }
  s += "; })";
  return eval(s);
}

This generate function constructs a new JavaScript function object which adds up the parameter passed to it n times. So when you call generate(10) you’ll get a function object that looks like this:

function add10(x) {
  return 0+x+x+x+x+x+x+x+x+x+x;
}

Now let’s take a concrete example of the above and run it in Node.js:

// add10.js

function generate(n) {
  let s = "(function add" + n + "(x) { return 0";
  for (let i = 0; i < n; ++i) {
    s += "+x";
  }
  s += "; })";
  return eval(s);
}

const add10 = generate(10);
add10();

Looking at the generated bytecode for the add10 function, using the --print-bytecode flag with the node shell, we see something like this (output is from Node v10.15):

$ node --print-bytecode add10.js
…
[generated bytecode for function: add10]
Parameter count 2
Frame size 8
   19 E> 0x279c523204b2 @    0 : a5                StackCheck
   26 S> 0x279c523204b3 @    1 : 0b                LdaZero
         0x279c523204b4 @    2 : 26 fb             Star r0
         0x279c523204b6 @    4 : 25 02             Ldar a0
   34 E> 0x279c523204b8 @    6 : 34 fb 00          Add r0, [0]
         0x279c523204bb @    9 : 26 fb             Star r0
         0x279c523204bd @   11 : 25 02             Ldar a0
   36 E> 0x279c523204bf @   13 : 34 fb 01          Add r0, [1]
         0x279c523204c2 @   16 : 26 fb             Star r0
         0x279c523204c4 @   18 : 25 02             Ldar a0
   38 E> 0x279c523204c6 @   20 : 34 fb 02          Add r0, [2]
         0x279c523204c9 @   23 : 26 fb             Star r0
         0x279c523204cb @   25 : 25 02             Ldar a0
   40 E> 0x279c523204cd @   27 : 34 fb 03          Add r0, [3]
         0x279c523204d0 @   30 : 26 fb             Star r0
         0x279c523204d2 @   32 : 25 02             Ldar a0
   42 E> 0x279c523204d4 @   34 : 34 fb 04          Add r0, [4]
         0x279c523204d7 @   37 : 26 fb             Star r0
         0x279c523204d9 @   39 : 25 02             Ldar a0
   44 E> 0x279c523204db @   41 : 34 fb 05          Add r0, [5]
         0x279c523204de @   44 : 26 fb             Star r0
         0x279c523204e0 @   46 : 25 02             Ldar a0
   46 E> 0x279c523204e2 @   48 : 34 fb 06          Add r0, [6]
         0x279c523204e5 @   51 : 26 fb             Star r0
         0x279c523204e7 @   53 : 25 02             Ldar a0
   48 E> 0x279c523204e9 @   55 : 34 fb 07          Add r0, [7]
         0x279c523204ec @   58 : 26 fb             Star r0
         0x279c523204ee @   60 : 25 02             Ldar a0
   50 E> 0x279c523204f0 @   62 : 34 fb 08          Add r0, [8]
         0x279c523204f3 @   65 : 26 fb             Star r0
         0x279c523204f5 @   67 : 25 02             Ldar a0
   52 E> 0x279c523204f7 @   69 : 34 fb 09          Add r0, [9]
   55 S> 0x279c523204fa @   72 : a9                Return
Constant pool (size = 0)
Handler Table (size = 0)
…

Since Node.js runs a lot of its own JavaScript, you might get pages of output here. Search for the phrase [generated bytecode for function: add10] where the following bytecode dump contains a sequence of 10 Adds with Star and Ldar bytecodes in between. There’s a blog post on Understanding V8’s Bytecode from my former colleague Franziska Hinkelmann if you’re curious to learn more about this topic.

Removing the irrelevant parts we see this concrete bytecode output:

  0 : a5                StackCheck
  1 : 0b                LdaZero
  2 : 26 fb             Star r0
  4 : 25 02             Ldar a0
  6 : 34 fb 00          Add r0, [0]
  9 : 26 fb             Star r0
 11 : 25 02             Ldar a0
 13 : 34 fb 01          Add r0, [1]
 16 : 26 fb             Star r0
 18 : 25 02             Ldar a0
 20 : 34 fb 02          Add r0, [2]
 23 : 26 fb             Star r0
 25 : 25 02             Ldar a0
 27 : 34 fb 03          Add r0, [3]
 30 : 26 fb             Star r0
 32 : 25 02             Ldar a0
 34 : 34 fb 04          Add r0, [4]
 37 : 26 fb             Star r0
 39 : 25 02             Ldar a0
 41 : 34 fb 05          Add r0, [5]
 44 : 26 fb             Star r0
 46 : 25 02             Ldar a0
 48 : 34 fb 06          Add r0, [6]
 51 : 26 fb             Star r0
 53 : 25 02             Ldar a0
 55 : 34 fb 07          Add r0, [7]
 58 : 26 fb             Star r0
 60 : 25 02             Ldar a0
 62 : 34 fb 08          Add r0, [8]
 65 : 26 fb             Star r0
 67 : 25 02             Ldar a0
 69 : 34 fb 09          Add r0, [9]
 72 : a9                Return

Looking at the output above, we see that the generated bytecode for add10 is 73 bytes in size (72 is the offset of the Return bytecode and that instruction is 1 byte in size). Given what we learned before about the 60 KiB limit in TurboFan, this function should be easily optimizable by V8. Let’s try and see what happens if we put that function in a hot loop.

// add10-optimized.js

function generate(n) {
  let s = "(function add" + n + "(x) { return 0";
  for (let i = 0; i < n; ++i) {
    s += "+x";
  }
  s += "; })";
  return eval(s);
}

const add10 = generate(10);

let result = 0;
for (let i = 0; i < 5 * 1000; ++i) {
  result += add10(i);
}

Running this snippet inside of Node, passing the --trace-opt command line flag, we see something similar to the following output:

$ node --trace-opt add10-optimized.js
[marking 0x32fc83347751 <JSFunction add10 (sfi = 0x32fcde157b91)> for optimized recompilation, reason: small function, ICs with typeinfo: 10/10 (100%), generic ICs: 0/10 (0%)]
[compiling method 0x32fc83347751 <JSFunction add10 (sfi = 0x32fcde157b91)> using TurboFan]
[optimizing 0x32fc83347751 <JSFunction add10 (sfi = 0x32fcde157b91)> - took 0.823, 0.514, 0.016 ms]
[completed optimizing 0x32fc83347751 <JSFunction add10 (sfi = 0x32fcde157b91)>]

You may see messages about other functions being optimized or marked for optimized recompilation as well, just ignore those. The interesting line is the last line, which says that add10 was successfully optimized by TurboFan. This matches our expectations. Note that every modern JavaScript engine gives you this behavior.

// add10000.js

function generate(n) {
  let s = "(function add" + n + "(x) { return 0";
  for (let i = 0; i < n; ++i) {
    s += "+x";
  }
  s += "; })";
  return eval(s);
}

const add10000 = generate(10 * 1000);

let result = 0;
for (let i = 0; i < 5 * 1000; ++i) {
  result += add10000(i);
}

Let’s see what happens if we increase the size of the generated function such that the bytecode for the function exceeds the limit in TurboFan. We choose a rather arbitrary number of 10,000 additions for this example. Running…

node --trace-opt add10000.js

Doesn’t print anything to the console, meaning that add10000 is not even considered for optimization by TurboFan (it’s a bit unfortunate that V8 doesn’t say anything in this case, but just silently continues). In fact TurboFan is not even involved here! Instead, the so-called RuntimeProfiler that profiles bytecode execution immediately decides that the function is too big to be considered for optimization and disables optimization of the function (in the RuntimeProfiler::ShouldOptimize method).

Running node again with --print-bytecode we see the culprit:

$ node --print-bytecode add10000.js
…
[generated bytecode for function: add10000]
Parameter count 2
Frame size 8
   18 E> 0x2e785b7dcf22 @    0 : a0                StackCheck
   24 S> 0x2e785b7dcf23 @    1 : 0b                LdaZero
         0x2e785b7dcf24 @    2 : 26 fb             Star r0
         0x2e785b7dcf26 @    4 : 25 02             Ldar a0
   32 E> 0x2e785b7dcf28 @    6 : 32 fb 00          Add r0, [0]
         0x2e785b7dcf2b @    9 : 26 fb             Star r0
         0x2e785b7dcf2d @   11 : 25 02             Ldar a0
   34 E> 0x2e785b7dcf2f @   13 : 32 fb 01          Add r0, [1]
…
20026 E> 0x2e785b7f52aa @ 99208 : 00 32 fb ff 0d 27 Add.Wide r0, [9997]
         0x2e785b7f52b0 @ 99214 : 26 fb             Star r0
         0x2e785b7f52b2 @ 99216 : 25 02             Ldar a0
20028 E> 0x2e785b7f52b4 @ 99218 : 00 32 fb ff 0e 27 Add.Wide r0, [9998]
         0x2e785b7f52ba @ 99224 : 26 fb             Star r0
         0x2e785b7f52bc @ 99226 : 25 02             Ldar a0
20030 E> 0x2e785b7f52be @ 99228 : 00 32 fb ff 0f 27 Add.Wide r0, [9999]
20033 S> 0x2e785b7f52c4 @ 99234 : a4                Return
Constant pool (size = 0)
Handler Table (size = 0)

The function add10000 generated 99,235 bytes of bytecode instructions. That certainly exceeds the 60 KiB limit imposed by the RuntimeProfiler.

Background

There’s an ongoing discussion in v8:8598 regarding the issue of this optimization limit. In general, there’ll always be a limit of some form, since it’s all about trade-offs in the engine. In V8’s case, the limit allows TurboFan to store various counts (i.e. the number of inputs and outputs of entities in the intermediate representation) as 16-bit integers vs. 32-bit integers. Other engines and languages have similar limits. For example, in Java the 64 KiB method limit is even part of the JVM bytecode specification.

Take-away from this section

Make sure to split large functions into smaller building blocks. This is generally good advice for maintainability, but also helps the JIT to properly optimize everything that’s relevant to your application. Smaller functions also generally play well together with the inlining machinery inside the JIT, and often reduce the cost of compilation and optimization. Imagine you have a bigger function that does 10 different tasks;ven if you only need 1 or 2 of those tasks, you still pay the cost of compiling everything that’s in the function (at least compiling to bytecode), plus since inlining heuristics also take into account the size of the inlinee, it’s a lot less likely that such a function is inlined at a hot call site.

It’s worth pointing out that normally you don’t run into the 60KiB limit when writing JavaScript, except in some really extreme cases. But tools generating JavaScript programmatically (i.e. parser generators) can hit this limit easily.

Double fields

The V8 engine uses a technique called pointer tagging to encode arbitrary JavaScript values. The trick here to realize that while pointers can be used to address any single byte in memory, this is not needed for JavaScript objects in memory, which usually are aligned to word boundaries (i.e. 4-byte aligned on 32-bit architectures and 8-byte aligned on 64-bit architectures). So the least significant bits in any valid object pointer are zero. V8 uses these bits to encode additional information. In particular it uses the two least significant bits to distinguish between three different kinds of values:

A Smi is a small integer in the 31-bit range, i.e. a value between -1,073,741,824 and 1,073,741,8231, shifted up by one bit and padded with a 0 in the least significant bit. A HeapObject pointer is the address of an object in (managed) memory, with the two least significant bits set to 01. That means when V8 wants to get to the real address of the object it has to subtract one from the value. There’s also the WeakHeapObject, which has the least significant bits set to 11, and is essentially like HeapObject, except that the reference is treated weakly by the garbage collector.

For the purpose of this article we only care about Smi and HeapObject. Given that small integers are the most common number values in JavaScript programs, it was considered important to have an efficient value encoding for those. That’s why there’s the special Smi encoding: so that small integers can be stored efficiently, for example inside of objects.

const a = {
  x: 42,
  y: "Hello"
};

Just consider the object a above. It has two properties x and y, and x has a small integer value, whereas y has a String value. V8 is going to represent that object in memory like this:

Here a is represented as a JSObject with a Shape that holds the information about the properties of a, and the actual values of the properties are in the instance a (you can read more about shapes in our previous articles JavaScript engine fundamentals: Shapes and Inline Caches and JavaScript engine fundamentals: optimizing prototypes). Due to the value encoding mentioned above, the small integer 42 can be stored efficiently inside of the JSObject whereas the String value has to be represented as a separate entity in memory and pointed to by a HeapObject pointer.

Now for Strings that seems kind of obvious and makes sense to have them allocated as separate entities2, but what about other number values for example? In JavaScript numbers are 64-bit double precision floating point values so they can encode a lot of values outside the 31-bit integer range. In the basic pointer tagging scheme used in V8, all numbers outside this range have to represented as separately allocated HeapObject entities.

const b = {
  x: 2 ** 32,
  y: 1.5
};

In the example of b above, we assign x an integer value 4,294,967,296, which is outside the 31-bit integer range, and y is assigned a number that is not even an integer at all. Neither of these values cannot use the efficient Smi encoding, and thus need separate heap-allocated entities, so-called HeapNumbers in V8 speak:

As you can imagine this would be quite costly when you think of applications that operate on data structures that mostly consist of numeric values outside the 31-bit signed integer range, i.e. points, vectors, etc, and often update the values of these properties. In such cases every update to these properties would have to allocate a new HeapNumber entity, since these entities are by their nature immutable (since they are passed around and thus can be referenced from many different places).

const c = {x: 1.1};
c.x = 1.2;
c.x = 1.3;

Running this simple snippet results in the creation of three HeapNumber objects, one for each of the different numbers 1.1, 1.2 and 1.3.

Always having to allocate new HeapNumbers can be quite costly specifically for numeric applications, as it causes a lot of traffic on the garbage collector and may lead to bad cache locality.

To mitigate this problem for applications that operate on lots of objects with numeric properties, V8 has a concept called field representation tracking, which tracks for each field in an object what the representation of the values were so far. Specifically V8 tracks whether a field (as described by a concrete shape) has always stored numbers so far, where we distinguish between states Smi and Double (the former meaning that only values in the small integer range were seen, the latter means that also other number values outside that range occurred).

Fields with Smi representation are already stored efficiently as described above, but the optimizing compiler TurboFan can use the field representation information to generate more efficient code (i.e. it doesn’t need to check whether the field contains an integer since that’s known from the representation on the shape). For Double fields a new MutableHeapNumber entity was introduced that looks like a HeapNumber, but is tied to the instance and can be updated in-place.

const c = {x: 1.1};
c.x = 1.2;
c.x = 1.3;

Looking at this example with the update again, we see that only a single storage for the double values need to be allocated and it’s just updated to the most recent value in-place:

Notice how the Property information now tags the property 'x' as a Double field.

This optimization is done by V8 under the hood, as a developer you don’t even notice that it’s going on. And even better, on 64-bit architectures V8 can store the double values inline into the instance itself, using a technique called double field unboxing. I’m not going to describe that here, since the details of that are quite complex. Take-away so far should be that this goes on automatically and usually does the right thing for you.

But as with any heuristic, there are trade-offs. You might already ask yourself how that work with the dynamic nature of JavaScript? Specifically since a field can hold any arbitrary value at any given point in time. Like what if you decide to store a string into c.x at some point?

c.x = "Hello";

This is perfectly valid in JavaScript, and V8 handles this by having a lattice of field representations like this:

For each field V8 will choose the best possible representation given the value that’s being stored, and it will progress through the lattice as necessary. Like in the case of storing a string into a field that currently has a Double representation tag, it’ll progress the field to Tagged representation, which essentially says anything. Ideally fields should have one of Smi, Double or HeapObject representation tags (highlighted in orange above), i.e. you should not mix numbers and other values, for best performance in most cases. You may already know about the concept of elements kinds in V8, which is very similar to the field representation mechanism. Elements kinds are concerned with array-indexed properties, whereas field representations are used for non-array indexed properties.

Mixed number and non-number fields

When you store both number and non-number values into a field, its field representation is going to be Tagged for sure, since that’s the only valid representation for fields that hold both number and non-number values. If this is really what you intended to do, good. But if you intended to operate on number values really, then mixing in non-number values, including undefined or null primitives is a bad idea. Performance-wise the JavaScript engine can skip a bunch of checks if it knows that a field has Smi or Double representation. Consider this simple example

function distance(p1, p2) {
  const dx = p1.x - p2.x;
  const dy = p1.y - p2.y;
  return Math.sqrt(dx * dx + dy * dy);
}

which computes the distance between two points p1 and p2. Without any additional information about the representation of the fields x and y, the engine would need to check that the results of the four property accesses p1.x, p2.x, p1.y, and p2.y all yields numbers (and handle the case where they are not).

Specifically don’t ever pre-initialize number fields with undefined or null, unless there’s a good reason why this would be necessary for your application. That being said, don’t do something like this

class Point {
  constructor() {
    this.x = null;
    this.y = null;
  }

  setX(x) { this.x = x; }
  setY(y) { this.y = y; }
}

i.e. use null as a sentinel value for number fields. Prefer actual number values, i.e. maybe just 0 (if you plan on using small integers) or 0.1 or even NaN (if you plan on using floats).

class Point {
  constructor() {
    this.x = 0.1;
    this.y = 0.1;
  }

  setX(x) { this.x = x; }
  setY(y) { this.y = y; }
}

Smi to Double migrations

Even if you pay attention to what was said above, there’s still another subtle problem that you might run into with number fields. For example the React team got bitten by this recently. The subtle issue here, without going into too much detail, is that Smi and Double field representation aren’t compatible: The former requires the number to be encoded as part of the value in the field, the latter requires the number to be stored in a dedicated MutableHeapNumber entity. So going from Smi to Double involves a process called instance migration, where instances that used to have the value stored as a Smi need to be migrated to have the field value stored in a MutableHeapNumber instead.

This process involves creating entirely new shapes for the instances, and lazily migrating instances from the old shape to the new shape as the engine hits the instances. (It would be too expensive to stop the world and find all the instances in managed memory at the time when the migration is kicked off). Since shapes are organized in trees (as explained in JavaScript engine fundamentals: Shapes and Inline Caches), this might involve the creation of new shape subtrees. Consider this example of two objects o1 and o2 with Smi fields x and y:

const o1 = {};
o1.x = 3;
o1.y = 4;

const o2 = {};
o2.x = 5;
o2.y = 6;

This creates appropriate shapes and transitions between these shapes as outlined below:

Now imagine we change the value of o2.x to value 5.5, which cannot be represented as a Smi like this:

o2.x = 5.5;

Now V8 has to create a whole new subtree starting from the shape that introduced the property x, and have o2 use the new shape. The instance o1 still points to the previous shape, which is now marked as deprecated.

As the diagram shows, the old shape subtree is essentially disconnected from the root shape, and instead a new shape subtree is put in place. All the shapes in the old subtree are marked as deprecated, and will eventually be garbage-collected once no more instances use them. In the meantime however there’s additional overhead essentially keeping two separate trees of metadata alive.

The next time o1 is seen by any IC (inline cache), it automatically migrates the instance away from the deprecated shape to its so-called migration target, which is the corresponding shape in the new subtree. For example, if you now perform a property access like o1.y it’ll automatically do this self-healing. You don’t even need to store to it; loading any property from it is sufficient. And this also forces o1.x into a Double field representation, so that afterwards o1 and o2 use the same shape again:

Now imagine what happens when this has to be done for thousands of objects with plenty of properties on them, which is kinda what happened in facebook/react#14365, except that in their case, it was some bad interaction with Object.preventExtensions() on top, where a bug in V8 causes it to miss the proper migration target somehow, and so V8 ended up allocating a new unique shape for each and every FiberNode instance in this application. The relevant issue v8:8538 is still under investigation.

The actionable advice for you here is to make sure that whenever you’re going to use a field to hold arbitrary number values (including values outside the small integer range), initialize this field with a non-small integer number. That way, you get the correct shapes from the beginning, and you avoid the migration cost (and troubles). One easy way to accomplish this is to store NaN initially3, which was also the fix for the React issue:

class Point {
  constructor(x, y) {
    this.x = NaN; this.x = x;
    this.y = NaN; this.y = y;
  }
}

But also like with all performance advice: Don’t do this blindly everywhere. A lot of your code is probably fine and unaffected performance-wise, even if objects undergo migrations sometimes.

Take-away from this section

Try to avoid mixing field values of different types, i.e. don’t mix numbers, strings, objects, other primitives, unless that’s what you intended to do. Specifically don’t pre-initialize number fields to null or undefined, but choose sensible default numbers (use NaN if in doubt). And try to initialize double fields (aka fields that are supposed to hold number values outside the small integer range) with double values outside the small integer range up-front, i.e. if in doubt put a NaN there first, and then store the actual initial value.

Conclusion

You should trust that the JavaScript engines usually do the right thing for you automatically. But as with everything, it’s good to be aware of some of the details, so in case you bump into problems, you’ll find your way around it, and sometimes giving the JavaScript engine an additional hint can be beneficial if the heuristics would otherwise fail badly.

1

At the time of this writing, Smis on 64-bit architectures are integers in the 32-bit range, but this is probably going to change to use 31-bit ranged integers everywhere in the near future.

2

Although there are also JavaScript VMs that manage to store short strings encoded into the value itself, for example the V7 Embedded JavaScript engine does that.

3

Choice of NaN is completely arbitrary here, it’s just the way I’d do it, as it signals to my future self reading this code that I didn’t accidentally overwrite the value immediately, but did this on purpose. You can use any non-small integer number here, i.e. 0.1, Number.MAX_VALUE or even -0 would all work here.

Liked the article? Subscribe below to get an email when new articles come out! Also, follow @ponyfoo on Twitter and @ponyfoo on Facebook.
One-click unsubscribe, anytime. Learn more.

Comments