Some thoughts on improving SqueakJS performance, by Vanessa in March 2021 (updated: November 2023)

This is a much more detailed version of the ideas outlined [before](https://github.com/codefrau/SqueakJS/wiki/JIT-Ideas-...).

# Context Mapping

## Contexts are slow

In Smalltalk, every send conceptually creates a new Context object referencing its sender, holding a method (for referencing literals) and an instruction pointer, as well as fields for the receiver (to access instance variables), arguments, and temporary variables of the activated method. There is also a variable-sized stack area for holding intermediate computation results and parameters to be passed as arguments to other methods. Having to allocate a new Context for every send, and to initialize all its fields by copying from the sending context, is expensive, making sends slow.

## Context-to-Stack mapping

... is _the_ established approach to avoid having to allocate a new Context object for every method activation. In their 1984 paper, L. Peter Deutsch and Allan M. Schiffman describe how to have multiple representations for Contexts: one optimized for execution speed in typical execution patterns, in addition to actual Context objects providing the full semantics as required by the Smalltalk language. The optimized representation is a linear stack that spans several method activations, organized so that receiver and method arguments do not have to be copied but can be used as-is. This is an extension of the overlapping stack frames in Smalltalk-78, which increased speed and memory efficiency at the cost of losing some of the more advanced semantics; the new scheme preserves the full semantics. Organizing this as a stack maps very well to execution on typical CPUs, which provide optimized instructions for working with stacks. Eliot did a [beautiful implementation](http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/) for Squeak.

One detail simplifies that kind of stack organization: pointers. A pointer (e.g. stack pointer, frame pointer, etc.) is a single machine word that can identify both the stack page and the index within that stack page. If we were to implement this in Javascript, we would have to either allocate an object to hold both the page and the index, or encode both of these in a number (we do have 53-bit integers available) and look up the page in an array. Or maybe all stack pages would be contained in a single array, in which case the pointer could indeed just be an index (although again, the code needs access to that stack array, in addition to the index). Either way, pointers do make this scheme efficient and elegant.
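To make the encoding option concrete, here is a minimal sketch of packing a page index and a slot offset into one plain number (all names here are hypothetical, not part of SqueakJS):

~~~~~~~~~~~~~~~~~~~~ JS
// Hypothetical sketch: a "pointer" is pageIndex * PAGE_SIZE + offset.
// 53-bit integers leave ample room for both parts.
const PAGE_SIZE = 0x10000;              // slots per stack page (assumed)
const stackPages = [[]];                // all pages, each a plain JS array

function makePointer(pageIndex, offset) { return pageIndex * PAGE_SIZE + offset; }
function fetch(ptr) { return stackPages[Math.floor(ptr / PAGE_SIZE)][ptr % PAGE_SIZE]; }
function store(ptr, value) { stackPages[Math.floor(ptr / PAGE_SIZE)][ptr % PAGE_SIZE] = value; }

store(makePointer(0, 3), "top of stack");
console.log(fetch(makePointer(0, 3)));  // "top of stack"
~~~~~~~~~~~~~~~~~~~~

Note that every access still needs the `stackPages` array in scope, which is exactly the extra indirection mentioned above.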
## Context-to-Var mapping

In SqueakJS, the execution is not performed directly by CPUs, but by a Javascript VM. It is impossible to directly utilize whatever stack-based instructions the host CPU might have. The only available execution semantics are those of Javascript. And Javascript VMs are extremely good at running code fast, as long as it uses typical JS execution patterns. So while we could attempt a traditional Context-to-Stack mapping approach to speed up SqueakJS, it may be advantageous to consider a mapping of Smalltalk Contexts to the semantics of Javascript execution, namely temporary variables and function arguments.

To gauge if this makes a significant difference, let's do a little benchmark. We'll implement the same send-heavy algorithm once passing arguments on a stack, and once passing them as function arguments:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ JS
const stack = [];
function benchFibStack() {
    let n = stack.pop();
    if (n < 2) {
        stack.push(1);
    } else {
        stack.push(n - 1);
        benchFibStack();
        stack.push(n - 2);
        benchFibStack();
        stack.push(stack.pop() + stack.pop());
        stack.push(1);
        stack.push(stack.pop() + stack.pop());
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[Fibonacci benchmark passing args via stack: ? million sends/s in this browser]

As opposed to the hopefully more efficient

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ JS
function benchFibArgs(n) {
    let result;
    if (n < 2) result = 1;
    else {
        let tmp0 = benchFibArgs(n - 1);
        let tmp1 = benchFibArgs(n - 2);
        let tmp2 = tmp0 + tmp1;
        result = tmp2 + 1;
    }
    return result;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[Fibonacci benchmark passing args directly: ? million sends/s in this browser]

The captions of these code blocks should show the performance of `benchFib()` in this browser. Typically, the stack version is 2.5x-5x slower than the argument-passing version (?x slower in this browser, reload the page to get more accurate results). Since the whole point of implementing a new JIT is speed, we might as well try to use the fastest option possible.

## Bi-directional mapping

The question then is: can we establish a bi-directional mapping between contexts and frames, if there are no actual frames on a stack? The speedup comes not only from avoiding the allocation of unnecessary contexts, but also from keeping the fast frames in use as long as possible. Eliot's VM uses a clever scheme where a Context can act as a proxy to a live frame, so even when a context has to be created, the frame does not immediately have to be discarded.

When the image is loaded, all contexts (e.g. the active context) are actual objects. The first few bytecodes would be interpreted directly, writing into that context. But when a send invokes the function of a JIT-compiled method, then instead of allocating a new context object, we would just pop the arguments off the context stack into temp vars, pass them as arguments to the function, and push the return value onto the stack. Inside of the jitted function, temp variables would be used instead of a stack, similar to the `benchFibArgs` example above. Sends that invoke other jitted functions would just pass arguments and return values directly. This should be extremely efficient.

Where this scheme gets _interesting_ is when the execution has progressed somewhat deep into a nested call chain and we then need to deal with contexts. It could be that execution is interrupted by a process switch, or that the code reads some fields of `thisContext`, or worse, writes into a field of `thisContext`. Other "interesting" occasions are garbage collections, or when we want to snapshot the image. Let's look at these in turn.

### Context Switch

We need to generate a context for the current method activation, which later needs to be resumable. To suspend JS execution we can either try something coroutine-like via `function*`/`yield`, or use promises and `async`/`await`. A different scheme would discard the whole call chain, possibly by `throw`ing an exception and generating real context objects in a `finally` handler in each method. This sounds rather expensive, but we should measure the impact.
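To make the coroutine option more concrete, here is a minimal sketch (hypothetical names, unrelated to the actual VM code) of how `function*`/`yield` can suspend a whole chain of calls:

~~~~~~~~~~~~~~~~~~~~ JS
// Minimal sketch of the coroutine option: every jitted method is a
// generator, sends delegate via yield*, and a plain yield anywhere
// suspends the entire chain of activations at once.
function* methodA(vm, rcvr) {
    const x = yield* methodB(vm, rcvr); // a send: delegates, stays suspendable
    if (vm.switchPending) yield;        // process switch: suspend whole chain
    return x + 1;
}
function* methodB(vm, rcvr) { return 41; }

// a scheduler drives the chain and can resume it after a process switch:
const vm = { switchPending: false };
const activation = methodA(vm, null);
let step = activation.next();           // runs until a yield or the final return
while (!step.done) step = activation.next();
console.log(step.value);                // 42
~~~~~~~~~~~~~~~~~~~~

The catch, as the measurements further below show, is that every method then pays the generator overhead even when no switch happens.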
### Context reads

We create a context acting as a proxy into a jitted method call. Somehow this would allow reading the method, instruction pointer (pc), and stack fields at that particular point of execution. But since JS has no introspection facilities to read temp vars, this might mean that we need to capture all the "live" variables before every send. If that is true, then this would likely be just as expensive as creating contexts in the first place, possibly even more expensive. We need something that has as little overhead as possible, to optimize the "fast" execution case where no introspection is needed.

### Context writes

Now this sounds really hairy. For writing into temp vars we would have to give the context proxy some way of writing into each temp var. The only way to do that would be a function. This "remote introspection" function would have to be an inner function of every jit-compiled method. However, if we had that function, then it could be parameterized to read or write any temp, solving the context-read problem, too.

### GC

For both incremental and full GC we need to be able to walk the full activation chain of methods, identify "live" temp vars, and trace them. That probably means we need to pass _something_ into each method call that would let us "remotely" peek into the execution state inside that activation. Again, the remote introspection function should be able to give access to all variables, which in this case could be used to create proxy contexts for each invocation.

### Snapshotting

When a snapshot is required, we have no choice but to create actual context objects for all function invocations. If the GC creates all contexts, then nothing special has to be done. Since no writes happen at snapshot time, we might be able to keep the invocations alive.

### Interrupt Checks

We need to check for interrupts (process switches) periodically. These checks are typically done on method invocations and backwards jumps (loops). A global `interruptCheckCounter` ensures that not every send and jump requires the expensive checks.

### Maximum callstack size

The JS call stack is limited in depth -- the limits are generous (~10,000) but we should still design with that limit in mind. Simplest might be passing a `depth` argument into each jitted function, and passing `depth+1` on each send inside. Or maybe start with a max depth and decrement in nested calls. To check the depth we would pass the value to the interrupt check, so it is not done on every nested call.

# Method lookup

Another issue slowing down the interpreter, and the current JIT, is method lookup. The interface for each method invocation is a generic `send()` function that takes the selector, number of arguments, and super flag. The VM knows the receiver and walks up the class hierarchy to find the method corresponding to the selector.

## Global Caching

In our interpreter, a global method cache is used to store recent lookup results. It is based on the hash of the selector and lookup class, and there are 1024 entries. This cache is also available to the current JIT, which uses the exact same send mechanism as the interpreter.
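In essence, such a cache boils down to the following sketch (the names and the hash function are assumptions for illustration, not the actual SqueakJS code):

~~~~~~~~~~~~~~~~~~~~ JS
// Sketch of a global method cache: 1024 entries, keyed by the hash of
// the selector and the lookup class.
const CACHE_SIZE = 1024;                 // power of two, so we can mask instead of modulo
const methodCache = new Array(CACHE_SIZE).fill(null);

function cachedLookup(vm, selector, lookupClass) {
    const hash = (selector.hash ^ lookupClass.hash) & (CACHE_SIZE - 1);
    const entry = methodCache[hash];
    if (entry && entry.selector === selector && entry.lookupClass === lookupClass)
        return entry.method;             // hit: no class-hierarchy walk needed
    const method = vm.lookup(lookupClass, selector, false);  // slow path
    methodCache[hash] = { selector, lookupClass, method };   // (re)fill the entry
    return method;
}
~~~~~~~~~~~~~~~~~~~~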
## Inline Caching

But the JIT can do much better than the global cache. It can easily store a separate lookup result for every send site. The selector is constant, and send sites often are monomorphic, meaning the receiver class is also constant, especially in performance-optimized code. Eliot [covers the subject in some detail](http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/) and even goes so far as to claim that "inline cacheing [is] the main rationale behind the JIT."

For our Javascript implementation, closure variables seem like a natural place to store lookup results inline. If we only cache a single entry, the cache is called _monomorphic_. For each send, we store the receiver class, the lookup result (method), and the jitted function of that method. On each invocation we need to verify that the receiver is still of the same class, and if it is, we can directly call the function. Only if the class does not match do we need to perform the lookup (and in that case, the global cache will help).

~~~~~~~~~~~~~~~~~~~~
var cachedClass = null, cachedMethod, cachedFunc;
if (cachedClass !== receiver.sqClass) {
    cachedClass = receiver.sqClass;
    cachedMethod = vm.lookup(cachedClass, selector, supered);
    cachedFunc = vm.compile(cachedMethod);
}
cachedFunc(...);
~~~~~~~~~~~~~~~~~~~~

One detail here is that this needs to work both for full Squeak objects that have a `sqClass` property, and for SmallIntegers, which are represented as Javascript numbers. For them, `receiver.sqClass` will be `undefined`. To make the initial check fail, we initialize `cachedClass` to `null` instead of the default `undefined`. And inside of `vm.lookup()`, an `undefined` class will be treated as `SmallInteger`.

## Polymorphic Inline Caching

Inline caching could go even further and store multiple lookup results per send site. This _polymorphic inline cache_ (PIC) helps with polymorphic code, where the receiver class is not constant. Eliot claims that ~90% of send sites are monomorphic, ~9% polymorphic, and ~0.9% _megamorphic_ (too many classes to efficiently cache inline). We'll focus on the 90% for now and leave PICs as a future optimization.

## Relinking Send Sites

We do not want to compile every method on its first encounter, because compilation is not free: it uses up time and memory. The current JIT sets a flag on a method when it is first activated, but then hands it off to the interpreter. Only on its second activation is it actually compiled. So even though for now we punt on implementing PICs (which naturally undergo an evolution from mono- to poly- to megamorphic), we still need to relink send sites. Unlike the C VM, we cannot simply patch something in the middle of a Javascript function. But function references are held in variables, and we can re-assign variables. So instead of simply storing the compiled function in the cache

~~~~~~~~~~~~~~~~~~~~ JS
cachedFunc = vm.compile(cachedMethod);
~~~~~~~~~~~~~~~~~~~~

we give the compiler a way to later replace the function with an optimized version:

~~~~~~~~~~~~~~~~~~~~ JS
vm.compile(cachedMethod, f => cachedFunc = f);
~~~~~~~~~~~~~~~~~~~~

Alternatively, we don't use individual variables for these, but an array instead. Then the VM can do both the lookup and the compilation in one go, and update the cache:

~~~~~~~~~~~~~~~~~~~~ JS
if (cache[0] !== receiver.sqClass) {
    vm.lookupAndCompile(cache, receiver.sqClass, selector, supered);
}
cache[2](...);
~~~~~~~~~~~~~~~~~~~~

This simplifies the generated code quite a bit, but we should measure the impact of calling functions from an array vs. from a variable (in this browser performance appears to be ?% of direct calls). Typically, it is about the same.
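As an aside, this array representation would also extend naturally to the polymorphic send sites deferred above. A hypothetical sketch (none of this is implemented):

~~~~~~~~~~~~~~~~~~~~ JS
// Hypothetical sketch: extending the cache array to a PIC by storing
// several [class, method, function] triples per send site.
const PIC_LIMIT = 6;                       // beyond this, treat the site as megamorphic
const cache = [];                          // [class0, method0, func0, class1, ...]

function sendPIC(vm, receiver, selector, args) {
    const cls = receiver.sqClass;          // undefined for SmallIntegers, as above
    for (let i = 0; i < cache.length; i += 3)
        if (cache[i] === cls) return cache[i + 2](receiver, ...args); // hit
    if (cache.length < 3 * PIC_LIMIT) {    // miss: grow the cache
        const method = vm.lookup(cls, selector, false);
        const func = vm.compile(method);
        cache.push(cls, method, func);
        return func(receiver, ...args);
    }
    return vm.send(selector, args.length, false); // megamorphic: generic send
}
~~~~~~~~~~~~~~~~~~~~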
Regarding optimal performance, we should pay attention to Slava Egorov's JSConf EU talk about dynamic dispatch ([Video](http://2014.jsconf.eu/speakers/vyacheslav-egorov-invokedynamic-js.html) and [Slides](http://mrale.ph/talks/jsconfeu2014/)). The scheme outlined here is partly inspired by that talk.

## Cache Invalidation

When a new method is added to the system, or the class hierarchy is changed, old lookup results need to be invalidated. The VM provides three primitives to flush the caches -- one to invalidate everything, the other two to invalidate based on the method and the selector, respectively. The vehicle to invalidate the caches could again be the "remote introspection" function mentioned previously, which lets us modify the variables inside our activations. However, we also need to be able to find all the activations (for a global flush), and we want to support selective invalidation for a particular method or selector.

TODO: design this invalidation housekeeping. Maybe keep a list of all send sites in the selector / method objects? And for full flushing do a treewalk (allInstances of Context)?

# Compiler Optimizations

Beyond a straight one-to-one mapping of bytecodes to JS instructions we could apply more optimizations. That's what most compilers nowadays do, and it is very much desirable for SqueakJS, too. But all optimizations have a cost. Our current SqueakJS JIT is optimized for simplicity and compilation speed. It is single-pass, and simply concatenates template strings for every bytecode. No analysis is performed.

However, we are not compiling to low-level machine code that will be executed as-is. We are targeting the Javascript JIT compiler, which will apply its own arsenal of optimizations to the code we give it. We just need to make it possible for it to do so. This is actually another argument for using the context-to-var mapping instead of a stack. Consider what our current JIT generates for the expression `3 + 4`:

~~~~~~~~~~~~~~~~~~~~
stack = vm.activeContext.pointers;
lit = method.pointers;
stack[++vm.sp] = lit[1];
stack[++vm.sp] = lit[2];
var a = stack[vm.sp - 1], b = stack[vm.sp];
if (typeof a === 'number' && typeof b === 'number') {
    stack[--vm.sp] = vm.primHandler.signed32BitIntegerFor(a + b);
} else { ... }
~~~~~~~~~~~~~~~~~~~~

This does not give the JS JIT much to work with. If we use temp vars instead of a stack, and resolve the literals at compile time:

~~~~~~~~~~~~~~~~~~~~
t0 = 3;
t1 = 4;
var a = t0, b = t1;
if (typeof a === 'number' && typeof b === 'number') {
    t0 = vm.primHandler.signed32BitIntegerFor(a + b);
} else { ... }
~~~~~~~~~~~~~~~~~~~~

In this case the JS JIT can infer that indeed `a` and `b` are both numbers, so it can omit generating the `else` case completely and replace the whole thing with

~~~~~~~~~~~~~~~~~~~~
t0 = vm.primHandler.signed32BitIntegerFor(7);
~~~~~~~~~~~~~~~~~~~~

We could go even further and inline part of the `signed32BitIntegerFor` call:

~~~~~~~~~~~~~~~~~~~~
t0 = 3;
t1 = 4;
var a = t0, b = t1;
if (typeof a === 'number' && typeof b === 'number') {
    a += b;
    t0 = a >= -0x40000000 && a <= 0x3FFFFFFF ? a : vm.makeLarge(a);
} else { ... }
~~~~~~~~~~~~~~~~~~~~

which would allow the JS JIT to reduce this to

~~~~~~~~~~~~~~~~~~~~
t0 = 7;
~~~~~~~~~~~~~~~~~~~~

Our guiding principle will be to keep our own optimizations to a minimum in order to have quick compiles, but to structure the generated code in a way that lets the host JIT perform its own optimizations well.
# Sketching a JIT

With all that in mind, let's imagine what a jitted method could look like. Maybe for `Integer>>benchFib`:

~~~~~~~~~~~~~~~~~~~~
benchFib
    ^ self < 2
        ifTrue: [1]
        ifFalse: [(self - 1) benchFib + (self - 2) benchFib + 1]
~~~~~~~~~~~~~~~~~~~~

Bytecodes:

~~~~~~~~~~~~~~~~~~~~
 0 <70> push: self
 1 <77> pushConst: 2
 2 <B2> send: #<
 3 <9A> jumpIfFalse: 7
 4 <76> pushConst: 1
 5 <A4 0B> jumpTo: 18
 7 <70> push: self
 8 <76> pushConst: 1
 9 <B1> send: #-
10 <D0> send: #benchFib
11 <70> push: self
12 <77> pushConst: 2
13 <B1> send: #-
14 <D0> send: #benchFib
15 <B0> send: #+
16 <76> pushConst: 1
17 <B0> send: #+
18 <7C> return: topOfStack
~~~~~~~~~~~~~~~~~~~~

## Current JIT

This is what the _current_ JIT generates, using the indexed fields of the real context object as stack (`vm.activeContext.pointers`):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ JS
function Integer_benchFib(vm) {
    // Integer>>benchFib
    var context = vm.activeContext;
    var stack = context.pointers;
    var rcvr = vm.receiver;
    var lit = vm.method.pointers;
    while (true) switch (vm.pc) {
    case 0:
        stack[++vm.sp] = rcvr;
        stack[++vm.sp] = 2;
        var a = stack[vm.sp - 1], b = stack[vm.sp];
        if (typeof a === 'number' && typeof b === 'number') {
            stack[--vm.sp] = a < b ? vm.trueObj : vm.falseObj;
        } else { vm.pc = 3; vm.sendSpecial(2); if (context !== vm.activeContext) return }
    case 3:
        var cond = stack[vm.sp--];
        if (cond === vm.falseObj) { vm.pc = 7; continue }
        else if (cond !== vm.trueObj) { vm.sp++; vm.pc = 4; vm.send(vm.specialObjects[25], 0, false); return }
    case 4:
        stack[++vm.sp] = 1;
        vm.pc = 18;
        continue;
    case 7:
        stack[++vm.sp] = rcvr;
        stack[++vm.sp] = 1;
        var a = stack[vm.sp - 1], b = stack[vm.sp];
        if (typeof a === 'number' && typeof b === 'number') {
            stack[--vm.sp] = vm.primHandler.signed32BitIntegerFor(a - b);
        } else { vm.pc = 10; vm.sendSpecial(1); if (context !== vm.activeContext) return }
    case 10:
        vm.pc = 11;
        vm.send(lit[1], 0, false);
        if (context !== vm.activeContext) return;
    case 11:
        stack[++vm.sp] = rcvr;
        stack[++vm.sp] = 2;
        var a = stack[vm.sp - 1], b = stack[vm.sp];
        if (typeof a === 'number' && typeof b === 'number') {
            stack[--vm.sp] = vm.primHandler.signed32BitIntegerFor(a - b);
        } else { vm.pc = 14; vm.sendSpecial(1); if (context !== vm.activeContext) return }
    case 14:
        vm.pc = 15;
        vm.send(lit[1], 0, false);
        if (context !== vm.activeContext) return;
    case 15:
        var a = stack[vm.sp - 1], b = stack[vm.sp];
        if (typeof a === 'number' && typeof b === 'number') {
            stack[--vm.sp] = vm.primHandler.signed32BitIntegerFor(a + b);
        } else { vm.pc = 16; vm.sendSpecial(0); if (context !== vm.activeContext) return }
    case 16:
        stack[++vm.sp] = 1;
        var a = stack[vm.sp - 1], b = stack[vm.sp];
        if (typeof a === 'number' && typeof b === 'number') {
            stack[--vm.sp] = vm.primHandler.signed32BitIntegerFor(a + b);
        } else { vm.pc = 18; vm.sendSpecial(0); if (context !== vm.activeContext) return }
    case 18:
        vm.pc = 19;
        vm.doReturn(stack[vm.sp]);
        return;
    default:
        vm.interpretOne(true);
        return;
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

## Idea: context proxy and inline caches

Here is the same method with a context proxy and inline caches:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ JS
var cache = new Array(3*8).fill(null); // [class, method, function] per send site
function Integer_benchFib(vm, depth, sender, method, closureOrNil, rcvr) {
    var pc = 0;
    var thisContext, t1, t2, t3;
    function thisProxy(op, index, value) {
        switch (op) {
            case "read": switch (index) {
                case 0: return sender;
                case 1: return pc;
                case 2: /* stackp */ return [0,1,2,1,0,1,0,1,2,1,1,2,3,2,2,1,2,1][pc];
                case 3: return method;
                case 4: return closureOrNil;
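                // (field indices 0-5 mirror the Squeak context layout:
                //  sender, pc, stackp, method, closureOrNil, receiver;
                //  6-8 are this method's temps, and -1 is used by the
                //  mockup to access the reified thisContext itself)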
                case 5: return rcvr;
                case 6: return t1;
                case 7: return t2;
                case 8: return t3;
                default:
                    if (index === -1) {
                        if (!thisContext) thisContext = vm.makeContext(thisProxy);
                        return thisContext;
                    }
                    throw Error();
            }
            case "write": switch (index) {
                case 0: sender = value; return;
                case 1: pc = value; return;
                case 2: /* stackp is virtual (derived from pc), not writable in this mockup */ return;
                case 3: method = value; return;
                case 4: closureOrNil = value; return;
                case 5: rcvr = value; return;
                case 6: t1 = value; return;
                case 7: t2 = value; return;
                case 8: t3 = value; return;
                default:
                    if (index === -1 && !thisContext) { thisContext = value; return; }
                    throw Error();
            }
            case "invalidate": cache.fill(null); return;
        }
    }
    // handleDepthAndInterrupts checks nesting depth and context switches.
    // If necessary it uses thisProxy to reify this context (and possibly its senders)
    depth++;
    if (--vm.interruptCheckCounter <= 0 && vm.handleDepthAndInterrupts(depth, thisProxy) === true) return false;
    while (true) switch (pc) {
    case 0:
        t1 = rcvr; t2 = 2;
        if (typeof t1 === 'number' && typeof t2 === 'number') {
            t1 = t1 < t2 ? vm.trueObj : vm.falseObj;
        } else {
            if (cache[3*0] !== t1.sqClass) vm.updateCache(cache, 3*0, t1.sqClass, vm.specialSendSelector[2], false); // #<
            pc = 3;
            t1 = cache[3*0+2](vm, depth, thisProxy, cache[3*0+1], null, t1, t2);
            if (t1 === false) return false;
        }
    case 3:
        if (t1 === vm.falseObj) { pc = 7; continue }
        else if (t1 !== vm.trueObj) {
            if (cache[3*1] !== t1.sqClass) vm.updateCache(cache, 3*1, t1.sqClass, vm.specialObjects[25], false); // #mustBeBoolean
            pc = 4;
            t1 = cache[3*1+2](vm, depth, thisProxy, cache[3*1+1], null, t1);
            if (t1 === false) return false;
        }
    case 4:
        t1 = 1;
        pc = 18;
        continue;
    case 7:
        t1 = rcvr; t2 = 1;
        if (typeof t1 === 'number' && typeof t2 === 'number') {
            t1 -= t2;
            t1 = t1 >= -0x40000000 && t1 <= 0x3FFFFFFF ? t1 : vm.makeLarge(t1);
        } else {
            if (cache[3*2] !== t1.sqClass) vm.updateCache(cache, 3*2, t1.sqClass, vm.specialSendSelector[1], false); // #-
            pc = 10;
            t1 = cache[3*2+2](vm, depth, thisProxy, cache[3*2+1], null, t1, t2);
            if (t1 === false) return false;
        }
    case 10:
        if (cache[3*3] !== t1.sqClass) vm.updateCache(cache, 3*3, t1.sqClass, method.pointers[1], false); // #benchFib
        pc = 11;
        t1 = cache[3*3+2](vm, depth, thisProxy, cache[3*3+1], null, t1);
        if (t1 === false) return false;
    case 11:
        t2 = rcvr; t3 = 2;
        if (typeof t2 === 'number' && typeof t3 === 'number') {
            t2 -= t3;
            t2 = t2 >= -0x40000000 && t2 <= 0x3FFFFFFF ? t2 : vm.makeLarge(t2);
        } else {
            if (cache[3*4] !== t2.sqClass) vm.updateCache(cache, 3*4, t2.sqClass, vm.specialSendSelector[1], false); // #-
            pc = 14;
            t2 = cache[3*4+2](vm, depth, thisProxy, cache[3*4+1], null, t2, t3);
            if (t2 === false) return false;
        }
    case 14:
        if (cache[3*5] !== t2.sqClass) vm.updateCache(cache, 3*5, t2.sqClass, method.pointers[1], false); // #benchFib
        pc = 15;
        t2 = cache[3*5+2](vm, depth, thisProxy, cache[3*5+1], null, t2);
        if (t2 === false) return false;
    case 15:
        if (typeof t1 === 'number' && typeof t2 === 'number') {
            t1 += t2;
            t1 = t1 >= -0x40000000 && t1 <= 0x3FFFFFFF ? t1 : vm.makeLarge(t1);
        } else {
            if (cache[3*6] !== t1.sqClass) vm.updateCache(cache, 3*6, t1.sqClass, vm.specialSendSelector[0], false); // #+
            pc = 16;
            t1 = cache[3*6+2](vm, depth, thisProxy, cache[3*6+1], null, t1, t2);
            if (t1 === false) return false;
        }
    case 16:
        t2 = 1;
        if (typeof t1 === 'number' && typeof t2 === 'number') {
            t1 += t2;
            t1 = t1 >= -0x40000000 && t1 <= 0x3FFFFFFF ?
                t1 : vm.makeLarge(t1);
        } else {
            if (cache[3*7] !== t1.sqClass) vm.updateCache(cache, 3*7, t1.sqClass, vm.specialSendSelector[0], false); // #+
            pc = 18;
            t1 = cache[3*7+2](vm, depth, thisProxy, cache[3*7+1], null, t1, t2);
            if (t1 === false) return false;
        }
    case 18:
        return t1;
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The main difference here is that the stack is "virtualized", and that method lookups are cached. The stack contents are held in temp vars (`t1`-`t3`), and the `thisProxy` function allows indexed access to them. The stack pointer is virtualized as well, derived from the `pc`. When the actual context is needed, `thisContext` is created on the fly. That would be a real context object, but it would hold onto the `thisProxy` function to fetch the actual values.

The inline cache holds the result of the previous method lookup. Each cache line has 3 entries: a class, a method, and a function. If the receiver class is the same as in a previous invocation, the method and function are used without a lookup. Otherwise, `vm.updateCache()` is invoked to do the method lookup and compilation, and the result is written to the cache. To reduce the number of closure variables, a single `cache` array per method holds the entries for all its send sites.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ JS
updateCache(cache, index, cls, selector, supered) {
    var method = this.lookup(cls, selector, supered);
    var func = method.compiled || this.compile(method);
    cache[index] = cls;
    cache[index+1] = method;
    cache[index+2] = func;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Performance estimate

Running the code above in a mockup harness on this browser shows a performance of ? million sends/s. You may inspect the browser's console to see the context inspection working -- once every million sends, the full context call chain is logged. You could also play with the code at https://codepen.io/codefrau/pen/JjbmVGw

On my machine:

| | Current JIT | New JIT Mockup | Speedup | vs Cog
|---------|--------------|----------------|---------|------
| Chrome  | 1.6 MSends/s | 38 MSends/s    | 23x     | 19%
| Firefox | 2.2 MSends/s | 20 MSends/s    | 9x      | 10%
| Safari  | 4.1 MSends/s | 13 MSends/s    | 3x      | 7%

For comparison, Eliot's Cog VM does about 200 MSends/s on the same machine, which is pretty much exactly what the [simple benchmark](#contextmapping/context-to-varmapping) above achieves. The question is how to get close to that speed -- that will likely take a wholly different approach, so I'm not too worried about it here.

## More realistic approach: `yield`

The scheme above does not actually implement process switching inside of `handleDepthAndInterrupts()`. We need to be able to suspend the current chain of execution and resume another one. One way to do that in JS is via co-routines, using `function*` generators and `yield`. I implemented a mockup of this here: https://codepen.io/codefrau/pen/RwomBOK

| | Current JIT | Yielding JIT | Speedup | vs Cog
|---------|--------------|---------------|---------|------
| Chrome  | 1.6 MSends/s | 13.5 MSends/s | 9x      | 7%
| Firefox | 2.2 MSends/s | 2.1 MSends/s  | 1x      | 1%
| Safari  | 4.1 MSends/s | 9.6 MSends/s  | 2x      | 5%

This is significantly slower, because every suspendable function (that is, every method) needs to be a generator `function*`, and needs to be invoked using `yield*`. This makes each send noticeably more expensive, so much so that on Firefox this has no benefit over the current JIT at all.
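For illustration, the shape of such a yielding method might look like this sketch (heavily simplified, with assumed names; the real mockup is in the codepen above):

~~~~~~~~~~~~~~~~~~~~ JS
// Sketch of the yielding JIT's overhead: every method becomes a
// generator, every send a yield*.
function* Integer_benchFib(vm, rcvr) {
    if (--vm.interruptCheckCounter <= 0) yield;      // whole chain can suspend here
    if (rcvr < 2) return 1;
    const a = yield* Integer_benchFib(vm, rcvr - 1); // a send: costs a yield*
    const b = yield* Integer_benchFib(vm, rcvr - 2);
    return a + b + 1;
}

// the scheduler owns the outermost generator and can pause/resume it:
const vm = { interruptCheckCounter: 1000 };
const activation = Integer_benchFib(vm, 10);
let step = activation.next();
while (!step.done) {                                 // resume after each "process switch"
    vm.interruptCheckCounter = 1000;
    step = activation.next();
}
console.log(step.value); // 177
~~~~~~~~~~~~~~~~~~~~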
## Exception-based scheme

Based on C. Scott's [comments](https://twitter.com/cscottnet/status/1369012854528020488), I also implemented a different scheme. Instead of always allocating the `thisProxy` closure and passing it to inner sends, it wraps the whole method in a try/catch block and only reads the context variables on unwind. Any context access triggers this unwinding and context allocation for the whole nested call sequence. This needs the help of an interpreter to be able to continue after the unwind. Try it at https://codepen.io/codefrau/pen/bGBXpPN

This looks to be at least 2x as efficient (with relatively few unwinds):

| | Current JIT | Unwinding JIT | Speedup | vs Cog
|---------|--------------|---------------|---------|--------
| Chrome  | 1.6 MSends/s | 49 MSends/s   | 30x     | 25%
| Firefox | 2.2 MSends/s | 8 MSends/s    | 4x      | 4%
| Safari  | 4.1 MSends/s | 18 MSends/s   | 4x      | 9%

However, it's hard to guess if this is enough to offset the performance penalty of constantly unwinding and reifying contexts, losing inline caches, etc. We would at least need to find a way to deal with block closures -- they do refer to thisContext, so in this scheme every iteration would trigger an unwind. It does simplify the design quite a bit though, since the nested calling would be strictly temporary, and would not affect the rest of the VM at all.

## Exception-based scheme with Blocks and Non-local Returns

There is a runnable mockup of block contexts (but not yet closures) with non-local returns using the exception-based scheme at https://codepen.io/codefrau/pen/YzNWpxO

It implements Array>>do:

~~~~~~~~~~~~~~~~~~~~
do: aBlock
    1 to: self size do:
        [:index | aBlock value: (self at: index)]
~~~~~~~~~~~~~~~~~~~~

Bytecodes:

~~~~~~~~~~~~~~~~~~~~
 0 <70> self
 1 <C2> send: size
 2 <6A> popIntoTemp: 2
 3 <76> pushConstant: 1
 4 <69> popIntoTemp: 1
 5 <11> pushTemp: 1
 6 <12> pushTemp: 2
 7 <B4> send: <=
 8 <A8 0C> jumpFalse: 22
10 <10> pushTemp: 0
11 <70> self
12 <11> pushTemp: 1
13 <C0> send: at:
14 <CA> send: value:
15 <87> pop
16 <11> pushTemp: 1
17 <76> pushConstant: 1
18 <B0> send: +
19 <69> popIntoTemp: 1
20 <A3 EF> jumpTo: 5
22 <78> returnSelf
~~~~~~~~~~~~~~~~~~~~

Jitted method:

~~~~~~~~~~~~~~~~~~~~ JS
const cache = new Array(2*7).fill(null); // [class, function]
const PCtoSP = [3,4,4,3,4,3,4,5,4,4,3,4,5,6,5,4,3,4,5,4,3,3,3];
return function Array_do_(rcvr, t0) {
    let pc = 0;
    let t1 = vm.nilObj, t2 = vm.nilObj;
    let s3, s4, s5, s6;
    try {
        if (--vm.depth <= 0 || --vm.interruptCheckCounter <= 0) throw { message: "interrupted" };
        while (true) switch (pc) {
        case 0: // <70> self
            s3 = rcvr;
        case 1: // <C2> send: size
            if (s3.sqClass === "Array") s3 = s3.pointers.length;
            else { pc = 1; throw Error("todo: full send of #size"); }
        case 2: // <6A> popIntoTemp: 2
            t2 = s3;
        case 3: // <76> pushConstant: 1
            s3 = 1;
        case 4: // <69> popIntoTemp: 1
            t1 = s3;
        case 5: // <11> pushTemp: 1
            s3 = t1;
        case 6: // <12> pushTemp: 2
            s4 = t2;
        case 7: // <B4> send: <=
            if (typeof s3 === 'number' && typeof s4 === 'number') {
                s3 = s3 <= s4 ?
                    vm.trueObj : vm.falseObj;
            } else { pc = 7; throw Error("todo: full send of #<="); }
        case 8: // <A8 0C> jumpFalse: 22
            if (s3 === vm.falseObj) { pc = 22; continue }
            else if (s3 !== vm.trueObj) { pc = 8; throw Error("todo: send #mustBeBoolean"); }
        case 10: // <10> pushTemp: 0
            s3 = t0;
        case 11: // <70> self
            s4 = rcvr;
        case 12: // <11> pushTemp: 1
            s5 = t1;
        case 13: // <C0> send: at:
            if (s4.sqClass === "Array" && typeof s5 === "number" && s5 > 0 && s5 <= s4.pointers.length) s4 = s4.pointers[s5 - 1];
            else { pc = 13; throw Error("todo: full send of #at:"); }
        case 14: // <CA> send: value:
            if (s3.sqClass === "BlockContext") {
                pc = 14;
                if (!s3.compiled) vm.jitBlock(s3); // might throw
                pc = 15;
                s3 = s3.compiled(s4);
            } else { pc = 14; throw Error("todo: full send of #value:"); }
        case 15: // <87> pop
            // no-op
        case 16: // <11> pushTemp: 1
            s3 = t1;
        case 17: // <76> pushConstant: 1
            s4 = 1;
        case 18: // <B0> send: +
            if (typeof s3 === 'number' && typeof s4 === 'number') {
                s3 += s4;
                if (s3 < -0x40000000 || s3 > 0x3FFFFFFF) s3 = vm.makeLarge(s3);
            } else { pc = 18; throw Error("todo: full send of #+"); }
        case 19: // <69> popIntoTemp: 1
            t1 = s3;
        case 20: // <A3 EF> jumpTo: 5
            pc = 5;
            continue;
        case 22: // <78> returnSelf
            vm.depth++;
            return rcvr;
        default:
            throw Error("not implemented @" + pc);
        }
    } catch (frame) {
        if ("nonLocalReturnValue" in frame) { vm.depth++; throw frame; }
        // reify context and unwind all the way out of JIT
        frame.ctx = {
            sqClass: "MethodContext",
            pointers: [frame.ctx, pc, PCtoSP[pc], method, vm.nilObj, rcvr, t0, t1, t2, s3, s4, s5, s6]
        };
        throw frame;
    }
};
~~~~~~~~~~~~~~~~~~~~

which is used by Array>>has42

~~~~~~~~~~~~~~~~~~~~
has42
    self do: [:i |
        i == 42 ifTrue: [^true] ].
    ^false
~~~~~~~~~~~~~~~~~~~~

Bytecodes:

~~~~~~~~~~~~~~~~~~~~
 0 <70> self
 1 <89> pushThisContext:
 2 <76> pushConstant: 1
 3 <C8> send: blockCopy:    <== push block instance
 4 <A4 08> jumpTo: 14
 6 <68> popIntoTemp: 0      <== start of block code
 7 <10> pushTemp: 0
 8 <20> pushConstant: 42
 9 <C6> send: ==
10 <98> jumpFalse: 12
11 <79> return: true        <== non-local return
12 <73> pushConstant: nil
13 <7D> blockReturn         <== end of block code
14 <CB> send: do:
15 <87> pop
16 <7A> return: false
~~~~~~~~~~~~~~~~~~~~

JavaScript:

~~~~~~~~~~~~~~~~~~~~ JS
const cache = new Array(2*7).fill(null); // [class, function]
const PCtoSP = [1,2,3,4,3,3,1,0,1,2,1,0,0,1,2,1,0];
// context field indices SENDER = 0, PC = 1, STACKP = 2, and BLK_HOME = 5
// are assumed to be defined elsewhere in the mockup
function Array_has42(rcvr) {
    let pc = 0;
    let t0 = vm.nilObj;
    let s1, s2, s3, s4;
    const thisContext = { sqClass: "MethodContext" };
    let returnFromThisContext;
    try {
        if (--vm.depth <= 0 || --vm.interruptCheckCounter <= 0) throw { message: "interrupted" };
        while (true) switch (pc) {
        case 0: // <70> self
            s1 = rcvr;
        case 1: // <89> pushThisContext:
            pc = 1; // in case we have to throw
            s2 = thisContext;
        case 2: // <76> pushConstant: 1
            s3 = 1;
        case 3: // <C8> send: blockCopy:
            if (s2 === thisContext && typeof s3 === "number") {
                s2 = {
                    sqClass: "BlockContext",
                    pointers: [vm.nilObj, 0, 0, s3, 6, s2],
                    compiled: function Array_has42_6(b0) {
                        if (thisContext.pointers) throw Error("method context was unwound, cannot run compiled block");
                        let pc = 6;
                        let b1, b2;
                        try {
                            if (--vm.depth <= 0 || --vm.interruptCheckCounter <= 0) throw { message: "interrupted" };
                            // case 4 jumpTo: 14
                            while (true) switch (pc) {
                            case 6: // <68> popIntoTemp: 0
                                t0 = b0;
                            case 7: // <10> pushTemp: 0
                                b0 = t0;
                            case 8: // <20> pushConstant: 42
                                b1 = 42;
                            case 9: // <C6> send: ==
                                b0 = b0 === b1 ?
                                    vm.trueObj : vm.falseObj;
                            case 10: // <98> jumpFalse: 12
                                if (b0 === vm.falseObj) { pc = 12; continue }
                                else if (b0 !== vm.trueObj) { pc = 10; throw Error("todo: send #mustBeBoolean"); }
                            case 11: // <79> return: true
                                // non-local return
                                throw returnFromThisContext = {
                                    message: "non-local return",
                                    ctx: this.pointers[BLK_HOME],
                                    nonLocalReturnValue: vm.trueObj,
                                };
                            case 12: // <73> pushConstant: nil
                                b0 = vm.nilObj;
                            case 13: // <7D> blockReturn
                                vm.depth++;
                                return b0;
                            default:
                                throw Error("block6 unimplemented @" + pc);
                            }
                        } catch (frame) {
                            if ("nonLocalReturnValue" in frame) { vm.depth++; throw frame; }
                            // reify context and unwind all the way out of JIT
                            this.pointers[SENDER] = frame.ctx;
                            this.pointers[PC] = pc;
                            this.pointers[STACKP] = PCtoSP[pc];
                            this.pointers.push(b0, b1, b2);
                            frame.ctx = this;
                            throw frame;
                        }
                    },
                };
            } else throw Error("thisContext");
        case 14: // <CB> send: do:
            if (cache[2*0] !== s1.sqClass) vm.jitMethod(cache, 2*0, s1.sqClass, vm.specialSelectors[54], false); // #do:
            pc = 15;
            s1 = cache[2*0+1](s1, s2);
        case 15: // <87> pop
            // no-op
        case 16: // <7A> return: false
            vm.depth++;
            return vm.falseObj;
        default:
            throw Error("not implemented @" + pc);
        }
    } catch (frame) {
        if ("nonLocalReturnValue" in frame) {
            vm.depth++;
            if (frame === returnFromThisContext) return frame.nonLocalReturnValue;
            throw frame;
        }
        // reify context and unwind all the way out of JIT
        thisContext.pointers = [frame.ctx, pc, PCtoSP[pc], method, vm.nilObj, rcvr, t0, s1, s2, s3, s4];
        frame.ctx = thisContext;
        throw frame;
    }
}
return Array_has42;
~~~~~~~~~~~~~~~~~~~~

which are used in a simple DoIt

~~~~~~~~~~~~~~~~~~~~
DoIt
    #(3 4 7 13) has42.
    #(3 4 42 13) has42.
~~~~~~~~~~~~~~~~~~~~

Bytecodes:

~~~~~~~~~~~~~~~~~~~~
0 <21> pushConstant: #(3 4 7 13)
1 <D0> send: has42
2 <87> pop
3 <22> pushConstant: #(3 4 42 13)
4 <D0> send: has42
5 <87> pop
6 <78> returnSelf
~~~~~~~~~~~~~~~~~~~~

The second invocation of `has42` does find a 42, and it successfully performs a non-local return, proving that the scheme is viable.

### Performance estimate

_Update Nov 2023:_ To see how this idea (and some more optimizations) would perform, I implemented a [mockup of benchFib](jit-perf.html) (click that link to see it running in this browser). The code is at https://github.com/codefrau/SqueakJS/blob/gh-pages/docs/jit-perf.js

On my machine (this is a newer machine than above, so the numbers are not directly comparable):

| Chrome MSends/s | Safari MSends/s | Firefox MSends/s | |
|--------|--------|--------|---------
| 3 | 7 | 4 | current JIT
| 88 | 48 | 62 | JIT idea
| 99 | 49 | 85 | no array: use locals for inline cache
| 172 | 61 | 70 | no switch: convert jumps to control flow
| 187 | 64 | 106 | no switch, no array
| 308 | 76 | 175 | no switch, no array, no type checks: assume an unsafe bytecode (SISTA) that does not require type checks
| _339_ | _626_ | _260_ | _plain JS (not a JIT mockup)_

For comparison, Eliot's Cog VM does about *310 MSends/s* on the same machine. That is pretty much identical to the plain Chrome JavaScript performance; only Safari gets 2x that speed (then again, this is an Apple-silicon Mac). For the JIT mockups, Safari does a lot worse, while Chrome comes out way ahead. The numbers are promising: even if they're not entirely realistic, they show that a lot more performance is possible.

## One more idea: Stack capture and restoration

The exception-based scheme lets us capture the stack on unwind, but not restore it. When resuming a process, we have to use the interpreter for all the contexts that were unwound.
In the yield-based scheme the call chain is preserved, but it does not support capturing the stack. And in both cases, there's no way to recreate the stack starting from actual context objects.

An idea to do that is inspired by [Stopify](https://arxiv.org/abs/1802.02974). It basically instruments each method so that it supports "normal" invocations, which perform the computation and return a result, as well as invocations that only "restore" the call chain, based on a global "mode" flag. The chain is captured by setting the mode to "capture", which causes an immediate return all the way up the call chain. This scheme can be implemented without exceptions or yield, so it is potentially faster than both.

Here is an example function from the Stopify paper:

~~~~~~~~~~~~~~~~~~~~ JS
function P(f, g, x) {
    return f(g(x));
}
~~~~~~~~~~~~~~~~~~~~

This gets instrumented as follows:

~~~~~~~~~~~~~~~~~~~~ JS
var mode = 'normal';
var stack = [];
function P(f, g, x) {
    var t0, l, k;
    if (mode === 'restore') {
        k = stack.pop();
        [t0] = k.locals;
        l = k.label;
        k = stack[stack.length - 1];
    }
    function locals() { return [t0]; }
    function reenter() { return P(f, g, x); }
    if (mode === 'normal' || (mode === 'restore' && l === 0)) {
        t0 = mode === 'normal' ? g(x) : k.reenter();
        if (mode === 'capture') {
            stack.push({ label: 0, locals: locals(), reenter: reenter });
            return;
        }
    }
    return f(t0);
}
~~~~~~~~~~~~~~~~~~~~

_TODO: implement this on codepen_

# What about ...

## Primitive calls

Non-inlined primitives expect their arguments on a stack. They will have to be rewritten to take advantage of direct argument passing.

## Object representation

Currently, objects are "optimized" for interpreter access by storing instance variables in an array. Using object properties instead of arrays would likely be better for the JIT. The JIT can pre-compute that property name, but the interpreter would have to do it for every access. We could use `object.inst0` ... `object.instN` to at least allow the interpreter access via a computed field name like `object["inst"+index]`. Even more efficient for the JIT (and even less efficient for the interpreter) would be using class-specific inst var names to allow object-shape specialization.

Garbage collection might be an issue, since it needs to follow the references from every instance variable. Iterating over indexed fields is simple and efficient, iterating over object properties less so. Also, all primitives accessing inst vars would need to be changed.

# What next?

I'll gather feedback from people, ~~and will try to implement a mockup of this. Maybe by hand-coding a version of `benchFib`.~~ (Update: mockups added above)

For feedback, maybe use this [issue](https://github.com/codefrau/SqueakJS/issues/121) on the bug tracker. Happy to discuss elsewhere, too ([email](http://lists.squeak.org/mailman/listinfo/vm-dev) / [twitter](https://twitter.com/codefrau)).