Worth mentioning that MeshOptimizer (https://github.com/zeux/meshoptimizer) has become one of a handful of 'hidden champion' pillar libraries that probably carry half of the gaming industry.
Basically the curl of asset pipelines ;)
https://github.com/zeux/meshoptimizer/discussions/986
While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generated code is inferior to or on par with SLEB128 for one-byte encodings on x86, AArch64, and RISC-V. Haven't tried wider values, but zigzag encoding is likely slower there as well.
// One-byte case for SLEB128 (bit 6 of the 7-bit payload is the sign bit)
int64_t from_signext(uint64_t v) {
  return v >= 64 ? v - 128 : v;
}
// One-byte case for ULEB128 with zig-zag encoding
int64_t from_zigzag(uint64_t z) {
  return (z >> 1) ^ -(z & 1);
}
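For reference, a minimal round-trip sketch of the two one-byte schemes quoted above. The helpers `to_zigzag` and `to_signext7` and the name `from_signext7` are hypothetical and not from the linked discussion. The sketch assumes the usual SLEB128 convention that bit 6 of a single-byte payload is the sign bit, and that `>>` on a negative `int64_t` is an arithmetic shift (true on mainstream compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Zigzag (64-bit variant): maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... */
uint64_t to_zigzag(int64_t i) {
    /* Left-shift as unsigned to avoid UB; i >> 63 is all-ones for
       negatives, assuming an arithmetic right shift. */
    return ((uint64_t)i << 1) ^ (uint64_t)(i >> 63);
}

int64_t from_zigzag(uint64_t z) {
    /* 0 - (z & 1) is all-ones when the low bit is set, else zero. */
    return (int64_t)((z >> 1) ^ (0 - (z & 1)));
}

/* 7-bit two's-complement payload of a one-byte SLEB128 value. */
uint64_t to_signext7(int64_t i) {
    assert(i >= -64 && i <= 63); /* only the one-byte range is handled */
    return (uint64_t)i & 0x7f;
}

int64_t from_signext7(uint64_t v) {
    /* Bit 6 set (v >= 64) means the value is negative. */
    return v >= 64 ? (int64_t)v - 128 : (int64_t)v;
}
```

Both schemes cover the same range, -64 to 63, in a single byte; they differ only in which byte value maps to which integer. For example, -1 is `1` under zigzag and `0x7f` under sign extension.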
This sort of analysis is great.
Now why can't compilers do this sort of thing automatically?
It seems almost any problem can be sped up 1000x in AVX512+days of thought compared to the naive version written in a python loop. If we could automate that whole process for big codebases, the performance gains could be huge.
Compilers can’t really, in a meaningful way, change the layout of your data in memory. And you do need to think about your memory layout to get any benefit from SIMD. You’ll notice that a lot of compiler auto-vectorization inserts many instructions just to shuffle data into a usable layout, which negates much of the benefit.
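To make the layout point concrete, here is a minimal sketch contrasting an array-of-structures with a structure-of-arrays. All names (`ParticleAoS`, `ParticlesSoA`, `scale_x_aos`, `scale_x_soa`, `MAX_PARTICLES`) are made up for illustration. Whether a given compiler actually vectorizes either loop depends on the target and flags, so treat the comments as the typical expectation rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTICLES 1024

/* Array of structures: consecutive x values are
   sizeof(struct ParticleAoS) bytes apart. */
struct ParticleAoS { float x, y, z; };

/* Structure of arrays: all x values are contiguous. */
struct ParticlesSoA {
    float x[MAX_PARTICLES];
    float y[MAX_PARTICLES];
    float z[MAX_PARTICLES];
};

void scale_x_aos(struct ParticleAoS *p, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        p[i].x *= k;   /* strided access: vectorizing needs gathers or shuffles */
}

void scale_x_soa(struct ParticlesSoA *p, size_t n, float k) {
    assert(n <= MAX_PARTICLES);
    for (size_t i = 0; i < n; i++)
        p->x[i] *= k;  /* unit-stride access: maps onto plain vector loads/stores */
}
```

The two functions compute the same thing; only the programmer's choice of layout differs, and that choice is visible to every other piece of code that touches the data, which is why a C compiler cannot silently switch from one to the other.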
Depends on the programming language. A good question is why we don't have more optimizable languages in mainstream use.
Are there any programming languages which change the data layout beyond naively sorting struct members by alignment? (That at best helps reduce padding bytes, but it can be either good or bad for performance, depending on the code that accesses the data.)
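As a small illustration of what sorting members by alignment buys, here is a sketch with two hypothetical structs (`Declared` and `Sorted`) holding the same members in different order. Exact sizes are ABI-dependent; on common 64-bit ABIs such as x86-64 System V and AArch64 they should come out as 24 and 16 bytes, but the code only relies on the sorted version not being larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Members in "logical" declaration order. */
struct Declared {
    uint8_t  tag;
    uint64_t id;     /* typically preceded by padding after tag */
    uint8_t  flags;
    uint32_t count;  /* typically preceded by padding after flags */
};

/* Same members, sorted by descending alignment. */
struct Sorted {
    uint64_t id;
    uint32_t count;
    uint8_t  tag;
    uint8_t  flags;
};

static_assert(sizeof(struct Sorted) <= sizeof(struct Declared),
              "sorting by alignment should not add padding");

/* Bytes saved per object by reordering, on the current ABI. */
size_t padding_saved(void) {
    return sizeof(struct Declared) - sizeof(struct Sorted);
}
```

This only shrinks the object. It says nothing about which members are accessed together, which is the part that decides whether the new layout is faster or slower.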
> Now why can't compilers do this sort of thing automatically?
They do - they just can't assume GFNI instructions are present unless you explicitly say so: https://godbolt.org/z/eYasbKsse
Intent matters. Compilers can't see through your skull to infer your intent, so they behave very conservatively unless you override that behavior somehow. This inference, alas, also takes (much) time, so compilers have to balance compilation time against the quality of the intent they guess. (This is why we can't exactly use LLMs in mainstream compilers, by the way.) So go and make a programming language that preserves your intent by every means; but making it practical would be very difficult.
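One familiar instance of "conservative unless you override it" in C is pointer aliasing, offered here as an illustration rather than as what the comment above specifically had in mind. In the sketch below (function names are hypothetical), the first version must allow for `dst` overlapping `a` or `b`, so a compiler typically either emits a runtime overlap check or falls back to scalar code. The second version uses the standard C99 `restrict` qualifier to state the programmer's intent that the arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* The compiler has to assume dst may overlap a or b. */
void add_may_alias(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* restrict promises that dst, a, and b do not overlap; breaking that
   promise is undefined behavior. */
void add_no_alias(float *restrict dst, const float *restrict a,
                  const float *restrict b, size_t n) {
    /* Debug-only sanity check; it cannot catch partial overlap. */
    assert(dst != a && dst != b);
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

For non-overlapping inputs both functions produce identical results; the difference is only in what the compiler is allowed to assume while optimizing.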
> 1000x in AVX512+days of thought compared to the naive version written in a python loop
Out of this 1000x speedup you get 100x by just not using Python though ;)
Also IIRC the main problem specifically with AVX512 was that mainstream CPUs simply didn't have it, so a smart compiler isn't of much use when the output code only runs on a handful of devices.