Astro - Hacker News

30 comments

moring 17 minutes ago

The article shows nicely how "every byte matters" is false. First, it starts off by talking about the cost of a new field, when the actual topic is array-of-structs vs. struct-of-arrays. Then, this:
> How much of an impact can this have?
> Reading is:alive (1 byte) Across 1M Monsters
You aren't reading one byte here, you are reading 1M bytes! Of course, optimizing the access to 1M bytes is something to consider. Optimizing the access to one byte isn't.
The article is definitely worth reading IMHO, but it really needs a better headline!
- jayd16 13 minutes ago
  
  Even more so, it shows that an SoA data structure means you can add fields to your 1M monsters with little impact.
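To make the array-of-structs vs. struct-of-arrays point concrete, here is a minimal Java sketch. The names `Monster`, `MonsterColumns`, and `LayoutDemo` are hypothetical and not taken from the article. It assumes that a `boolean[]` is stored as one contiguous byte per element, which is how HotSpot behaves in practice but is not guaranteed by the language specification.

```java
/** Array-of-structs: one heap object per monster, reached through a reference array. */
final class Monster {
    boolean alive;
    int health;
    double x, y;

    Monster(boolean alive, int health, double x, double y) {
        this.alive = alive;
        this.health = health;
        this.x = x;
        this.y = y;
    }
}

/** Struct-of-arrays: one primitive array per field, indexed by monster id. */
final class MonsterColumns {
    final boolean[] alive;
    final int[] health;
    final double[] x;
    final double[] y;

    MonsterColumns(int count) {
        alive = new boolean[count];
        health = new int[count];
        x = new double[count];
        y = new double[count];
    }
}

final class LayoutDemo {
    /** AoS scan: every iteration follows a reference to a Monster to read a single flag. */
    static int countAliveAos(Monster[] monsters) {
        int n = 0;
        for (Monster m : monsters) {
            if (m.alive) n++;
        }
        return n;
    }

    /** SoA scan: walks only the alive[] array; the other fields are never touched. */
    static int countAliveSoa(MonsterColumns monsters) {
        int n = 0;
        for (boolean a : monsters.alive) {
            if (a) n++;
        }
        return n;
    }

    /** Builds a pseudo-random population (fixed seed so runs are repeatable). */
    static Monster[] randomMonsters(int count, long seed) {
        java.util.Random rng = new java.util.Random(seed);
        Monster[] monsters = new Monster[count];
        for (int i = 0; i < count; i++) {
            monsters[i] = new Monster(rng.nextBoolean(), rng.nextInt(100),
                                      rng.nextDouble(), rng.nextDouble());
        }
        return monsters;
    }

    /** Copies the same population into the column layout. */
    static MonsterColumns toColumns(Monster[] monsters) {
        MonsterColumns cols = new MonsterColumns(monsters.length);
        for (int i = 0; i < monsters.length; i++) {
            cols.alive[i] = monsters[i].alive;
            cols.health[i] = monsters[i].health;
            cols.x[i] = monsters[i].x;
            cols.y[i] = monsters[i].y;
        }
        return cols;
    }

    public static void main(String[] args) {
        Monster[] aos = randomMonsters(1_000_000, 42L);
        MonsterColumns soa = toColumns(aos);
        System.out.println("alive (AoS): " + countAliveAos(aos));
        System.out.println("alive (SoA): " + countAliveSoa(soa));
    }
}
```

Adding a new field to `MonsterColumns` adds another array but leaves the `alive` scan untouched, which is the point made in the reply above. Adding a field to `Monster` grows every object the AoS scan has to visit.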
noelwelsh 2 hours ago

The JVM is currently pretty bad for memory allocation. Every object (i.e. not a primitive) has a header that IIRC is 12 bytes. But there is good news in JVM land: this will be reduced to 8 bytes in the next JVM release, and Project Valhalla will give the tools to do away with headers entirely in some cases. Project Valhalla also has tools to manage off-heap memory, which is important in many cases.
The JVM is an odd place where it requires too much heap to compete with the AOT compiled languages, but its startup time is too slow compared to interpreted languages. I think these enhancements are essential to keep the platform relevant.
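As a rough illustration of what a 12-byte vs. 8-byte header means per object, here is a back-of-the-envelope sketch in Java. It assumes the default 8-byte object alignment and ignores padding between fields, so it is a simplified model rather than the JVM's actual layout algorithm. `HeaderMath` and its methods are hypothetical names.

```java
final class HeaderMath {
    /** Rounds size up to the next multiple of alignment (alignment must be a power of two). */
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) & -alignment;
    }

    /** Simplified estimate: header plus field bytes, padded to 8-byte object alignment. */
    static long estimateObjectBytes(long headerBytes, long fieldBytes) {
        return alignUp(headerBytes + fieldBytes, 8);
    }

    public static void main(String[] args) {
        long count = 1_000_000;
        long twoLongs = 16; // hypothetical object with two long fields
        long oneInt = 4;    // hypothetical object with a single int field

        System.out.println("two longs, 12-byte header: "
                + count * estimateObjectBytes(12, twoLongs) + " bytes");
        System.out.println("two longs,  8-byte header: "
                + count * estimateObjectBytes(8, twoLongs) + " bytes");
        System.out.println("one int,   12-byte header: "
                + count * estimateObjectBytes(12, oneInt) + " bytes");
        System.out.println("one int,    8-byte header: "
                + count * estimateObjectBytes(8, oneInt) + " bytes");
    }
}
```

Under this model the smaller header only pays off when it lets the object drop below an alignment boundary: the two-long object shrinks from 32 to 24 bytes, while the single-int object stays at 16 bytes either way.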
- pron 2 hours ago
  
  > Every object (i.e. not a primitive) has a header that IIRC is 12 bytes. But there is good news in JVM land: this will be reduced to 8 bytes in the next JVM release
  Since JDK 25 it's already 64 bits with the `-XX:+UseCompactObjectHeaders` flag [1], but in JDK 27 it will be the default [2].
  > where it requires too much heap to compete with the AOT compiled languages
  Not to compete but to beat, and not too much, but the right amount. Low level languages are optimised for control, not performance (that control translates to better performance in smaller programs, and to worse performance in larger programs), and their particular constraints prevent them from enjoying certain important optimisations, especially those offered by JIT compilation and moving collectors, which remove some overheads that AOT compilers and free-list allocators incur. Their memory management is forced (by their constraints) to optimise for footprint rather than speed.
  There are common misunderstandings about memory management and why moving collectors were created to reduce the CPU overheads of malloc/free, especially in large programs, in exchange for what is effectively free RAM. This is why moving collectors are chosen by the languages that are unconstrained enough to use them and have the resources to implement them (Java, .NET, V8). With the exception of Zig (and even there it requires some effort), it's hard for low level languages to use the basic optimisation that's behind moving collectors. I gave a talk about how moving collectors optimise memory management at the last Java One, and it should be available on YouTube soonish [3].
  > but its startup time is too slow compared to interpreted languages
  That hasn't been the case for some time. You are right, though, that startup/warmup time is worse than in AOT compiled languages, and that is the tradeoff of optimising JITs: reduce the overheads associated with AOT compilation in large programs in exchange for warmup.
  Both startup and warmup have already been improved thanks to Project Leyden's "AOT cache" [4], but it will never be as low as C.
  In general, the tradeoff is between optimisations that help large programs vs optimisations that help small programs.
  [1]: https://openjdk.org/jeps/519
  [2]: https://openjdk.org/jeps/534
  [3]: I can't reproduce the full talk (which goes into the maths of memory management) here but what happened with moving collectors was that until very recently (open source low-latency moving collectors are newer than ChatGPT), they required pauses and so weren't suitable for programs requiring low latencies. As a result, many developers either forgot or never learnt just how incredibly efficient moving collectors are. But the key is that because accessing RAM by necessity requires CPU, using CPU effectively captures RAM even if it's not used by the program. Bringing the CPU and RAM usage into a good balance is more efficient than trying to minimise one or the other. This is also the reason why hardware (physical or virtual) is packaged within a very narrow band of RAM/core ratio.
  [4]: https://www.youtube.com/watch
  - AlotOfReading 30 minutes ago
    
    > In general, the tradeoff is between optimisations that help large programs vs optimisations that help small programs.
    Do you have concrete examples of large scale Java programs that are significantly more performant than comparable programs in native languages like C++? My understanding was that this dynamic hadn't fundamentally changed much since the 2010s, when Java was able to occasionally edge out a win in 1-2 benchmarks and would lose handily in others. My experience is that large scale Java programs remain a bit of a bear even after significant optimization effort (e.g. Bazel).
    There are of course plenty of optimizations the JVM does that aren't possible AOT, but that doesn't imply an automatic win at large scales, as Rust demonstrates.
  - pharrington 23 minutes ago
    
    Your Project Leyden "AOT cache" YouTube link is broken. Did you mean to link to https://www.youtube.com/watch?v=fiBNDT9r_4I?
- kakacik 2 hours ago
  
  Most real-world use of the Java platform has next to zero concerns like those. Some more niche use cases may benefit, which is good, but the overall success map isn't changing anytime soon. The reasons for its long-term success lie elsewhere.
  - re-thc a minute ago
    
    Not true. Lots of large Java deployments with millions to billions in cloud spend. The Java part of it isn’t 0.
    Memory isn’t free. CPU isn’t free.
  - FartyMcFarter 2 hours ago
    
    Android Java apps' memory consumption is definitely a relevant concern.
forinti 3 hours ago

So if you need speed, you just have to swallow your OO programmer's pride and put your data in arrays.
- jayd16 19 minutes ago
  
  If you have hot loops with millions of iterations at a time, structure your code accordingly. It's not anti-OO to choose the right data structure for the job.
- bob1029 an hour ago
  
  And avoid moving said data between physical threads as much as possible.
  Most of the bottlenecks I see are not due to the organization of data. Unnecessary communication of data is the #1 offender.
- theandrewbailey 2 hours ago
  
  Maybe someone can write an OO language where arrays of structs are automatically stored as structs of arrays.
  mild /s
  - fp64 2 hours ago
    
    Odin has some helpers for this. It was one of the more interesting features I found, though I never tried it. Not sure if you want to consider Odin OO, but see https://odin-lang.org/docs/overview/#soa-struct-arrays
  - tlb 2 hours ago
    
    There's a package to do this in Julia: https://juliaarrays.github.io/StructArrays.jl/stable/
  - Mizza 2 hours ago
    
    Are you talking about Zig's MultiArrayList?
    
    - alex7o 2 hours ago
      
      He is talking about Jai, the programming language from Jonathan Blow, which is quite cool, but there is no way to access it.
pron 2 hours ago

> The cost of each new field is rarely considered
Most developers, in Java and in most other languages, do not consider the cost of every field, but I can tell you that people who need micro-optimisations certainly do care, and in Java's standard library, layout is very much a concern (except, as always, you want to optimise what really matters; there's no point in optimising something that is unlikely to be a hot spot in a real program). Sometimes, though, you want to intentionally spread out the layout to avoid cache line sharing when concurrency is involved. You will find such examples in the standard library, too.
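A minimal sketch of what "spreading out the layout" can look like in application code, assuming 64-byte cache lines. `PaddedCounter` and `FalseSharingDemo` are hypothetical names, and the manual padding fields are a well-known workaround, not the mechanism the standard library itself uses. The JVM is free to reorder fields, so this layout is a hint and not a guarantee.

```java
/** Counter whose hot field is surrounded by unused longs (7 x 8 bytes on each side). */
final class PaddedCounter {
    // Padding before the hot field. Never read; it only takes up space.
    long p1, p2, p3, p4, p5, p6, p7;

    // The field that one thread writes constantly.
    volatile long value;

    // Padding after the hot field, to keep the next object's fields off this line.
    long q1, q2, q3, q4, q5, q6, q7;

    /** Safe only because each counter has exactly one writer thread in this demo. */
    void increment() {
        value++;
    }
}

final class FalseSharingDemo {
    /** Two threads, each hammering its own counter; returns both final values. */
    static long[] runTwoWriters(int iterations) {
        PaddedCounter a = new PaddedCounter();
        PaddedCounter b = new PaddedCounter();

        Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) a.increment();
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) b.increment();
        });

        t1.start();
        t2.start();
        try {
            // join() makes the writers' final values visible to this thread.
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return new long[] { a.value, b.value };
    }

    public static void main(String[] args) {
        long[] result = runTwoWriters(10_000_000);
        System.out.println("a = " + result[0] + ", b = " + result[1]);
    }
}
```

The padding trades memory for less contention: each counter grows by more than 100 bytes, which is the opposite of the "every byte matters" instinct and only makes sense for fields that different threads write at a high rate.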
ssiddharth 2 hours ago

Slight tangent, but every ms, μs, and ns counts too. We've gotten awfully carefree with response times and wasted compute cycles.
Luff 36 minutes ago

Yes we should end the hateful rhetoric of most and least significant bytes. Every Byte Matters.
- zabzonk 33 minutes ago
  
  We need an ending to byte-sizeism as well.
coldcity_again 3 hours ago

I love to see stuff like this. As an active Vectrex gamedev and PC/Amiga sizecoder, I strongly agree with the sentiment!
AxelWickman an hour ago

Cool read. The AoS vs SoA speaks for itself.
coolThingsFirst 2 hours ago

Why doesn’t the machine fill up the other cache lines as well? Why is it only 64 bytes and then a miss?
- spiffyk 33 minutes ago
  
  A cache line is simply the unit of data a CPU cache works with (generally 64 bytes, because someone somewhere has probably determined that that is the best line size for general use), much like there are units of data like bytes (8 bits nowadays, but there have been weird ones historically), pages (varies between hardware as well, and may be OS-configurable), etc.
  As TFA mentions, a CPU does some predictions about what cache lines to prefetch, e.g. when you do sequential reads. Moreover, the x86_64 instruction set provides a prefetch instruction through which you are able to give the CPU a hint "hey, I'm gonna be using this soon, prepare accordingly, pretty please".
  Still, the utility of prefetching is diminished if you only use a single byte from each cache line, because the mechanism generally depends on you doing other work while the next cache line is being fetched. So really the best case scenario is to take as much time as possible to work with what is already fetched, so that there is time for the next unit of data to be fetched in the meantime.
- masklinn an hour ago
  
  They will absolutely do that (prefetching, they can even eagerly load what’s on the other side of a pointer).
  However it requires additional hardware to recognize patterns which benefit from prefetching, and every time the CPU prefetches data which ends up not being used it has both burned energy and memory bandwidth, and evicted data from the cache which might be needed (cache pollution).
- Liquid_Fire an hour ago
  
  It might sometimes prefetch the surrounding lines as well, but ultimately cache space is limited, so there is a trade-off. Every time you fill a line, you are throwing away something else that was cached there previously, which you may need again in the near future.
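The "single byte from each cache line" case in the replies above can be put into numbers with a small calculation. This sketch assumes 64-byte lines and records laid out at a fixed stride starting from a line-aligned address. `CacheLineMath` and the stride values are hypothetical and only model the memory traffic, not the prefetcher.

```java
final class CacheLineMath {
    static final long LINE_BYTES = 64; // assumed cache line size

    /**
     * Counts distinct cache lines touched when reading the first readBytes bytes
     * of each of count records placed strideBytes apart, starting at address 0.
     * Assumes strideBytes >= 1 and readBytes >= 1.
     */
    static long linesTouched(long count, long strideBytes, long readBytes) {
        long lines = 0;
        long lastLine = -1; // highest line index already counted
        for (long i = 0; i < count; i++) {
            long start = i * strideBytes;
            long first = start / LINE_BYTES;
            long last = (start + readBytes - 1) / LINE_BYTES;
            if (last > lastLine) {
                // Count only the lines not seen by earlier records.
                lines += last - Math.max(first, lastLine + 1) + 1;
                lastLine = last;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        long monsters = 1_000_000;
        System.out.println("packed flags (stride 1):   " + linesTouched(monsters, 1, 1));
        System.out.println("32-byte records (stride 32): " + linesTouched(monsters, 32, 1));
        System.out.println("64-byte records (stride 64): " + linesTouched(monsters, 64, 1));
    }
}
```

With one flag per 64-byte record, every read lands on a fresh line, so the scan pulls in 64 times as many lines as the packed layout for the same 1M useful bytes. That is also why there is little other work to overlap with the next fetch.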
yas_hmaheshwari 2 hours ago

Off topic: I thought I'd be reading an article about the Iran war or some geopolitical news when I read fzakaria :-)
RickJWagner 2 hours ago

That’s a great read. I wish more people wrote like that.
- fdegmecic an hour ago
  
  CppCon 2014: Mike Acton "Data-Oriented Design and C++"
  Andrew Kelley: A Practical Guide to Applying Data Oriented Design (DoD)
  you should check these two talks out then.