Astro - Hacker News

25 comments

tosh 3 hours ago

> in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS
wild
- patmorgan23 2 hours ago
  
  Why are we comparing a programming language and a GPU? This is a category error. Programming languages do not perform any operations. They perform no FLOPs; they are the thing the FLOPs are performing.
  "The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e. not).
  - tosh 39 minutes ago
    
    the sentence is ambiguous because "Python" can mean python + a certain library and even a different Python implementation
    but I find it illuminating to compare what a certain piece of hardware can do in principle (what is possible) vs what I can "reach" as a programmer within a certain system/setup
    in this case NVIDIA A100 vs "Python" that does not reach an A100 (without the help of CUDA and PyTorch)
    another analogy:
    I find it useful to be able to compare the fastest known way to move a container from A to B using a certain vehicle (e.g. a truck) with how fast a person who cannot drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)
    I'm also interested in how much energy is needed, how much the hw costs and so on
    Often there are many ways to do things, comparing is a great starting point for learning more
    
    tosh 36 minutes ago
    
    related to the truck analogy: an advantage of the way slower Python approach is: it does not need a GPU
    that said: Python can get to more FLOPs by changing the representation: https://docs.python.org/3/library/array.html
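A minimal sketch of the "changing the representation" point, assuming CPython on a 64-bit platform. It compares the memory footprint of a list of boxed floats with `array.array("d", ...)` from the linked standard-library module. The helper names `list_bytes` and `array_bytes` are hypothetical, made up for this illustration.

```python
import sys
from array import array

N = 100_000
values = [float(i) for i in range(N)]

# A list stores pointers to separately allocated float objects (boxed).
boxed = values
# array("d") stores raw C doubles back to back (dense).
dense = array("d", values)

def list_bytes(xs):
    # Size of the list container plus every float object it points to.
    return sys.getsizeof(xs) + sum(sys.getsizeof(x) for x in xs)

def array_bytes(a):
    # Payload size only: bytes per element times number of elements.
    return a.itemsize * len(a)

boxed_bytes = list_bytes(boxed)
dense_bytes = array_bytes(dense)

print(f"boxed list : {boxed_bytes} bytes")
print(f"dense array: {dense_bytes} bytes")
print(f"ratio      : {boxed_bytes / dense_bytes:.1f}x")
```

Note that `array` changes how the numbers are stored, not how arithmetic on them executes: indexing an `array` still hands back a Python float object, so higher throughput typically needs a library that processes the dense buffer in native code.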
  - gchamonlive 28 minutes ago
    
    > Why are we comparing a programming language and a GPU.
    You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.
    When the author says it's millions of FLOPs faster on a GPU than in an interpreted programming language, they are not comparing the two directly but the algorithms that run on them; the tools used to implement/run the algorithms stand in for the algorithms themselves.
    It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of FLOPs slower than on the GPU". There is no category error there.
- itishappy 7 minutes ago
  
  Which, let's be honest, is probably still being orchestrated by Python somewhere.
  Python is 9.75 million times faster than Python.
  - giancarlostoro 3 minutes ago
    
    I was researching whether there was much benefit to using Rust or C++ over Python for AI, and it turns out the GPU doesn't care once the instructions are in, because it's an entirely different spec running on the GPU. The only thing you might save on is the "startup" cost of getting your code into the GPU, I guess? I assume that time cost is minuscule though; once it's all in memory, nobody cares that you spent any time "booting it up" any more than how long Windows takes these days.
- tosh an hour ago
  
  re comments:
  yes of course this is apples to oranges but that's kind of the point
  it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU
  the interesting thing is why that is so
  CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
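A rough sketch of how one might measure the interpreter side of this gap, assuming CPython and a plain scalar loop with no numeric library. The function `measure_python_flops` is a hypothetical helper, the 19.5 TFLOP/s figure is the A100 FP32 peak quoted elsewhere in this thread, and treating the 9.75 million ratio as relative to that FP32 peak is an assumption, not something the quote states.

```python
import time

A100_FP32_PEAK = 19.5e12   # FLOP/s, FP32 peak as quoted in this thread
QUOTED_RATIO = 9.75e6      # "9.75 million" from the quoted sentence

def measure_python_flops(n=200_000):
    """Time a scalar multiply-add loop and return the estimated FLOP/s."""
    a, b, acc = 1.000001, 0.999999, 0.0
    start = time.perf_counter()
    for _ in range(n):
        acc = acc * a + b          # one multiply and one add per iteration
    elapsed = time.perf_counter() - start
    return 2 * n / elapsed, acc

python_flops, _ = measure_python_flops()
measured_ratio = A100_FP32_PEAK / python_flops

# Python rate implied by the quote, if the ratio is taken against FP32 peak.
implied_python_flops = A100_FP32_PEAK / QUOTED_RATIO

print(f"measured pure-Python rate: {python_flops:.3g} FLOP/s")
print(f"A100 FP32 peak / measured: {measured_ratio:.3g}x")
print(f"rate implied by the quote: {implied_python_flops:.3g} FLOP/s")
```

The measured number depends heavily on the machine and Python version, so it is only a ballpark; the loop also pays for boxing, bytecode dispatch, and scalar execution all at once, without separating them.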
  - p1esk 32 minutes ago
    
    A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.
    AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
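The arithmetic behind the CPU figure can be checked with a short sketch. All inputs (192 cores, 64 FP32 FLOP per cycle per core, 3.35 GHz, and the A100's 19.5 TFLOP/s) are taken from the comment above; `peak_flops` is a hypothetical helper, and the result is a theoretical peak, not sustained throughput.

```python
def peak_flops(cores, flop_per_cycle_per_core, clock_hz):
    """Theoretical peak: every core retires its maximum FLOPs every cycle."""
    return cores * flop_per_cycle_per_core * clock_hz

# Figures as quoted in the comment.
epyc_fp32 = peak_flops(cores=192, flop_per_cycle_per_core=64, clock_hz=3.35e9)
a100_fp32 = 19.5e12

print(f"EPYC 9965 FP32 peak: {epyc_fp32 / 1e12:.1f} TFLOP/s")
print(f"A100 FP32 peak     : {a100_fp32 / 1e12:.1f} TFLOP/s")
print(f"CPU / GPU          : {epyc_fp32 / a100_fp32:.2f}x")
```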
    
    tosh 19 minutes ago
    
    A100: 312 TFLOP/s for FP16
    but it is very impressive how far modern CPUs get as well (also in smart phones!)
- p1esk 2 hours ago
  
  This statement makes zero sense
- xyzsparetimexyz 3 hours ago
  
  Single core vs multi core accounts for much of this
  - cdavid 2 hours ago
    
    Not really. The GPU's many cores, at least for fp32, give you 2 to 4 orders of magnitude compared to a high-speed CPU.
    The rest comes from going from "python float" (e.g. not from numpy) to C, which already gives you a 2 to 3 orders of magnitude difference, and then another 2 to 3 from plain C to optimized SIMD.
    See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.
    
    p1esk 27 minutes ago
    
    Theoretical FP32 performance of AMD EPYC 9965 is double that of A100: 41.2 TFLOP/s vs 19.5 TFLOP/s
jdw64 2 hours ago

Right now, all I know how to do is pull models from Hugging Face, but someday I want to build my own small LLM from scratch
- max-amb 2 hours ago
  
  If you want a written resource, I have a blog post about the mathematics behind building a feed-forward network from scratch: https://max-amb.github.io/blog/the_maths_behind_the_mlp/. It mostly focuses on the translation from individual components to matrix operations.
- kflansburg 2 hours ago
  
  If you aren't already aware, Karpathy has several videos that could get you there in a few hours https://www.youtube.com/@AndrejKarpathy
  - jdw64 2 hours ago
    
    very thanks!
- glouwbug 2 hours ago
  
  It’s just linear algebra. Work your way from feed forward to CNN to RNN to LSTM to attention, then maybe a small inference engine. Karpathy’s llama2.c is only ~300 lines for the latter, and it uses pragma simd so you don’t need fancy GPUs.
noosphr 3 hours ago

>For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.
https://arxiv.org/abs/1912.02292
- appplication 3 hours ago
  
  Generally, posting a link-only reply without further elaboration comes across as a bit rude. Are you providing support for the above point? Refuting it? You felt compelled to comment; a few words to indicate what you’re actually trying to say would go a long way.
  - noosphr 2 hours ago
    
    >We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.
    
    smallerize 11 minutes ago
    
    Does this mean that if your model is "overfitting", the solution is to train for even more epochs?
    
    ForceBru an hour ago
    
    Right, isn't double descent one of the reasons why modern Extremely Large Language Models work at all? I think I heard somewhere that basically all today's "smart" (reasoning, solving math problems, etc) LLMs are trained in the "double descent" territory (whatever this means, I'm not entirely sure).
    
    SiempreViernes an hour ago
    
    No, double descent is a symptom of whatever it is that makes the deep models work at all. It's just the name for something you see happen when it works. The reason it works has something to do with how all those extra dimensions work as a regularisation term in the fit.