I thought the point of something like Strix Halo was to avoid ROCm altogether? AMD's strategy seems to have been to unify GPU/CPU memory and then let people write their own libraries.
The industry looks like it has started to move toward Vulkan. If AMD cards have figured out how to reliably run compute shaders without locking up (never a given in my experience, though that was some time ago), then there shouldn't be a reason to use specialty APIs or software written by AMD outside of drivers.
ROCm was always a bit problematic, but the issue was this: if AMD cards weren't good enough for AMD engineers to reliably support tensor multiplication, there was no way anyone else was going to be able to do it. It isn't as though anyone is confused about multiplying matrices together; it isn't for everyone, but the naive algorithm is a core undergrad topic, and the advanced algorithms surely aren't that crazy to implement. It was never a library problem.
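For what it's worth, the naive algorithm really is that simple. A minimal Python sketch, purely illustrative (a tuned GPU kernel is a very different beast, but the math is the same):

```python
# Naive O(n^3) triple-loop matrix multiply: the core undergrad version.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    assert len(A[0]) == k, "inner dimensions must match"
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```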
I would be interested to know what speeds you can get from gemma4 26b + 31b on this machine, and also how ROCm compares to Triton.
Nice. Thanks for the writeup. My Strix Halo machine is arriving next week. This is handy and helpful.
Owning the GGUF conversion step is good in some circumstances, but running in fp16 is suboptimal for this hardware due to its low-ish bandwidth.
It looks like context is set to 32k, which is the bare minimum needed for OpenCode with its ~10k initial system prompt. So overall, something like Unsloth's UD q8 XL or q6 XL quants would free up a lot of memory and bandwidth, moving into the next tier of usefulness.
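To make the memory argument concrete, here's a back-of-envelope sketch in Python. The 30B parameter count and bits-per-weight figures are illustrative assumptions, not measurements of any particular model or quant:

```python
# Rough weight-memory math for why quantization matters on unified-memory boxes.
# All figures below are illustrative assumptions, not measured values.

def model_bytes(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * bits_per_weight / 8  # 1e9 params * (bits / 8) bytes ~= GB

# Hypothetical 30B-parameter model:
fp16 = model_bytes(30, 16.0)  # 60 GB
q8   = model_bytes(30, 8.5)   # ~32 GB (q8-style quants use a bit over 8 bits/weight)
q6   = model_bytes(30, 6.6)   # ~25 GB

print(f"fp16: {fp16:.0f} GB, q8: {q8:.1f} GB, q6: {q6:.1f} GB")
```

The gap between fp16 and a q6/q8 quant is memory you can spend on a larger context window instead.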
Perfect. No fluff, just the minimum needed to get things working.
Thanks for sharing. However, this falls short of being a good writeup due to the lack of numbers and data.
I'll give a specific example in my feedback. You said:
```
so far, so good, I was able to play with PyTorch and run Qwen3.6 on llama.cpp with a large context window
```
But there are no numbers, results, or output pastes; no performance figures or timings.
Anyone with enough RAM can run these models; it will just be impracticably slow. The Strix Halo is there for decent performance, so sharing numbers would be valuable here.
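As a rough sanity check on what numbers to even expect: token generation is approximately memory-bandwidth bound, so you can sketch a speed ceiling as bandwidth divided by bytes read per token (roughly the model's weight size). The ~256 GB/s bandwidth figure and the model sizes below are assumptions for illustration, not benchmarks:

```python
# Back-of-envelope decode-speed ceiling for a memory-bandwidth-bound workload.
# Bandwidth and model sizes are illustrative assumptions, not measurements.

def tok_per_s_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tokens/sec if every token reads all weights once."""
    return bandwidth_gbs / model_gb

bw = 256.0  # GB/s, approximate Strix Halo LPDDR5X figure (assumption)
for name, size_gb in [("fp16 ~60 GB", 60.0), ("q8 ~32 GB", 32.0), ("q6 ~25 GB", 25.0)]:
    print(f"{name}: <= {tok_per_s_ceiling(bw, size_gb):.1f} tok/s")
```

Real throughput lands below these ceilings, which is exactly why measured numbers from the actual hardware matter.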
Do you mind sharing these? Thanks!
This is more of a “succeeding to get anywhere close to messing around” rather than “it works so now I can run some benchmarks” type of article.
To give the benefit of the doubt, the author does state multiple times (including in the title) that these are "first impressions". Still, perhaps they should have mentioned something like "...in the next post, we'll explore performance and numbers" to avoid a cliffhanger, or labeled this part 1 (assuming the intention was to follow up with a part 2).