I'm a Norwegian, and I use the national library almost every day for searching through texts. They have truly one of the best working user interfaces (and functionality) for searching through the massive amounts of text.
> The Olivia system is an HPE Cray Supercomputing EX system, with 448 GPUs and 64,512 CPU cores.
Training a sovereign LLM with this meager hardware as opposed to a LoRA on some open source model seems like a huge mistake and a potential red flag.
There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.
Which begs the question, whose money are they wasting - and why?
They have successfully made PoC finetunes before, so the next step is training fully fledged LLMs.
I don’t think they aim for anything worthwhile. The finetunes were incredibly broken. I’m guessing it’s more about having the method to do it. I’m not convinced it’s super useful but I’m not one to decide who gets to do what with the research funds.
One finetune I tried did make fun of humans expressing their feelings in the chat. Often.
One other finetune did hallucinate that it was a doctor and my baby had terrible diseases, every time I just wrote "hei" (with a generic neutral system prompt that likely triggered this behaviour though).
I think Olivia is big enough for what it’s used for. In my opinion it’s better to stay up to date and not waste too much money on hardware at the moment.
It may not be useful to anyone outside, but it's possible that one of the goals is institutional learning (that is, embedding the knowledge in how to build LLMs in an organization).
Even though it's nominally the national library behind this, they were probably chosen (as per the article) because they legally own and can use all NO material for this end. I'd guess researchers from related entities like unis will be involved in the process.
> this meager hardware
> they wasting - and why?
i18n language models are not an area frontier labs are focusing a ton of resources on? (certainly not in Norwegian)
The corpus of content in Norwegian may not require very large clusters, and even if it does, this is the best the library could do; it would certainly be more than anyone else is investing in Norwegian models.
SOTA models do not have the access to the quality of content that the national library does? The article mentions licensing with newspapers specifically, and the library has access to its own content archive.
English and Norwegian are not closely related language families, perhaps LoRA is not the best approach?
I am curious if there is published research on how well localization works with LoRA depending on how far off the target language grammar/vocabulary is from English.
The largest problem is available training data actually.
They have already done experiments with different sub-10B models, both fine-tuned and trained fully from scratch. And last I checked, the ones trained fully from scratch captured the language in a better way.
DeepSeek claims to have trained on something like 2k H800, this is ~0.5k GH200 … it’s not nothing. Sure they’re not going to _serve_ it at scale, but that’s not the point?
Also the line between “finetuning a base model” and “man this is a real good initialization” gets pretty blurry at scale.
Altogether a pretty presumptuous take.
That's what they have access to right now. I am sure that will change in the future as the project progresses.
What do you suggest, that they stop and wait until they have the right HW?
> meager hardware
Qwen was made on a cluster about that size.
And this is before anybody ever thought about optimizing the training process. (Currently it's just pytorch analyst-as-coder slop, with extremely overprovisioned quantizations, etc.)
I wonder if instead (or in parallel), Norway should build a set of training data and share it (for free) with all the model builders.
Seems like making the frontier models know Norwegian and their culture is a better (or additional!) way to reach the end they are going for here.
The frontier models know Norwegian just fine. They can also adapt to Norwegian dialects, and even ape old Norwegian fairly well.
E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.
What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels of its time, before I'd get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).
While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer that took his stage name after the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the author's political views and how they coloured his novels), so it does feel like a reasonable test that reveals a quite significant knowledge gap.
I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives that they have limitations on the use of.
But even just making the out-of-copyright data in their collections accessible would be a great start.
As a Norwegian this sounds like a mistake. Who will use this LLM? Where? For what? The underlying data could be made more easily searchable and digestible for agents in general if the goal is better knowledge of Norwegian culture.
I agree in principle.
That said, they are quite limited in what they are allowed to share of in-copyright works, and nb.no is a fantastic resource as it is (though you'll need a Norwegian IP address for too much of it - it's one of the main reasons I maintain a VPN) - if they are allowed to make it accessible there, it'd be great.
But they also have vast amounts of out-of-copyright data that I hope they'd make more easily accessible...
Hard disagree. This is the first step not the last and proves to other countries that this can be done.
Exactly, if there's one thing transformers are good at it's translation. One I've found particularly nice: any question ChatGPT can answer in English it can answer in French. I'm assuming Norwegian too. So there's no point.
Yes transformers are great at translation as that is their purpose.
LLMs are not great at preserving cultural uniqueness and diversity. Take how “delve” has reentered the lexicon because the dialect of English used by the human assessors for pre-training uses “delve” a lot.
There are a lot of benefits to training specifically for a unique culture with unique norms to preserve the culture as we increasingly rely on LLMs.
https://www.scientificamerican.com/article/chatgpt-is-changi...
There's quite a bit more to culture and language than just being able to have transformers come up with believable language and/or dialect.
The point is that Norway will have its own LLM, and will not have dependencies on another state or private company. The goal is not to be the best model, but to have a model that includes more Norwegian data than other LLMs and that is not skewed toward other sources.
Model can speak Lithuanian too, but with a Russian accent which is a big taboo for us.
They're only good at it because they were trained on massive amounts of English and French data.
>As Husnes put it; Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history? AI needs custodians, not just builders.
I'm afraid the answer is, mostly you don't.
Such a thing requires strong political will that, at least in my environment, seems basically impossible to align.
The costs are prohibitive, but beyond that, the type of person who cares about local representation like that is either completely fine with letting foreign companies implement it (after all, you can use ChatGPT in Basque if you want to) or is against the idea of AI altogether.
I'm sure if Norway approached the American labs with the goal of making curated datasets for training, they would absolutely get in the training door, and those models would likely run circles around anything that could be domestically done.
That being said though, I can feel you cringing through the screen.
This can’t be right. 2 PB of flash is like $200k. It’s within reach of many individuals. Then again I guess you don’t need that much storage so maybe it is.
Your numbers are a little off but the point remains- 2PB is nothing, not newsworthy imo. What’s special about this?
What's special about it is not the flash but training an LLM based on the content, much of which is still in copyright and which the library has restrictions on how they are allowed to use (irrespective of the legal position of training on it) and which required an agreement with the copyright holders.
More like $1M at current prices at this scale / level of performance.
If you go with HDD arrays probably $50k
Boy pricing is pretty nuts these days. I have half a petabyte in Seagate enterprise drives myself and I didn’t pay anything close to that to acquire it. Such a pity about the flash storage. 2 years ago we built 200 TiB or something of flash using Samsung PM1633 or something and it was a fraction of the cost per gigabyte that $1m would imply.
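For comparison, here is a rough sketch of what the three price guesses in this subthread imply per gigabyte. The dollar figures are the ones quoted above; the decimal conversion (1 PB = 10^6 GB) and the labels are my own assumptions, and none of this is a real quote from a vendor.

```python
# Rough cost-per-GB comparison for 2 PB, using the figures quoted in this thread.
# Assumption: decimal units, 1 PB = 10**6 GB.

CAPACITY_PB = 2


def usd_per_gb(total_usd, capacity_pb):
    """Return the implied price per gigabyte for a given total price."""
    return total_usd / (capacity_pb * 10**6)


# Labels are illustrative; the dollar amounts come from the comments above.
estimates = {
    "flash, low guess": 200_000,
    "flash, current prices": 1_000_000,
    "HDD arrays": 50_000,
}

per_gb = {label: usd_per_gb(usd, CAPACITY_PB) for label, usd in estimates.items()}

for label, price in per_gb.items():
    print(f"{label}: ${price:.3f} per GB")
```

So the disagreement is between roughly 10 cents and 50 cents per GB for flash, with spinning disks at a few cents.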
That's about 350MB per capita. Humans can produce 2-6kb per hour. That's 13 years of non-stop typing. Wonder where it all comes from. I guess it's websites that aren't compressed / extracted.
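The per-capita arithmetic above can be checked with a quick sketch. The population figure of about 5.6 million is my assumption (it is not stated in the thread), and the 3 kB/hour rate is just a point inside the quoted 2-6 kB range.

```python
# Back-of-the-envelope check of the "350MB per capita" estimate.
# Assumption: Norway has roughly 5.6 million inhabitants (not from the thread).
POPULATION = 5_600_000
FLASH_BYTES = 2 * 10**15  # 2 PB in decimal units


def per_capita_bytes(total_bytes, population):
    """Bytes of storage per inhabitant."""
    return total_bytes / population


def typing_years(num_bytes, bytes_per_hour):
    """Years of non-stop typing needed to produce num_bytes."""
    hours = num_bytes / bytes_per_hour
    return hours / (24 * 365)


share = per_capita_bytes(FLASH_BYTES, POPULATION)
print(f"{share / 1e6:.0f} MB per capita")

# Typing rates in bytes per hour; 2 kB and 6 kB are the bounds quoted above.
for rate in (2_000, 3_000, 6_000):
    print(f"{rate} B/h -> {typing_years(share, rate):.1f} years")
```

Under these assumptions the range is roughly 7 to 20 years, so "13 years" corresponds to typing at about 3 kB per hour.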
How about that, they actually asked for permission to use data and the companies said yes.
This is how much storage the average r/datahoarder user has in their basement. Fewer than 100 hard drives.
But not in flash. I have an appreciable fraction of that but in spinning rust.
> He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.
I don’t know if this is true. But whatever sounds true enough and gets funding seems to be what flies these days.
They made the cultural case, you have no idea how strong this is in places like Quebec, the Nordics, France, Russia etc.
Can confirm that. Norway may have a small population, but if you live there you'll think it's truly the center of the world (aside from the US. Norwegians love America)
Ehhh. None of this sounds right. Translation problems maybe. Lack of technical detail understanding maybe... I don't know. Probably not news.
384 core cpu cluster? 2 petabytes?
Dell just launched a 2U that fits almost 10 petabytes in it. It's probably not 384 core capable but that is very doable right now, Epyc chips are 192 cores each! https://www.techradar.com/pro/dell-launches-record-shatterin...
5x 400gbit running to a 2U box whoa, the PCI lanes must have heat shielding.
More seriously there is a sensibility limit on extreme density where it's not needed. The idea that you're just going to magically get 2 TBit/s out of those ports seems unlikely even with tweaked software, and you're stuck with a power and comms hotspot that's liable to dictate the remainder of your network design.
At max utilisation that 2U would take 12 hours to drain, and only 12 hours assuming peak and likely unachievable throughput and the box otherwise being completely out of service. Not a great start
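A quick sketch of the drain-time estimate, assuming the box holds a full 10 PB ("almost 10 petabytes" rounded up) and all five 400 Gbit/s ports run at line rate with no protocol overhead. Both are idealisations on my part.

```python
# Time to read out a full 2U box over its network ports at line rate.
# Assumptions: 10 PB capacity (decimal), 5 ports at 400 Gbit/s, zero overhead.
CAPACITY_BYTES = 10 * 10**15
PORTS = 5
PORT_GBIT_PER_S = 400


def drain_hours(capacity_bytes, ports, gbit_per_port):
    """Hours needed to transfer capacity_bytes over the aggregate link."""
    total_bits = capacity_bytes * 8
    bits_per_second = ports * gbit_per_port * 10**9
    return total_bits / bits_per_second / 3600


hours = drain_hours(CAPACITY_BYTES, PORTS, PORT_GBIT_PER_S)
print(f"{hours:.1f} hours at an aggregate {PORTS * PORT_GBIT_PER_S} Gbit/s")
```

That gives a bit over 11 hours in the ideal case, in line with the rough 12 hour figure above; any realistic throughput makes it longer.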
That's the in-house preprocessing hardware, not what they're training on.
2 PB? They will not come close to training on that amount. Maybe years from now.
I think they will not train on the full 2PB but use that as the data lake to start and then apply a more targeted approach.
If you read the article, 2PB is available as flash storage in the data pipeline, used to dedupe, clean, normalize, etc., for training from 60PB of raw data.
Could probably LoRA with that
Ad for Huawei?