It is a weirdly common misconception that that fork() is cheap... it is O(N) on the size of the process, and it always has been.
Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.
> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.
> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.
This paper is great and I also really like one of its references [29] as it goes into some more subtle parts of scalable interfaces, including fork. It's a gem IMO: The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors https://people.csail.mit.edu/nickolai/papers/clements-sc.pdf
"Our pool has a high concentration of piss in the water, which is not healthy. I think we should take efforts to remove piss from the water."
"Why? The pool over there has radioactive sludge in the water. Their pool is less healthy than ours despite not having piss in the water, so there is clearly no health benefit to removing piss from the water."
That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.
> fork() is a relatively expensive system call; it must copy the entire process state (including memory) for the child process. Many optimizations have been made over the years, but a fork is still a fundamentally costly operation. To make things worse, a fork() call is often immediately followed by an exec(), which will discard all of that memory that was so carefully copied for the child.
It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.
This was left implicit in the article, but what they mean by copying the process state here is the memory management structures. That's mainly the page tables and the VMAs.
That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory pointed by the pages is shared. And walking all those structures to make a copy is still costly.
Even with copy-on-write, fork() still has to pay the setup cost for COW. If the parent process has a lot of busy threads (e.g. Java), you can end up doing a lot of unnecessary COW before exec() fires.
It says state. Copy on write still means it's O(number of page table entries) even if you don't copy the contents. It's a well known issue that forking a program with large virtual memory size is slow.
I just ran into this recently, where I had an obscure bug caused by needing to close more file descriptors in the forked process. "I want a clone of the current process" is just way less common in my experience than "I want a completely new process". It feels crazy that we don't have a way to directly express the latter thing, and can only approximate it by cloning and then fixing things up in post.
A thing that makes that complicated is that while you want that conceptually, you don't want that in reality. For instance, if the spawning process is in a container of some sort and it spawned a process that "shares nothing with the process that spawned it", the spawned process would no longer be in that container, because the state of "being in the container" is one of the things it shares with the parent process.
This is just an example of I don't even know how many things a modern-day process will share from its parent.
By "complicated" I do not even remotely mean "unsolvable". I just mean that if you really dig down into what it means to "share nothing" in a modern operating system, it's a lot richer than it was back when fork+exec was a practical solution. There's a lot of fuzzy things that could go either way when you say "shares nothing".
But you generally want to communicate with that process, so you do need to setup e.g. file descriptors and stuff, which needs information from the parent process to be passed.
Most programming languages abstract this out to be able to connect or drop the 3 standard pipes. Typically this is the only thing that can be shared anyway unless the other program is specifically shared and expects other file handles to be available, in which case fork might be the right system call anyway.
There's a bunch of nastiness around that too. If you have e.g. library state that assumes the fd still works you can get her very confusing bugs once another file is opened into that fd number...
The elegance of the fork() + exec() model is that every kind of configuration can be done after the fork using all the usual APIs. Every attempt to replace it with a combined call that I have seen so far seemed fundamentally poorer because it needs to add all configuration options as parameters to the call and then do this in away that you can extend it later and does not become a mess.
I have the entirely opposite opinion. IMO a big mistake of the UNIXy model is that so much state is preserved across the creation of a process. For example, there are APIs to have a specific thing be fd number 4 so you can run a program and have it find that thing at fd 4. This is weird.
Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.
Is it weirder, that you can pass an variable precisely into argument 4? You do need to pass information to a subprocess and there needs to be some agreement on what means what. Sure, maybe you could use names instead of fds, but that sounds needlessly complicated.
That’s like saying you could use positions to specify function argument access (as in assembly) instead of variable names. File descriptors being numbers that are likely array indexes in a file handle seems like a leaky abstraction. Having a namespace that a parent process share with its children seems like a much cleaner design.
A way to pass a defined list of handles to a subprocess (or a friendly other process) makes sense. Having that mechanism be direct inheritance of those handles with the same numbering as the source is obnoxious.
You're simply failing to grasp the value of the simplicity, compatibility, and portability of POSIX/*nix. Inventing yet another way to create a process would be complex and break things. It's a-la-carte to enable configuration after fork of the new CoW or non-CoW process but before exec (unless vfork or similar were used). This is the model.
If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.
That's mostly papering over design mistake that most syscalls doesn't accept target pid. Otherwise you could just create suspended process, configure it with syscalls that explicitly take target pid, and start it.
Yeah. The right way to eliminate fork() is to make the usual APIs that modify process state take an explicit process handle, so the same APIs can be used to set up an empty process. They can also be composed in other ways, eg for IPC or debugging.
I'm not surprised Chen's patch was rejected; that's an extremely niche usecase not worth supporting. With my shell developer hat on, I agree with the closing "developers would likely welcome a native implementation that isn't (unlike the current implementation) hiding fork() and exec() under the covers".
There are a lot of slightly different fork-exec-like things in the concept space and it's hard to imagine one approach satisfying them all. IMO it would be interesting to take an approach analogous-ish to sched_ext_ops where you built the rough flow chart of a combined fork-exec, but with hooks built to enable ebpf to change behavior or skip the bits these sophisticated users don't want/need.
Fork/exec is great if you actually want the traditional copy of your process for some reason.
For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.
Especially since I suspect that is by far the more common case. I suspect “I want a clone of me“ is relatively rarely used at this point.
This seems unnecessary to me. In the example, the core of git should be a library yo can link so you don't need to run the binary. That would be better in every way.
But when you use a process, you get tons of things for free, the subtask is invoked in parallel, you get isolation and you can control execution for free. Unless you are already writing a multithreaded program or already accept passing objects in memory, using a process is actually easier to write than using a library.
If I use a library, I also need to start using threads and need to invent some core synchronization mechanism. I essentially are reinventing a small scheduler, when I already get this from the OS for free. Also know any crash in the third-party code will crash the whole program, the third-party code has access to the whole address space. With invoking a process you also have a standardized API implemented by the OS.
I'm guessing that a big part of the problem with moving away from fork() in general is that each new process needs a copy of the parent process' environment anyway, right?
The LWN article is incorrect in saying that it "must copy the entire process state (including memory) for the child process". There are some kernel structures and page tables that need to be initialized, plus you need a new stack, but it's not nearly as dramatic as implied. Most of the parent's memory is "incorporated by reference", so to speak.
In fact, if you profile it, in the fork() + execve() model, execve() is far more expensive, because not only does it replace the old process with a new one, but it also involves running the dynamic linker, which opens, parses, and mmaps library files.
It still makes sense to get rid of the fork() overhead if you're going to throw away the cloned process state soon thereafter, but if you wanted to make process execution radically faster, rethinking the exec architecture would probably offer more significant gains.
Fork becomes more and more expensive the higher the RSS of the process, roughly 1ms per 1Gb of the process size with 4kb pages. Given that modern servers can easily support 1-2Tb of RAM the fork() part can easily take several hundred milliseconds, blocking everything in the meantime. So for larger programs you kinda have to have a "fork helper" process if you need to execute external programs for some reason.
The kernel does not copy every page, but it does have to copy all of the VMAs. Setting memory to COW (which can involve changing a lot of page-table-entries) is not free either. I guess I could have mentioned copy-on-write explicitly, but I do not believe that what I wrote was incorrect.
I'm a bit naive, but I don't think that's necessarily a requirement.
It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.
Unless I've missed something (which is totally possible, this is not an area of OS design I've spent much time).
But also UID, groups, controlling TTY, process group, capabilities, pipes, shared memory, etc. and the file descriptors while maybe not inherently needed are how a lot of Unix plumbing works.
A lot of times you actively don't want the parent environment or any of the memory or file descriptors. And then you have to actively do work to fix all that stuff up after the fork.
Maybe tangentially related but I always think it's silly that every linux process has the same libgcc_so.so.1 loaded into memory for each process even though the raw binary for the library is exactly the same so you end up with like 800 copies of libgcc_so.so.1 in memory.
I mean maybe this has been optimized for already and I don't know what I'm talking about but maybe someone with more knowledge about the kernel knows? Is this something we simply can't optimize for because of security implications?
Shared libraries (and mmapped files in general) are deduplicated; it's nowhere near as bad as you think. The kernel loads a .so into memory once and then maps that memory into every process that mmaps it.
Editing to add: this deduplication is one of the greatest upsides to dynamic linking. Common libs like libgcc and libc only have to exist in memory once and can stay in CPU caches, whereas if they were statically linked into every binary, each binary would have a copy of that library that wouldn't be shared with anything else and you'd waste a lot of memory.
Typically libgcc_so.so is loaded by the linker, which uses an mmap call to map the binary into the address space.
> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.
In Linux, when a shared lib is loaded by multiple processes, its loaded once and not duplicated in ram. Only if a memory page is modified by the process will the memory be duplicated. (Hope I have explained that correctly)
Those mappings by default all go to the same shared memory.
Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.
I have a rule for myself. If I think something is silly or stupid, I assume I don't understand it. I usually find I do not understand it, and it no longer seems silly when I do understand it.
In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.
It is a weirdly common misconception that that fork() is cheap... it is O(N) on the size of the process, and it always has been.
Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.
Related to the discussion: "A fork() in the road": https://www.microsoft.com/en-us/research/wp-content/uploads/...
> ABSTRACT
> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.
> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.
This paper is great and I also really like one of its references [29] as it goes into some more subtle parts of scalable interfaces, including fork. It's a gem IMO: The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors https://people.csail.mit.edu/nickolai/papers/clements-sc.pdf
It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...
I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.
This has literally no relation to anything.
"Our pool has a high concentration of piss in the water, which is not healthy. I think we should take efforts to remove piss from the water."
"Why? The pool over there has radioactive sludge in the water. Their pool is less healthy than ours despite not having piss in the water, so there is clearly no health benefit to removing piss from the water."
That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.
Because that OS best practices is to use threads.
Traditionally Windows applications that create processes all the time come from UNIX heritage.
Contrary to UNIX, Windows NT was designed with threads first mentality from the get go.
Discussion at the time:
https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)
Fork is marvelous for the zygote pattern
Hard to come up with an optimization that is equally efficient and elegant
> fork() is a relatively expensive system call; it must copy the entire process state (including memory) for the child process. Many optimizations have been made over the years, but a fork is still a fundamentally costly operation. To make things worse, a fork() call is often immediately followed by an exec(), which will discard all of that memory that was so carefully copied for the child.
It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.
This was left implicit in the article, but what they mean by copying the process state here is the memory management structures. That's mainly the page tables and the VMAs.
That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory pointed by the pages is shared. And walking all those structures to make a copy is still costly.
Even with copy-on-write, fork() still has to pay the setup cost for COW. If the parent process has a lot of busy threads (e.g. Java), you can end up doing a lot of unnecessary COW before exec() fires.
It says state. Copy on write still means it's O(number of page table entries) even if you don't copy the contents. It's a well known issue that forking a program with large virtual memory size is slow.
I just ran into this recently, where I had an obscure bug caused by needing to close more file descriptors in the forked process. "I want a clone of the current process" is just way less common in my experience than "I want a completely new process". It feels crazy that we don't have a way to directly express the latter thing, and can only approximate it by cloning and then fixing things up in post.
What do you mean by "a completely new process"?
The equivalent of CreateProcessW https://learn.microsoft.com/en-us/windows/win32/api/processt...
A process that shares nothing with the process that spawned it.
A thing that makes that complicated is that while you want that conceptually, you don't want that in reality. For instance, if the spawning process is in a container of some sort and it spawned a process that "shares nothing with the process that spawned it", the spawned process would no longer be in that container, because the state of "being in the container" is one of the things it shares with the parent process.
This is just an example of I don't even know how many things a modern-day process will share from its parent.
By "complicated" I do not even remotely mean "unsolvable". I just mean that if you really dig down into what it means to "share nothing" in a modern operating system, it's a lot richer than it was back when fork+exec was a practical solution. There's a lot of fuzzy things that could go either way when you say "shares nothing".
That’s how you get zombie processes and memory leaks.
But you generally want to communicate with that process, so you do need to setup e.g. file descriptors and stuff, which needs information from the parent process to be passed.
Most programming languages abstract this out to be able to connect or drop the 3 standard pipes. Typically this is the only thing that can be shared anyway unless the other program is specifically shared and expects other file handles to be available, in which case fork might be the right system call anyway.
Isn't that covered by O_CLOEXEC?
There's a bunch of nastiness around that too. If you have e.g. library state that assumes the fd still works you can get her very confusing bugs once another file is opened into that fd number...
The elegance of the fork() + exec() model is that every kind of configuration can be done after the fork using all the usual APIs. Every attempt to replace it with a combined call that I have seen so far seemed fundamentally poorer because it needs to add all configuration options as parameters to the call and then do this in away that you can extend it later and does not become a mess.
I have the entirely opposite opinion. IMO a big mistake of the UNIXy model is that so much state is preserved across the creation of a process. For example, there are APIs to have a specific thing be fd number 4 so you can run a program and have it find that thing at fd 4. This is weird.
Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.
Is it weirder, that you can pass an variable precisely into argument 4? You do need to pass information to a subprocess and there needs to be some agreement on what means what. Sure, maybe you could use names instead of fds, but that sounds needlessly complicated.
That’s like saying you could use positions to specify function argument access (as in assembly) instead of variable names. File descriptors being numbers that are likely array indexes in a file handle seems like a leaky abstraction. Having a namespace that a parent process share with its children seems like a much cleaner design.
A way to pass a defined list of handles to a subprocess (or a friendly other process) makes sense. Having that mechanism be direct inheritance of those handles with the same numbering as the source is obnoxious.
You're simply failing to grasp the value of the simplicity, compatibility, and portability of POSIX/*nix. Inventing yet another way to create a process would be complex and break things. It's a-la-carte to enable configuration after fork of the new CoW or non-CoW process but before exec (unless vfork or similar were used). This is the model.
If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.
That's mostly papering over design mistake that most syscalls doesn't accept target pid. Otherwise you could just create suspended process, configure it with syscalls that explicitly take target pid, and start it.
Yeah. The right way to eliminate fork() is to make the usual APIs that modify process state take an explicit process handle, so the same APIs can be used to set up an empty process. They can also be composed in other ways, eg for IPC or debugging.
I'm not surprised Chen's patch was rejected; that's an extremely niche usecase not worth supporting. With my shell developer hat on, I agree with the closing "developers would likely welcome a native implementation that isn't (unlike the current implementation) hiding fork() and exec() under the covers".
It sounds like they're interested in the concept though, just not that specific implementation.
Yeah this seems like a promising discussion.
The most astonishing part is that this is dated June 5th, 2026.
I.e. a year that starts with 20, not 19.
There is lots of discussion on this old API here on hacker news, for instance https://news.ycombinator.com/item?id=31739794
There are a lot of slightly different fork-exec-like things in the concept space and it's hard to imagine one approach satisfying them all. IMO it would be interesting to take an approach analogous-ish to sched_ext_ops where you built the rough flow chart of a combined fork-exec, but with hooks built to enable ebpf to change behavior or skip the bits these sophisticated users don't want/need.
Fork/exec is great if you actually want the traditional copy of your process for some reason.
For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.
Especially since I suspect that is by far the more common case. I suspect “I want a clone of me“ is relatively rarely used at this point.
This seems unnecessary to me. In the example, the core of git should be a library yo can link so you don't need to run the binary. That would be better in every way.
But when you use a process, you get tons of things for free, the subtask is invoked in parallel, you get isolation and you can control execution for free. Unless you are already writing a multithreaded program or already accept passing objects in memory, using a process is actually easier to write than using a library.
If I use a library, I also need to start using threads and need to invent some core synchronization mechanism. I essentially are reinventing a small scheduler, when I already get this from the OS for free. Also know any crash in the third-party code will crash the whole program, the third-party code has access to the whole address space. With invoking a process you also have a standardized API implemented by the OS.
There are lots of reasons to want to spawn fresh processes, which aren't solved by linking a library.
Sure, but not many times a second
Every build system ever says hello.
Spawning processes should not be on the hot path of any program.
It ends up on the hot path of programs that use process isolation aggressively
Why? That's a very useful processing primitive.
It’s a hack with many disadvantages. Sometimes a hack is the right answer, but the kernel should it add a primitive for it.
Should bash link in every program the user might want? Load them up as dynamic libraries?
I'm guessing that a big part of the problem with moving away from fork() in general is that each new process needs a copy of the parent process' environment anyway, right?
The LWN article is incorrect in saying that it "must copy the entire process state (including memory) for the child process". There are some kernel structures and page tables that need to be initialized, plus you need a new stack, but it's not nearly as dramatic as implied. Most of the parent's memory is "incorporated by reference", so to speak.
In fact, if you profile it, in the fork() + execve() model, execve() is far more expensive, because not only does it replace the old process with a new one, but it also involves running the dynamic linker, which opens, parses, and mmaps library files.
It still makes sense to get rid of the fork() overhead if you're going to throw away the cloned process state soon thereafter, but if you wanted to make process execution radically faster, rethinking the exec architecture would probably offer more significant gains.
Fork becomes more and more expensive the higher the RSS of the process, roughly 1ms per 1Gb of the process size with 4kb pages. Given that modern servers can easily support 1-2Tb of RAM the fork() part can easily take several hundred milliseconds, blocking everything in the meantime. So for larger programs you kinda have to have a "fork helper" process if you need to execute external programs for some reason.
The kernel does not copy every page, but it does have to copy all of the VMAs. Setting memory to COW (which can involve changing a lot of page-table-entries) is not free either. I guess I could have mentioned copy-on-write explicitly, but I do not believe that what I wrote was incorrect.
I'm a bit naive, but I don't think that's necessarily a requirement.
It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.
Unless I've missed something (which is totally possible, this is not an area of OS design I've spent much time).
But also UID, groups, controlling TTY, process group, capabilities, pipes, shared memory, etc. and the file descriptors while maybe not inherently needed are how a lot of Unix plumbing works.
Even DOS has environment inheritance!
A lot of times you actively don't want the parent environment or any of the memory or file descriptors. And then you have to actively do work to fix all that stuff up after the fork.
the environment is not that big
Maybe tangentially related but I always think it's silly that every linux process has the same libgcc_so.so.1 loaded into memory for each process even though the raw binary for the library is exactly the same so you end up with like 800 copies of libgcc_so.so.1 in memory.
I mean maybe this has been optimized for already and I don't know what I'm talking about but maybe someone with more knowledge about the kernel knows? Is this something we simply can't optimize for because of security implications?
Shared libraries (and mmapped files in general) are deduplicated; it's nowhere near as bad as you think. The kernel loads a .so into memory once and then maps that memory into every process that mmaps it.
Editing to add: this deduplication is one of the greatest upsides to dynamic linking. Common libs like libgcc and libc only have to exist in memory once and can stay in CPU caches, whereas if they were statically linked into every binary, each binary would have a copy of that library that wouldn't be shared with anything else and you'd waste a lot of memory.
Doesn't the loaded code have to be patched for relocations?
It does, so not 100% is reused. The patched parts are in different sections though, so the entire .text (code) section ends up being reused.
Not on modern archs that provide decent support for PIE (position independent executables).
Not if it's position-independent.
Typically libgcc_so.so is loaded by the linker, which uses an mmap call to map the binary into the address space.
> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.
Relevant stack overflow answer: https://stackoverflow.com/questions/61950951/linux-shared-li...
In Linux, when a shared lib is loaded by multiple processes, its loaded once and not duplicated in ram. Only if a memory page is modified by the process will the memory be duplicated. (Hope I have explained that correctly)
Those mappings by default all go to the same shared memory.
Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.
Eh? Aren't shared libraries actually shared in memory?
Yeah, that's kind of the point.
I have a rule for myself. If I think something is silly or stupid, I assume I don't understand it. I usually find I do not understand it, and it no longer seems silly when I do understand it.
In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.
> "If you are repeatedly creating large processes, you are already doing it wrong. The fix is in user space, not the kernel."
Every couple of years, someone claims they have "the solution" implying everyone else who came before them didn't know what they were doing.