To step back a bit: the device still has a filesystem on it, and the structures described here are files within that filesystem? It's just that you're able to write directly into them, bypassing the filesystem layer, because you've constrained yourself to writes that don't require updating other parts of the filesystem structure?
> fsync doesn’t just sync the file’s data, it syncs every piece of metadata the file depends on: ... directory entry
Famously not, as the man page says.
It is also said later in the article:
> POSIX strictly requires a parent-directory fsync to make a newly created file’s existence durable.
So I'm not sure why the dirent sync is claimed earlier.
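For anyone following along, here is a minimal sketch of what the man page point means in practice: fsync on the file descriptor covers the file's data and its own metadata, while the directory entry needs a separate fsync on the parent directory. The helper name `create_durably` is made up for illustration, and this assumes Linux-like behaviour where a directory can be opened read-only and fsynced.

```python
import os


def create_durably(path: str, data: bytes) -> None:
    """Create a new file and flush both its contents and its directory entry.

    Hypothetical helper for illustration; not taken from the article.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            view = view[os.write(fd, view):]
        # Flushes the file's data and inode, not the parent's directory entry.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Second step: fsync the parent directory so the new name itself is flushed.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the second fsync is the case the quoted POSIX sentence is about: the data may be on disk while the name that points to it is not.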
Author here. This is not a general argument against fsync; the design depends on SSD-only deployment, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.
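A rough sketch of the kind of write path those constraints describe, not the actual engine: the file is created and zero-filled once up front, and later writes overwrite one aligned block in place with O_DIRECT, so the file size and block allocation never change. The 4096-byte block size, the function names, and the fallback to a buffered write where O_DIRECT is rejected (some filesystems return EINVAL) are all assumptions made for the example. Whether a completed O_DIRECT write is actually durable still depends on the device guarantees mentioned above.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed block size and alignment; real devices may differ


def preallocate(path: str, blocks: int) -> None:
    """One-time setup: write real zeros so every block is allocated up front."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        zeros = bytes(BLOCK)
        for _ in range(blocks):
            if os.write(fd, zeros) != BLOCK:
                raise OSError("short write during preallocation")
        os.fsync(fd)  # setup still uses fsync; only the hot path avoids it
    finally:
        os.close(fd)


def _pwrite_once(path: str, flags: int, buf, offset: int) -> int:
    fd = os.open(path, flags)
    try:
        return os.pwrite(fd, buf, offset)
    finally:
        os.close(fd)


def overwrite_block(path: str, index: int, payload: bytes) -> int:
    """Overwrite one preallocated block in place; returns bytes written."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    # An anonymous mmap is page-aligned, which O_DIRECT needs for its buffer.
    buf = mmap.mmap(-1, BLOCK)
    try:
        buf.write(payload)  # the rest of the block stays zero
        direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
        try:
            return _pwrite_once(path, os.O_WRONLY | direct, buf, index * BLOCK)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise
            # Filesystem rejected O_DIRECT: buffered write, illustration only.
            return _pwrite_once(path, os.O_WRONLY, buf, index * BLOCK)
    finally:
        buf.close()
```

Because every write lands inside the range that was already zero-filled, nothing here extends the file or allocates new blocks.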
Working with files is hard [1], and most of the complexity comes from the fsync API. I am glad it can be eliminated from a KV storage engine.
[1] https://news.ycombinator.com/item?id=42805425
Almost full-circle back to when Oracle took over the entire volume and implemented its own filesystem.
I wonder why this is not more common. LVM is easy to set up, and it's already common to allocate volumes for things like disk images for VMs, so why not databases?
Because the speed increase is, on modern, properly tuned filesystems, surprisingly small, due to how RDBMSs manage their pool: by working on large container files, they avoid most of the filesystem overhead.