I agree with mjw1007 that lack of rigorous API documentation for error paths is a huge weakness and with you that it's not a Linux-only problem.
There are a lot of related filesystem robustness questions I'd love to get authoritative answers on. Neither the Single UNIX Specification nor OS-specific kernel docs / manpages gives enough information to write a robust, performant program, and certainly you can't find one place that gives everything you'd want to know when writing a portable program. For example:
* Does fsync() make guarantees about just the inode, or also the dirent? (iirc on Linux only the inode; for a freshly-created file you also have to fsync() the directory.)
* What does fsync() success guarantee is written to permanent storage? From this whole thing, apparently on recent Linux (even ignoring the bugs) only things written since a previous fsync() failure or the current open() call, whichever was later. Yuck. That's terrible behavior, and even worse for being undocumented.
* Does it even guarantee that if you don't say "Simon says"? On macOS, I gather you need to do this extra F_FULLFSYNC thing. Are the other platforms like that? I dunno. And there are certainly mentions of older hard drives where nothing can be trusted. Is there any database of hard drive behavior? Stress test program to tell if I have a broken model?
* If you do a write and power is lost before fsync, what guarantees do you have about the current state? I was trying to figure out recently whether an N-byte aligned overwrite is guaranteed to reflect either the "old" or the "new" state (for various Ns: 1, 512, 4096, st_blksize). The best I could find is here: <https://stackoverflow.com/a/2068608/23584> which suggests yes for N=512 "these days". Do I trust that? On all platforms? For hard drives made how recently? Etc.
* If you create a file, write to it, and rename() it into place and power is lost before fsync, what guarantees do you have about the current state? Is it guaranteed that the dirent points to either a previous inode (if any) or the new one? If it points to the new one, is the file guaranteed to have the written length or contents? The conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again. But three syncs is getting ridiculous. Is it safe to remove one or more?
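To make the first and third bullets concrete, here is roughly what "fsync() the file, then fsync() the directory" looks like, with the macOS F_FULLFSYNC step folded in. This is a sketch under assumptions, not an authoritative recipe: the helper names (`flush_to_disk`, `fsync_parent_dir`, `create_durably`) are made up for illustration, it assumes a POSIX platform where Python's `fcntl` module is available, and whether the drive actually honors the flush is exactly the open question above.

```python
import fcntl
import os


def flush_to_disk(fd: int) -> None:
    """Ask the OS to push fd's data as far toward stable storage as it can."""
    # F_FULLFSYNC is only defined where the platform has it (macOS).
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return
        except OSError:
            pass  # not supported on this filesystem; fall back to plain fsync()
    os.fsync(fd)


def fsync_parent_dir(path: str) -> None:
    """fsync() the directory holding path, so the new dirent is flushed too."""
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY | getattr(os, "O_DIRECTORY", 0))
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def create_durably(path: str, data: bytes) -> None:
    """Create a brand-new file, then sync both the file and its directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()                  # drain Python's userspace buffer first
        flush_to_disk(f.fileno())  # then the file's data and inode
    fsync_parent_dir(path)         # then the directory entry
```

Something like `create_durably("/some/dir/journal", b"record")` would then issue both syncs. Note that any OSError from fsync() is allowed to propagate; given the behavior described in the second bullet, retrying fsync() after a failure and trusting a later success does not look safe.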
> the conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again.
Afaik the conservative thing is the necessary thing if you're on an ext4 mounted with noauto_da_alloc,data=writeback. I think you can skip the last fsync if you're fine with losing the new version as long as you get the old version in its place.
Thanks for mentioning those options. I found a little more about them in the ext4 manpage.
Thinking about it a little more, I'd expect I could skip the first directory fsync I'd mentioned. Surely the rename can't make it to the directory without the creation getting there, too...
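For reference, the sequence being debated here could be sketched like this, with the two directory syncs as switches. The function and parameter names are hypothetical, and the flags only make the options explicit; the sketch takes no position on which ones are actually safe to turn off on a given filesystem or with given mount options.

```python
import os
import tempfile


def _fsync_dir(dirpath: str) -> None:
    """Open the directory read-only and fsync() it."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def replace_file(path: str, data: bytes, *,
                 sync_dir_before_rename: bool = True,
                 sync_dir_after_rename: bool = True) -> None:
    """Create, write, fsync file, [fsync dir], rename, [fsync dir]."""
    dirpath = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so rename() stays on one
    # filesystem. mkstemp() creates it with mode 0600.
    fd, tmp = tempfile.mkstemp(dir=dirpath, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # sync 1: file contents and inode
        if sync_dir_before_rename:
            _fsync_dir(dirpath)       # sync 2: the temp file's dirent
        os.rename(tmp, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename landed.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    if sync_dir_after_rename:
        _fsync_dir(dirpath)           # sync 3: the renamed dirent
```

With both flags left on this is the full three-sync version. `sync_dir_after_rename=False` corresponds to "fine with losing the new version as long as the old one survives", and `sync_dir_before_rename=False` is the shortcut I'm guessing at above.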
Anyway, I feel like I could come up with a list of questions 10x as long as the one I just gave, but you'd never really get answers, even for a particular drive, OS, and filesystem combo, without expensive testing or source code digging.
But in the end you should code against what POSIX guarantees, not what particular filesystems happen to do, because the next filesystem might use some other leeway the spec provides.