
The Linux project takes the view « We don't attempt to rigorously document our API; instead we promise that if your program worked yesterday it will continue to work in the future. »

I think this story shows a weakness in that approach: for rarely-exercised error handling paths, it's too likely that your program didn't work yesterday and you had no easy way to know that.

(This is a separate issue from the fact that until recently the kernel implementation of fsync itself had significant bugs, measured against what its maintainers thought ought to be guaranteed.)



Except that this issue (both the behavior and lack of exact docs) applies to other kernels, not just Linux. See https://wiki.postgresql.org/wiki/Fsync_Errors

So no, this is not just about Linux.


I agree with mjw1007 that lack of rigorous API documentation for error paths is a huge weakness and with you that it's not a Linux-only problem.

There are a lot of related filesystem robustness questions I'd love to get authoritative answers on. Neither the Single UNIX Specification nor OS-specific kernel docs / manpages gives enough information to write a robust, performant program, and certainly you can't find one place that gives everything you'd want to know when writing a portable program. For example:

* Does fsync() make guarantees about just the inode, or also the dirent? (iirc on Linux only the inode; for a freshly-created file you also have to fsync() the directory.)

* What does fsync() success guarantee is written to permanent storage? From this whole thing, apparently on Linux as of recently (even ignoring the bugs), only things written since a previous fsync() failure or the current open() call, whichever was later. Yuck. That's a terrible behavior, and even worse for being undocumented.

* Does it even guarantee that if you don't say "Simon says"? On macOS, I gather you need to do this extra F_FULLFSYNC thing. Are the other platforms like that? I dunno. And there are certainly mentions of older hard drives where nothing can be trusted. Is there any database of hard drive behavior? Stress test program to tell if I have a broken model?

* If you do a write and power is lost before fsync, what guarantees do you have about the current state? I was trying to figure out recently if an N-byte aligned overwrite is guaranteed to reflect the "old" or "new" states (for various Ns: 1, 512, 4096, st_blksize). The best I could find is here: <https://stackoverflow.com/a/2068608/23584> which suggests yes for N=512 "these days". Do I trust that? on all platforms? for hard drives made how recently? etc.

* If you create a file, write to it, and rename() it into place and power is lost before fsync, what guarantees do you have about the current state? is it guaranteed that the dirent points to either a previous inode (if any) or the new one? if it points to the new one, is the file guaranteed to have the write length or contents? the conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again. But three syncs is getting ridiculous. Is it safe to remove one or more?
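For concreteness, here is a minimal Python sketch of that conservative sequence (create, write, fsync the file, fsync the directory, rename, fsync the directory again). The names `sync_fd`, `sync_dir`, `replace_durably` and the temp-file naming are made up for illustration, and the F_FULLFSYNC branch is only an assumption about what one would want on macOS. It shows the order of the calls; it doesn't answer whether any of the syncs can safely be dropped.

```python
import fcntl
import os


def sync_fd(fd):
    """Ask the OS to flush fd to stable storage."""
    # Assumption: where fcntl exposes F_FULLFSYNC (macOS), prefer it;
    # on other platforms the constant is absent and plain fsync() is used.
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return
        except OSError:
            pass  # not supported on this filesystem; fall back to fsync()
    os.fsync(fd)


def sync_dir(path):
    """fsync() a directory so that changes to its entries are flushed."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def replace_durably(dest, data):
    """Write data to a temp file, then rename it over dest, syncing each step."""
    dirname = os.path.dirname(os.path.abspath(dest))
    # Hypothetical temp name; a real program would want a unique one.
    tmp = os.path.join(dirname, "." + os.path.basename(dest) + ".tmp")

    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                 # os.write() may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        sync_fd(fd)                 # sync 1: the file's contents and inode
    finally:
        os.close(fd)

    sync_dir(dirname)               # sync 2: the temp file's directory entry
    os.rename(tmp, dest)            # replace the old name with the new inode
    sync_dir(dirname)               # sync 3: the rename itself
```

Usage would be something like `replace_durably("config.json", b"{}")`. Any OSError from a sync is left to propagate rather than retried, since (per the point above) a later successful fsync() apparently says nothing about data written before the failure.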


> the conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again.

Afaik the conservative thing is the necessary thing if you're on an ext4 mounted with noauto_da_alloc,data=writeback. I think you can skip the last fsync if you're fine with losing the new version as long as you get the old version in its place.


Thanks for mentioning those options. I found a little more about them in the ext4 manpage.

Thinking about it a little more, I'd expect I could skip the first directory fsync I'd mentioned. Surely the rename can't make it to the directory without the creation getting there, too...

Anyway, I feel like I could come up with a list of questions 10x as long as the one I just gave, but you'd never really get answers, even for a particular drive, OS, fs combo, without expensive testing or source code digging.


Yes, it's an area of research, https://www.usenix.org/system/files/conference/osdi14/osdi14...

But in the end you should code against what POSIX guarantees, not what particular filesystems happen to do, because the next filesystem might use some other leeway the spec provides.



