which means that when we do the write in that for-loop, we’re guaranteed that the read which happens after in the while-loop must see the result of that write. So: we read an “a”, write an “a”, and when we get back to the while-loop we’re guaranteed that the next read will see that “a” we just wrote.
In other words, it's behaving like a FIFO that persists a history of everything that's been written to it, which also suggests that there are two pointers, one for the read position and another for the write position; all becomes clear when we realise that it's not reading and writing with the same file descriptor, but rather two file descriptors of one file, opened twice.
The tl;dr of why small sizes terminate and large sizes loop infinitely is this: when the size of data output is not enough to fill the write buffer, the read can reach the end of file (0 bytes read), and then it flushes the write buffer before terminating; but when the output size fills the write buffer, it gets flushed and ends up back in the input, causing the infinite loop.
The C stdio buffering layer doesn't guarantee that data written with a fwrite() will show up in a fread() from a different file handle that references the same file (and now that I think about it, that would seem to be quite difficult to guarantee.)
It's not overly difficult to guarantee; the UNIX/Linux kernel does it just fine. It does require more work, in that the libc file cache would need to invalidate itself when the source file or socket changes state (either via mtime, if the filesystem supports it, or via inotify/fanotify). Obviously invalidating the cache leads to lower performance, but more correct behaviour.
I did this before when using ripgrep, and accidentally created a several hundred GB file. Now ripgrep checks if this is the case and handles it gracefully: https://github.com/BurntSushi/ripgrep/pull/310
A really fun one is having `bash` execute its output using a FIFO. It worked when I tried it last, but it'd obviously be more than crazy to rely on it (for more than one reason.)
I'll try to find some time soon to port that C+inline assembly example to Linux, I didn't know much C or ASM back when I originally asked the question.
That's not due to sed, but the shell; the > operator always opens the output file for writing in create-if-not-exists, truncate-existing mode, and that redirection processing happens before any commands are executed.
sed -i also has the less obvious advantage that it is atomic.
(It writes to a temporary file, and then atomically renames it to the final location. Whereas if sed expr src > dst is killed, dst may be corrupt. You'd need to handle temporary file management yourself if your script needs to be reliable.)
My habit is to type sed -i -e 's/pattern/replacement/' file; adding -e was perhaps a subconscious way to protect against this eventuality, by mentally segregating it from my standard pipeline invocation, which takes neither argument.
Anyway sed is so fast you should definitely check the output before overwriting stuff. Same goes for anything on the command line.
Maybe this is just me, but when I enable that kind of option I end up developing the habit of bypassing it (like typing rm -f every time once rm has been aliased to rm -i).
I prefer having a way of reverting changes rather than something that gets in the way.
As a sidenote, if you're using a command that doesn't have -i, or a chain of commands where the file is read at the start and written at the end, you can use sponge(1) from moreutils to achieve that without temporary files:
sed "s/root/toor/" /etc/passwd | grep -v joey | sponge /etc/passwd
ksh has ‘>;’, but bash hasn't picked that up. This better fits the compositional Unix philosophy than sprinkling a subset of individual commands with the functionality.
I don't know what you mean by ASCII name. It's two characters, ‘>’ (ASCII 03/14) followed by ‘;’ (03/11). I guess the ‘;’ is mnemonic for the sequential nature of the operation, i.e. renaming the temporary file after the command completes, as well as forming a combination that is a syntax error in sh.
When I said ASCII I actually meant alphanumeric, sorry. Basically I couldn't find any information about it on search engines, so I was wondering whether it had a name I could use to google it.
I think the man page¹ is the only readily available documentation.
>;word Write output to a temporary file. If the command
completes successfully rename it to word, otherwise,
delete the temporary file. >;word cannot be used
with the exec(2) built-in.
Stupid question: in the cases where the input is a regular file, why doesn't grep check the size of the file at the beginning? Is it actually desired behaviour to search through stuff appended to the file while we are processing the beginning of it?
The rules are that you open the file for reading until you get end-of-file. If you don’t follow the rules unexpected things go wrong. For instance it may not even be possible to get the file size.
In Unix we have the idiom of "Everything Is A File".
As such, when grep gets a file descriptor, it cannot always get the file-size, the file-size might be infinite, or might vary over time.
Instead, like others have said, the only way to know you've read 'until the end' is when read reports you've reached the end of the file.
Consider for example what should happen when grepping /dev/null?
Or, a more sensible case, piping the output of some command to grep.
Grep will read from 'standard in' which is "Just A File" so it just calls read until it reports end of file.
> Consider for example what should happen when grepping /dev/null
In the solution I gave, same thing as before because I said regular file. That is if you can open /dev/null for reading, which I thought you couldn't.
I think I'd be fine with saying that the infinite loop is the behavior that happens. GNU's grep obviously is doing some janky check which it shouldn't be. Consider what it will do when you get just slightly more clever and it can't determine the input/output types. The one on macOS has arbitrary file size-dependent behavior, which is problematic. Doesn't seem like either of them does a consistent thing.
But the article shows GNU grep actually looks ‘behind the curtain’ and identifies that input and output are the same file, then refuses to play. So everything is a file, but some are more filey than others.
The section about the guarantees of read/write doesn't seem entirely correct to me. For sure it's relevant to why the data is there, but the reason it doesn't terminate is just because read doesn't tell you there's an eof until you try to read again past the end. It would be entirely possible to construct a version of this loop that would terminate. Though it would be awkward.
In most cases, it is too late anyway at that point, as the shell redirection already truncated the file. Such a check might only help you realize you just deleted the text you wanted to process.
Redirection (in simple terms) just sets up a file descriptor, so that descriptor can be tested to see whether it refers to the same file as the input.
A quote for you [0]:
Under normal circumstances every UNIX program has three streams opened for it when it starts up, one for input, one for output, and one for printing diagnostic or error messages. These are typically attached to the user's terminal (see tty(4)) but might instead refer to files or other devices, depending on what the parent process chose to set up. (See also the "Redirection" section of sh(1).)
Which is really annoying when you want to pipe grep to less, and you have to add --color=always to make grep emit the colour codes at all (and use less -R so less renders them).