What happens when you grep the file you've redirected grep to? (anniecherkaev.com)
218 points by luu on Jan 23, 2018 | 49 comments


which means that when we do the write in that for-loop, we’re guaranteed that the read which happens after in the while-loop must see the result of that write. So: we read an “a”, write an “a”, and when we get back to the while-loop we’re guaranteed that the next read will see that “a” we just wrote.

In other words, it's behaving like a FIFO that persists a history of everything that's been written to it, which also suggests that there are two pointers, one for the read position and another for the write position; all becomes clear when we realise that it's not reading and writing with the same file descriptor, but rather two file descriptors of one file, opened twice.
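
A minimal sketch of that two-descriptor picture, in Python rather than the article's C. The file is opened twice, once for reading and once for appending, so each descriptor keeps its own offset. The names `chase_own_tail` and `chase_demo` and the `max_steps` cap are made up for illustration; without the cap the loop would not end.

```python
import os
import tempfile


def chase_own_tail(path, max_steps=1000):
    """Copy a file onto its own end, one byte at a time, via two descriptors.

    Returns the number of bytes copied before EOF was seen, or max_steps if
    EOF was never reached (the cap is hypothetical, so the demo terminates).
    """
    rfd = os.open(path, os.O_RDONLY)                 # read offset starts at 0
    wfd = os.open(path, os.O_WRONLY | os.O_APPEND)   # writes always land at the end
    steps = 0
    try:
        while steps < max_steps:
            byte = os.read(rfd, 1)
            if not byte:            # read() returned 0 bytes: end of file
                break
            os.write(wfd, byte)     # unbuffered, so the next read can see it
            steps += 1
    finally:
        os.close(rfd)
        os.close(wfd)
    return steps


def chase_demo():
    """Run the loop on a fresh two-byte file; return (steps, final size)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "test.txt")
        with open(path, "wb") as f:
            f.write(b"a\n")
        steps = chase_own_tail(path, max_steps=50)
        return steps, os.path.getsize(path)
```

Every byte the reader consumes is matched by a byte appended, so the read offset never catches up with the end of the file and only the cap stops the loop.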

The tl;dr of why small sizes terminate and large sizes loop infinitely is this: when the size of data output is not enough to fill the write buffer, the read can reach the end of file (0 bytes read), and then it flushes the write buffer before terminating; but when the output size fills the write buffer, it gets flushed and ends up back in the input, causing the infinite loop.
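
A rough simulation of that tl;dr, assuming a hand-rolled write buffer standing in for the stdio one. It reads one byte at a time, which real grep does not do, and the names `grep_into_self`, `buf_size` and `max_reads` are illustrative only.

```python
import os
import tempfile


def grep_into_self(initial, buf_size, max_reads=10_000):
    """Echo every byte of a file back onto the same file through a write buffer.

    Returns (terminated, final_size). buf_size stands in for the stdio buffer
    size; max_reads is a hypothetical safety cap for the looping case.
    """
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "test.txt")
        with open(path, "wb") as f:
            f.write(initial)
        rfd = os.open(path, os.O_RDONLY)
        wfd = os.open(path, os.O_WRONLY | os.O_APPEND)
        pending = bytearray()          # the simulated write buffer
        terminated = False
        try:
            for _ in range(max_reads):
                byte = os.read(rfd, 1)
                if not byte:           # EOF: nothing was flushed behind us
                    terminated = True
                    break
                pending += byte
                if len(pending) >= buf_size:
                    # Buffer full: flushing feeds the output back into the input.
                    os.write(wfd, bytes(pending))
                    pending.clear()
            if terminated and pending:
                os.write(wfd, bytes(pending))   # final flush on exit
        finally:
            os.close(rfd)
            os.close(wfd)
        return terminated, os.path.getsize(path)
```

In this sketch a file smaller than the buffer ends up exactly doubled, while a file at least as large as the buffer keeps growing until the cap is hit.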

The C stdio buffering layer doesn't guarantee that data written with a fwrite() will show up in a fread() from a different file handle that references the same file (and now that I think about it, that would seem to be quite difficult to guarantee.)


It's not overly difficult to guarantee; the UNIX/Linux kernel does it just fine. It does require more work, in that the libc file cache will need to invalidate itself if the source file / socket changes state (either via atime if the filesystem supports it or via inotify/fanotify). Obviously invalidating the cache will lead to lower performance but more correct behaviour.


I did this before when using ripgrep, and accidentally created a several hundred GB file. Now ripgrep checks if this is the case and handles it gracefully: https://github.com/BurntSushi/ripgrep/pull/310


A really fun one is having `bash` execute its output using a FIFO. It worked when I tried it last, but it'd obviously be more than crazy to rely on it (for more than one reason.)


Or for even more fun, self-modifying shell scripts:

https://stackoverflow.com/questions/3398258/edit-shell-scrip... (curiously, the accepted answer is incorrect and downvoted highly)

...and Windows batch files:

https://stackoverflow.com/questions/906586/changing-a-batch-...


Self-modifying code is wandering into the (very interesting IMO) realms of metamorphic and polymorphic code. I started an SO question here: https://stackoverflow.com/questions/10113254/metamorphic-cod...

I'll try to find some time soon to port that C+inline assembly example to Linux; I didn't know much C or ASM back when I originally asked the question.


Reminds me of this old chestnut:

type foo.bat >> foo.bat


A variation you can run into with GNU grep:

  grep -r [something] . >foo
I've done this by accident once or twice :-)


The way I've done it:

  grep -r [something] * > foo
What I now do instead:

  grep -r [something] * > .foo
  mv .foo foo
Why it works: At least on bash, "*" does not match hidden files (files that begin with "." such as ".foo").
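
The same convention can be seen outside the shell. This sketch uses Python's `glob` module, which also skips names beginning with "." for a bare `*`; it illustrates the rule rather than testing bash itself, and the helper names are made up.

```python
import glob
import os
import tempfile


def star_matches(directory):
    """Return the names that the pattern '*' matches inside directory, sorted."""
    # glob.escape guards against glob characters in the directory path itself.
    pattern = os.path.join(glob.escape(directory), "*")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


def star_demo():
    """Create two visible files and one dotfile, then list what '*' matches."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("a.txt", "b.txt", ".foo"):
            open(os.path.join(d, name), "w").close()
        return star_matches(d)
```

Here `.foo` exists in the directory but is not in the returned list, which is why writing the output to `.foo` keeps it out of the `*` expansion.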


Reminds me of how easy it is to destroy a file with sed:

  sed 's/a/b/' test.txt > test.txt
After that command the file is empty no matter what was in there before. Instead you have to use the -i flag like:

  sed 's/a/b/' -i test.txt
And different operating systems seem to behave differently:

https://stackoverflow.com/questions/5171901/sed-command-find...


That's not due to sed, but the shell; the > operator always opens the output file for writing in create-if-not-exists, truncate-existing mode, and that redirection processing happens before any commands are executed.
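
A small sketch of that ordering in Python, assuming the shell's `>` behaves like an open with `O_WRONLY | O_CREAT | O_TRUNC`. The function names are made up, and `bytes.replace` is only a stand-in for sed.

```python
import os
import tempfile


def redirect_onto_input(path):
    """Mimic `cmd path > path`: open the output first, then let the command read.

    Returns the bytes the 'command' managed to read from its input.
    """
    # What the shell does for `>` before the command is executed.
    out_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        with open(path, "rb") as src:      # the command only now starts reading
            data = src.read()
        os.write(out_fd, data.replace(b"a", b"b"))   # stand-in for s/a/b/
    finally:
        os.close(out_fd)
    return data


def truncate_demo():
    """Return (bytes seen by the command, final file size)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "test.txt")
        with open(path, "wb") as f:
            f.write(b"a\n")
        seen = redirect_onto_input(path)
        return seen, os.path.getsize(path)
```

The input is already empty by the time the command reads it, so there is nothing left to transform.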


sed -i also has the less obvious advantage that it is atomic.

(It writes to a temporary file, and then atomically renames it to the final location. Whereas if sed expr src > dst is killed, dst may be corrupt. You'd need to handle temporary file management yourself if your script needs to be reliable.)
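
For scripts that need the same guarantee, the temp-file-then-rename pattern could look roughly like this in Python. It is a sketch, not sed's actual implementation; `rewrite_in_place` and `transform` are hypothetical names.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace path's contents with transform(old_text) via temp file + rename."""
    with open(path, "r", encoding="utf-8") as src:
        new_text = transform(src.read())
    # Create the temp file next to the target so the rename stays on one
    # filesystem. Note: mkstemp uses mode 0600, so permissions are not preserved.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
        os.replace(tmp_path, path)     # atomic rename on POSIX
    except BaseException:
        # Anything went wrong before the rename: leave the original untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader of `path` sees either the old contents or the new ones, never a half-written file.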


My muscle memory is sed -i -e 's/pattern/replacement/' file; adding -e was perhaps a subconscious way of protecting against this eventuality, by mentally separating it from my standard pipeline invocation, which takes neither argument.

Anyway sed is so fast you should definitely check the output before overwriting stuff. Same goes for anything on the command line.

Get ZFS. Do snapshots.


> you should definitely check the output before overwriting stuff

That wouldn't really help in the case of redirecting sed to the same file.

    $ echo "a" > test.txt
    $ sed s/a/b/ test.txt
    b
    $ sed s/a/b/ test.txt > test.txt
    $ cat test.txt
    $


Indeed. Use >> in preference to > at all times ;)



Maybe this is just me, but when I enable that kind of option I end up bypassing it automatically (like typing rm -f every time if rm has been aliased to rm -i).

I prefer having a way of reverting changes rather than something that gets in the way.


As a sidenote, if you're using a command that doesn't have -i, or a chain of commands where the file is read at the start and written at the end, you can use sponge(1) from moreutils to achieve that without temporary files:

  sed "s/root/toor/" /etc/passwd | grep -v joey | sponge /etc/passwd
https://joeyh.name/code/moreutils/
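
The idea behind it can be sketched in a few lines of Python: consume all of the input first, and only then open the destination. This is an illustration of the principle, not the real tool's implementation; `soak_then_write` and `lines_without` are made-up names.

```python
def soak_then_write(chunks, dest):
    """Soak up every chunk from the iterable before opening dest for writing."""
    soaked = b"".join(chunks)          # consume the whole pipeline first
    with open(dest, "wb") as out:      # only now is dest truncated
        out.write(soaked)
    return len(soaked)


def lines_without(path, needle):
    """Lazily yield the lines of path that do not contain needle (like grep -v)."""
    with open(path, "rb") as src:
        for line in src:
            if needle not in line:
                yield line
```

Calling `soak_then_write(lines_without("passwd.txt", b"joey"), "passwd.txt")` on a hypothetical file is safe because the generator is fully drained before the file is reopened for writing.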


ksh has ‘>;’, but bash hasn't picked that up. This better fits the compositional Unix philosophy than sprinkling a subset of individual commands with the functionality.


Does >; have an ascii name? Symbol Hound isn't bringing up any useful results.


I don't know what you mean by ASCII name. It's two characters, ‘>’ (ASCII 03/14) followed by ‘;’ (03/11). I guess the ‘;’ is mnemonic for the sequential nature of the operation, i.e. renaming the temporary file after the command completes, as well as forming a combination that is a syntax error in sh.


When I said ASCII I actually meant alphanumerical, sorry. Basically I couldn't find any information about it on search engines, so I was wondering if it had a name that I could use to google it.


I think the man page¹ is the only readily available documentation.

  >;word  Write output to a temporary file. If the command
          completes successfully rename it to word, otherwise,
          delete the temporary file. >;word cannot be used
          with the exec(2) built-in.
¹ e.g. https://www.freebsd.org/cgi/man.cgi?query=ksh93&apropos=0&se...


"apt install moreutils". Then look up the "sponge" tool


Stupid question: in the cases where the input is a regular file, why doesn’t grep check the size of the file at the beginning? Is it actually desired behavior to search through stuff added to the file while we are processing the beginning of it?


The rules are that you open the file and read until you get end-of-file. If you don’t follow the rules, unexpected things go wrong. For instance, it may not even be possible to get the file size.

Relying on the size is not suitable for a program like grep.


In Unix we have the idiom of "Everything Is A File". As such, when grep gets a file descriptor, it cannot always get the file size; the size might be infinite, or might vary over time. Instead, like others have said, the only way to know you've read 'until the end' is to keep reading until read reports you've reached the end of the file.

Consider for example what should happen when grepping /dev/null/? Or, a more sensible case, piping the output of some command to grep. Grep will read from 'standard in' which is "Just A File" so it just calls read until it reports end of file.
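
A minimal sketch of that read-until-EOF loop in Python, run against a pipe, where there is no meaningful file size to ask for up front. The function names are illustrative.

```python
import os


def count_bytes_until_eof(fd, chunk_size=4096):
    """Read from fd until read() returns zero bytes; return the total read."""
    total = 0
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:              # zero bytes is how end-of-file is reported
            return total
        total += len(chunk)


def pipe_demo():
    """Write twelve bytes into a pipe, close the write end, count what arrives."""
    r, w = os.pipe()
    os.write(w, b"hello\nworld\n")
    os.close(w)                    # closing the write end is what produces EOF
    try:
        return count_bytes_until_eof(r)
    finally:
        os.close(r)
```

The reader never needs to know how much data is coming; it only needs `read` to eventually return nothing.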


> Consider for example what should happen when grepping /dev/null/

In the solution I gave, same thing as before because I said regular file. That is if you can open /dev/null for reading, which I thought you couldn't.

I think I'd be fine with saying that the infinite loop is the behavior that happens. GNU's grep obviously is doing some janky check which it shouldn't be. Consider what it will do when you get just slightly more clever and it can't determine the input/output types. The one on macOS has arbitrary file size-dependent behavior, which is problematic. Doesn't seem like either of them does a consistent thing.


But the article shows GNU grep actually looks ‘behind the curtain’ and identifies that input and output are the same file, then refuses to play. So everything is a file, but some are more filey than others.


> And about 15 up-arrow+enters later

Gonna blow your mind here: Press up once and then Ctrl-O.


Nice, works for `/bin/bash` but not `zsh`. Still a nice trick.


It works in at least version 5.2 of the Z shell, where Control-O is bound to accept-line-and-down-history.


The section about the guarantees of read/write doesn't seem entirely correct to me. For sure it's relevant to why the data is there, but the reason it doesn't terminate is just because read doesn't tell you there's an eof until you try to read again past the end. It would be entirely possible to construct a version of this loop that would terminate. Though it would be awkward.
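
One way such a terminating loop could be built, not necessarily the one this commenter had in mind, is to snapshot the file size up front (the check asked about elsewhere in this thread) and never read past it. A sketch in Python with a made-up function name:

```python
import os


def append_copy_of_self(path, chunk_size=4096):
    """Append a file's original contents to itself; return the bytes copied."""
    rfd = os.open(path, os.O_RDONLY)
    wfd = os.open(path, os.O_WRONLY | os.O_APPEND)
    copied = 0
    try:
        remaining = os.fstat(rfd).st_size   # size at the start, not a moving target
        while remaining > 0:
            chunk = os.read(rfd, min(chunk_size, remaining))
            if not chunk:                   # file shrank underneath us
                break
            os.write(wfd, chunk)
            remaining -= len(chunk)
            copied += len(chunk)
    finally:
        os.close(rfd)
        os.close(wfd)
    return copied
```

This only works for regular files, where `fstat` reports a usable size, which is the awkward part.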


>echo "a" > test.txt

test.txt will contain "a\n" not just "a"

Use echo -n to disable adding the \n


good one..

when I checked on Linux, both `cat` and `grep` give an error when the input file name is the same as the output... but not `sed/awk/head/tail/sort/etc`..


In most cases, it is too late anyway at that point, as the shell redirection already truncated the file. Such a check might only help you realize you just deleted the text you wanted to process.


yeah agree..

here I was specifically checking append, where it doesn't get truncated...


If you redirect output, how can the command know the name of the destination?


Redirecting (in simple terms) is setting a file descriptor. So the file descriptor can be tested if it partakes in the input.

A quote for you [0]: Under normal circumstances every UNIX program has three streams opened for it when it starts up, one for input, one for output, and one for printing diagnostic or error messages. These are typically attached to the user's terminal (see tty(4)) but might instead refer to files or other devices, depending on what the parent process chose to set up. (See also the "Redirection" section of sh(1).)

[0]: https://linux.die.net/man/3/stdout


fstat(2) on the output file descriptor, stat(2) on the input file name, see if they have the same block device and inode numbers.
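
In Python that check might be sketched as follows. `output_is_input` is a hypothetical helper, not code from grep or cat, and the standard library's `os.path.samestat` performs the same comparison.

```python
import os
import tempfile


def output_is_input(input_path, out_fd):
    """Return True if input_path names the same file that out_fd refers to."""
    try:
        in_stat = os.stat(input_path)      # stat(2) on the input file name
    except OSError:
        return False                       # missing input cannot be the output
    out_stat = os.fstat(out_fd)            # fstat(2) on the output descriptor
    # Same device and same inode means same underlying file.
    return (in_stat.st_dev, in_stat.st_ino) == (out_stat.st_dev, out_stat.st_ino)
```

A tool would call this with file descriptor 1 (standard output) for each input file name and skip or reject any match.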


Somewhat more common use of this is to check if output goes to terminal so that terminal color codes don't get printed to files.


Which is really annoying when you want to pipe grep to less, and need to add --color=always to get grep to understand less takes color codes (when using less -R).
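
The terminal check and its override can be sketched like this in Python. The `highlight` function and its `force` parameter are hypothetical, with `force` playing the role of --color=always.

```python
import sys

GREEN = "\033[32m"   # ANSI escape: green foreground
RESET = "\033[0m"    # ANSI escape: reset attributes


def highlight(text, stream=None, force=False):
    """Wrap text in color codes only if stream is a terminal, or if forced."""
    stream = sys.stdout if stream is None else stream
    if force or stream.isatty():
        return f"{GREEN}{text}{RESET}"
    return text      # plain text for files and pipes
```

When the output is a pipe into a pager, `isatty()` is false, so the caller has to pass the override to keep the colors.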


I did enjoy how about halfway down the author bemoaned the fact that the C language does not have line numbers. (-:


It was phrased poorly, but I think they meant that Apple's source browser doesn't show line numbers.


Brilliant write up. Was very easy to grasp!


Some things are not meant to be questioned.


Agreed, this analysis helps nobody.


Ugh. Flashbacks of my older brother's torment when he wanted to use the computer and I wouldn't get out of the way fast enough.

"Quit grepping yourself! Quit grepping yourself!"

#IKnowYourAreButWhoAmI?


Ah, the oldschool version of "quit googling yourself."



