which means that when we do the write in that for-loop, we’re guaranteed that the read which happens after in the while-loop must see the result of that write. So: we read an “a”, write an “a”, and when we get back to the while-loop we’re guaranteed that the next read will see that “a” we just wrote.
In other words, it's behaving like a FIFO that persists a history of everything that's been written to it, which also suggests that there are two pointers, one for the read position and another for the write position; all becomes clear when we realise that it's not reading and writing with the same file descriptor, but rather two file descriptors of one file, opened twice.
The tl;dr of why small sizes terminate and large sizes loop infinitely is this: when the size of data output is not enough to fill the write buffer, the read can reach the end of file (0 bytes read), and then it flushes the write buffer before terminating; but when the output size fills the write buffer, it gets flushed and ends up back in the input, causing the infinite loop.
The C stdio buffering layer doesn't guarantee that data written with a fwrite() will show up in a fread() from a different file handle that references the same file (and now that I think about it, that would seem to be quite difficult to guarantee.)
It's not overly difficult to guarantee; the UNIX/Linux kernel does it just fine. It does require more work, in that the libc file cache would need to invalidate itself when the source file or socket changes state (either via mtime, if the filesystem supports it, or via inotify/fanotify). Obviously invalidating the cache leads to lower performance, but more correct behaviour.
I did this before when using ripgrep, and accidentally created a several hundred GB file. Now ripgrep checks if this is the case and handles it gracefully: https://github.com/BurntSushi/ripgrep/pull/310
A really fun one is having `bash` execute its output using a FIFO. It worked when I tried it last, but it'd obviously be more than crazy to rely on it (for more than one reason.)
I'll try to find some time soon to port that C+inline assembly example to Linux, I didn't know much C or ASM back when I originally asked the question.
That's not due to sed, but the shell; the > operator always opens the output file for writing in create-if-not-exists, truncate-existing mode, and that redirection processing happens before any commands are executed.
sed -i also has the less obvious advantage that it is atomic.
(It writes to a temporary file, and then atomically renames it to the final location. Whereas if sed expr src > dst is killed, dst may be corrupt. You'd need to handle temporary file management yourself if your script needs to be reliable.)
My habit is to type sed -i -e 's/pattern/replacement/' file; adding -e was perhaps a subconscious way to protect against this eventuality, by mentally segregating it from my standard pipeline invocation, which takes neither argument.
Anyway sed is so fast you should definitely check the output before overwriting stuff. Same goes for anything on the command line.
Maybe this is just me, but when I enable that kind of option I end up developing the habit of bypassing it (like typing rm -f every time once rm has been aliased to rm -i).
I prefer having a way of reverting changes rather than something that gets in the way.
As a sidenote, if you're using a command that doesn't have -i, or a chain of commands where the file is read at the start and written at the end, you can use sponge(1) from moreutils to achieve that without temporary files:
sed "s/root/toor/" /etc/passwd | grep -v joey | sponge /etc/passwd
ksh has ‘>;’, but bash hasn't picked that up. This better fits the compositional Unix philosophy than sprinkling a subset of individual commands with the functionality.
I don't know what you mean by ASCII name. It's two characters, ‘>’ (ASCII 03/14) followed by ‘;’ (03/11). I guess the ‘;’ is mnemonic for the sequential nature of the operation, i.e. renaming the temporary file after the command completes, as well as forming a combination that is a syntax error in sh.
When I said ASCII I actually meant alphanumeric, sorry. Basically I couldn't find any information about it on search engines, so I was wondering whether it had a name I could use to google it.
I think the man page¹ is the only readily available documentation.
>;word Write output to a temporary file. If the command
completes successfully rename it to word, otherwise,
delete the temporary file. >;word cannot be used
with the exec(2) built-in.
Stupid question: in the cases where the input is a regular file, why doesn't grep check the size of the file at the beginning? Is it actually desired behaviour to search through stuff appended to the file while we are processing the beginning of it?
The rules are that you open the file for reading until you get end-of-file. If you don’t follow the rules unexpected things go wrong. For instance it may not even be possible to get the file size.
In Unix we have the idiom of "Everything Is A File".
As such, when grep gets a file descriptor, it cannot always get the file-size, the file-size might be infinite, or might vary over time.
Instead, like others have said, the only way to know you've read 'until the end' is when read reports you've reached the end of the file.
Consider for example what should happen when grepping /dev/null?
Or, a more sensible case, piping the output of some command to grep.
Grep will read from 'standard in' which is "Just A File" so it just calls read until it reports end of file.
> Consider for example what should happen when grepping /dev/null
In the solution I gave, same thing as before because I said regular file. That is if you can open /dev/null for reading, which I thought you couldn't.
I think I'd be fine with saying that the infinite loop is the behavior that happens. GNU's grep obviously is doing some janky check which it shouldn't be. Consider what it will do when you get just slightly more clever and it can't determine the input/output types. The one on macOS has arbitrary file size-dependent behavior, which is problematic. Doesn't seem like either of them does a consistent thing.
But the article shows GNU grep actually looks ‘behind the curtain’ and identifies that input and output are the same file, then refuses to play. So everything is a file, but some are more filey than others.
The section about the guarantees of read/write doesn't seem entirely correct to me. For sure it's relevant to why the data is there, but the reason it doesn't terminate is just because read doesn't tell you there's an eof until you try to read again past the end. It would be entirely possible to construct a version of this loop that would terminate. Though it would be awkward.
In most cases, it is too late anyway at that point, as the shell redirection already truncated the file. Such a check might only help you realize you just deleted the text you wanted to process.
Redirection (in simple terms) just sets up a file descriptor, so that descriptor can be tested to see whether it refers to the same file as the input.
A quote for you [0]:
Under normal circumstances every UNIX program has three streams opened for it when it starts up, one for input, one for output, and one for printing diagnostic or error messages. These are typically attached to the user's terminal (see tty(4)) but might instead refer to files or other devices, depending on what the parent process chose to set up. (See also the "Redirection" section of sh(1).)
Which is really annoying when you want to pipe grep to less, and you have to add --color=always to make grep emit the colour codes at all (and use less -R so less renders them).