Four Column ASCII (2017) (garbagecollected.org)
142 points by petee on Sept 25, 2019 | 40 comments


Fun fact: the delete character was designed to be all 1s in order to allow for human errors: "If a character was punched erroneously, punching out all seven bits caused this position to be ignored or deleted" [1]

[1] https://en.wikipedia.org/wiki/Delete_character
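The all-ones design is easy to check in a few lines of Python. This is only a sketch: `punch_over` is a hypothetical helper name, and it models a tape position as a 7-bit integer in which a set bit is a punched hole.

```python
DEL = 0x7F  # ASCII DEL: all seven bits set, i.e. every hole punched


def punch_over(existing: int) -> int:
    """Punch all seven holes over an already-punched 7-bit code.

    Holes can only be added to paper tape, never removed, so
    overpunching behaves like a bitwise OR.
    """
    return (existing | DEL) & 0x7F


# Whatever was punched by mistake, overpunching turns it into DEL.
for ch in ("A", "z", "7", "\t"):
    assert punch_over(ord(ch)) == DEL
```

Because OR can only add bits, DEL is the one code that any mistaken punch can always be turned into.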


We called it RUBOUT back in the day.

When you were punching a paper tape, you couldn't "delete" a character, unless you wanted to risk cutting and splicing the tape, and good luck with not jamming the reader.

But by convention, a RUBOUT character (with all holes punched) would be ignored. So if you made a mistake you could backspace the tape and punch RUBOUT.

RUBOUT had another useful purpose. It's well known that the DOS/Windows line ending convention (CR/LF: Carriage Return/Line Feed) comes from the days of Teletypes, but what is less known is that we didn't actually just punch CR and LF. Sometimes on a Teletype that wasn't perfectly maintained, the CR and LF would not give enough time for the carriage to settle on column 1.

So we always punched RETURN, LINE FEED, and RUBOUT. The RUBOUT would be ignored, and it added a little extra time for the carriage to settle into place and not get a blurry character in the first column.

Here's a nice picture (although the discussion isn't completely on the mark):

https://www.reddit.com/r/MechanicalKeyboards/comments/2v3k0p...


More trivia: ASCII characters were designed to support overstrike (using BS) to generate more characters. For example, á can be written as a BS '. Acute, grave, and circumflex accents, and umlauts, can be written this way on lower-case letters. ñ is n BS ~. Cedilla (ç) is c BS comma. Underline is character BS underscore; strikethrough is similar, using dash or tilde. Bold is character BS same_character.

Many Compose key sequences are based on this.

So ASCII, in a way, may well be the first, or one of the first, variable-length codesets.

Note that overstriking upper-case letters to add diacritical marks does not work for ASCII (see more below).

Also, overstriking was how one typed diacritical marks back in the days of mechanical typewriters. Spanish used to (and may still? I forget) permit upper-case letters to not carry diacritical marks precisely because it was difficult or costly to get typewriters and printers to print such letters. Adding diacritical marks to upper-case letters requires fonts designed for that, and clearly an overstrike sequence cannot work unless the typewriter/printer holds printing the character until it knows whether the next character is BS.
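For anyone who wants to play with the overstrike idea, here is a rough Python sketch that turns "base BS mark" sequences into modern Unicode characters. The `COMBINING` table and the name `render_overstrikes` are my own assumptions for illustration, not part of any standard, and underline/bold overstrikes are deliberately left out.

```python
import unicodedata

BS = "\b"  # ASCII backspace, 0x08

# Hypothetical mapping from ASCII overstrike characters to Unicode
# combining marks, chosen for illustration only.
COMBINING = {
    "'": "\u0301",  # acute accent
    "`": "\u0300",  # grave accent
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
    '"': "\u0308",  # diaeresis / umlaut
    ",": "\u0327",  # cedilla
}


def render_overstrikes(text: str) -> str:
    """Replace each 'base, BS, mark' triple with a composed character."""
    out = []
    i = 0
    while i < len(text):
        is_triple = (
            i + 2 < len(text)
            and text[i + 1] == BS
            and text[i + 2] in COMBINING
        )
        if is_triple:
            out.append(text[i] + COMBINING[text[i + 2]])
            i += 3
        else:
            out.append(text[i])
            i += 1
    # NFC folds base + combining mark into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))


assert render_overstrikes("man\b~ana") == "ma\u00f1ana"
```

A plain comma or quote with no BS before it is left alone, so ordinary punctuation survives.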


This reminds me a little of Baudot, which was 5-bit but the shift character was stateful. So you had two character sets (nominally "letters" and "figures"). So sending "AA11" would be five symbols (A, A, FIGS, 1, 1), but sending "A1A1" would be seven (A, FIGS, 1, LTRS, A, FIGS, 1).

It entertains me how many of these issues were addressed long before we had computers.

(Aside: could I argue Morse was a variable-length encoding?)
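The Baudot symbol counts above can be reproduced with a small sketch. It assumes a deliberately simplified model: digits live only in the FIGS set, everything else in the LTRS set, and the line starts in LTRS. `baudot_symbol_count` is a made-up name.

```python
def baudot_symbol_count(message: str) -> int:
    """Count transmitted symbols, including LTRS/FIGS shift symbols.

    Simplified model: digits are in the FIGS set, all other characters
    are in the LTRS set, the line starts in LTRS, and each change of
    set costs one extra shift symbol.
    """
    state = "LTRS"
    count = 0
    for ch in message:
        needed = "FIGS" if ch.isdigit() else "LTRS"
        if needed != state:
            count += 1  # the shift symbol itself
            state = needed
        count += 1  # the character
    return count


assert baudot_symbol_count("AA11") == 5  # A, A, FIGS, 1, 1
assert baudot_symbol_count("A1A1") == 7  # A, FIGS, 1, LTRS, A, FIGS, 1
```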


_.___ . ...


What Morse Code character is "dah dit dah dah dah"? It's not one I recognize.

Ah, this must be what you meant, yes?

—•—— • •••


-.-.

I think some fonts are not great at rendering Morse code :)

There may be a dah dit dah dah dah, but I don't know it either. But then again I don't know the less common punctuation or Cyrillic Morse, and I think I heard once that there is even Kanji Morse.

...-.-


None of those things are ASCII.


ASCII also includes control characters for delimiting text records. If people used these purpose-designed characters instead of comma or tab characters as delimiters, we could avoid many headaches quoting and escaping CSV data.

https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_...

  ASCII 28 for File Separator
  ASCII 29 for Group Separator
  ASCII 30 for Record Separator
  ASCII 31 for Unit Separator
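A minimal sketch of what that could look like in Python, using Unit Separator between fields and Record Separator between records. The `encode`/`decode` names are hypothetical, and the sketch assumes no field ever contains one of the separator characters itself.

```python
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"  # ASCII 28, 29, 30, 31


def encode(records):
    """Join fields with US and records with RS (no quoting or escaping)."""
    return RS.join(US.join(fields) for fields in records)


def decode(blob):
    """Inverse of encode, under the same no-separators-in-data assumption."""
    return [record.split(US) for record in blob.split(RS)]


rows = [["Smith, John", 'said "hi"'], ["two\nlines", "1,5"]]
assert decode(encode(rows)) == rows  # commas, quotes, newlines pass through
```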


This is a terrible idea! You're not avoiding anything, you would still have to escape these characters in CSV because it's entirely possible (in fact, almost certain) that someone will want to use the character itself in a CSV file.

Besides, it's arguably a GOOD thing that the delimiter in CSV (the comma) is such a common character, because it forces all parsers to properly support escaping and quoting. If it used some sort of unique character that was almost never used except in CSV, then 99.99% of the time it would work without correct escaping, but the 0.01% of the time when someone entered "you should use the character X to separate fields" as a column it would fail, and those cases are much more likely to slip through the cracks.

Also: the whole point of CSV is to be human readable. There's no obvious way to render "Record separator" on screen, and the format would essentially become a binary format.


> it's entirely possible (in fact, almost certain) that someone will want to use the character itself in a CSV file.

Our business has parsed and emitted significant volumes of data and has never encountered one of these code points in the wild.

My understanding of why not to use these code points is that they don't have printable glyphs and thus don't easily lend themselves to ad hoc data exploration tools.


Unfortunately the comma is the decimal separator in some locales but the thousands separator in others. As the US doesn’t really use commas in floating point numbers, escaping these is often an afterthought, causing issues that only surface abroad. Another reason why CSV is a terrible format.


Quoting/escaping would still be necessary, but the need would be less obvious. It would be deferred from a common annoyance at development time to a common cause of nasty surprises in production.
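To make the "still necessary" point concrete, here is a sketch using Python's standard `csv` module with Unit Separator (ASCII 31) as the delimiter. `write_row` and `read_row` are hypothetical helper names. The writer still has to quote a field that contains the delimiter, and a naive split on the delimiter gets the wrong answer.

```python
import csv
import io

US = "\x1f"  # ASCII 31, Unit Separator


def write_row(fields):
    """Serialize one row with US as the delimiter, quoting when needed."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=US).writerow(fields)
    return buf.getvalue()


def read_row(line):
    """Parse one row written by write_row."""
    return next(csv.reader(io.StringIO(line, newline=""), delimiter=US))


row = ["use \x1f to separate fields", "b"]
line = write_row(row)

assert read_row(line) == row  # a quoting-aware parser round-trips
assert len(line.rstrip("\r\n").split(US)) == 3  # a naive split sees 3 fields
```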


And replace them with a whole new set of headaches, like difficult typing and copy-pasting.


Typing should be easy. Just type ctrl-\, ctrl-], ctrl-^, or ctrl-/ (which many terminals treat the same as ctrl-_) for codes 28 through 31. In the bash shell, you may need to type the quote character (ctrl-v) first, depending on your tty settings. Copy and paste worked fine for me in lxterminal and xterm.


Typing those in anywhere that’s not a shell is still a pain.


Isn't that mainly because these weren't the norm? For the vast majority of text editing software you do not type Ctrl+J for a new line after all.


Related, I love the site https://www.asciihex.com/ for giving some nice history into many of the ASCII characters. And stylin' to boot.

It's my go-to ascii table when I don't have the `ascii` command-line tool.


I didn't know of the `ascii` command-line tool, so thanks. What I've always tended to use when I need to look at an ascii table is `man 7 ascii`.

In retrospect, maybe I did know of the existence of a command called `ascii`. It's the reason why I needed to specify the section to `man`, since it would otherwise take me to ascii(1).


Try this Python when you are missing the `ascii` command: https://tinyurl.com/y6bn5q5p


Thanks, but, you know, I can just install it or continue using `man 7 ascii`


Also note that you get punctuation by making the shift key flip a single bit (0x10) of the digits, which is why 2 becomes “ rather than @; back in the day, “ was in fact shift-2 on computer keyboards.


Most modern keyboard layouts no longer map nicely to ASCII, except for one: Japanese. On the JIS keyboard layout, every single key* with an ASCII symbol flips a single bit when shifted (0x10 for codes < 0x40 and 0x20 for codes >= 0x40). It's called a bit-paired keyboard: https://en.wikipedia.org/wiki/Bit-paired_keyboard

Interestingly, shift-0 would logically be a space character, and indeed on the JIS layout there is nothing printed above 0 (actually pressing shift-0 doesn't give you a space on modern OSes, it just gives you a 0, but you could map it to a space if you were so inclined).

* except that ¥ is where \ would be in ASCII, because Japanese charsets replaced \ with ¥. On the JIS layout, the \ label is next to _ (0x5f), which means it's filling the empty spot of 0x7f, which is otherwise a nonprintable DEL.
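The bit-pairing rule described above fits in a couple of lines. This sketch applies the stated rule (flip 0x10 below 0x40, flip 0x20 at or above 0x40) to any ASCII character; `bit_paired_shift` is a made-up name, and real keyboards only apply it to keys that actually exist.

```python
def bit_paired_shift(ch: str) -> str:
    """Return the shifted character under the bit-pairing rule."""
    code = ord(ch)
    mask = 0x10 if code < 0x40 else 0x20
    return chr(code ^ mask)


assert bit_paired_shift("1") == "!"
assert bit_paired_shift("2") == '"'
assert bit_paired_shift("0") == " "     # why nothing is printed above 0
assert bit_paired_shift(";") == "+"
assert bit_paired_shift("a") == "A"
assert bit_paired_shift("_") == "\x7f"  # the otherwise empty DEL spot
```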


> back in the day, “ was in fact shift-2 on computer keyboards.

It’s still located there on many international keyboard layouts:

https://en.wikipedia.org/wiki/QWERTY#International_variants


I always wondered why A and a start one further than the beginning of their column (with @ and ` before them), any ideas?


I guess that is an artifact of 5-bit encoding (invented in the 19th century), as all zeros define NULL, see [0].

- [0] https://en.wikipedia.org/wiki/Baudot_code



The table is a nice way to view ASCII (maybe better transposed?), but the discussion on CTRL-[ for ESC can't be quite right, since then CTRL-; would work too, no?


The way I picture it: see the caps set (10) as the default. From there you can shift to the lower-case (11) set, figure (01) set, or control (00) set. But then there's no such thing as CTRL+; - you're switching to two different sets (figures and control) simultaneously. Your keyboard and OS may (should) be able to express that, but ASCII can't.
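The set/column picture above amounts to splitting each 7-bit code into its top two bits and its low five bits. A small sketch, with `column_and_row` as a hypothetical name:

```python
def column_and_row(ch: str) -> tuple[int, int]:
    """Split a 7-bit ASCII code into (top two bits, low five bits)."""
    code = ord(ch)
    return code >> 5, code & 0x1F


# Columns: 0b00 control, 0b01 figures, 0b10 caps, 0b11 lower case.
assert column_and_row("A") == (0b10, 1)
assert column_and_row("a") == (0b11, 1)
assert column_and_row("@") == (0b10, 0)      # row 0 of the caps column
assert column_and_row("[") == (0b10, 27)
assert column_and_row("\x1b") == (0b00, 27)  # ESC shares a row with [
assert column_and_row(";") == (0b01, 27)     # ; is in the figures column
```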


It appears that it would depend on the keyboard.

Some did a subtraction of the code and others did a bitwise AND.

Source: https://en.wikipedia.org/wiki/Control_character#How_control_...


Why's that? In my terminal emulator, CTRL-; just outputs a semicolon. Are there any cases where CTRL plus a character in the first or fourth columns has the same kind of behavior?

It seems to me that it's only the elements in the third column that can be used to generate elements of the first column. This is consistent with the description in the article, and still explains why ESC is represented by ^[.


Specifically I'm questioning this bit:

> Pressing CTRL simply sets all bits but the last 5 to zero in the character that you typed. You can imagine it as a bitwise AND.

If that were the case then CTRL-(any column) would result in the first column.

Edit: I suppose at this point terminal emulators are just hardcoding in CTRL modifiers rather than truly emulating what a hardware terminal would have done.


I vaguely remember using CTRL-7 to get the ASCII BEL character (0x07). This was a long time ago, and I don't see it on any current systems.

Another special case: On most keyboards (I suppose it's the terminal emulator that implements it), CTRL-SPACE emits the NUL character. (I find that useful because I use C-@ as the prefix key in tmux.)


The article says:

> Pressing CTRL simply sets all bits but the last 5 to zero in the character that you typed. You can imagine it as a bitwise AND.

But it can't be quite that simple, or Ctrl-[, Ctrl-;, and Ctrl-{ would all have the same effect. There must (as you said, and contrary to the article) be additional logic that only zeroes the leading bits if they are "10".


Ahh, I understand. I must have scanned right over that part.

As the sibling says, it’s probably the case that terminal emulators have this historic behavior for specific combinations, and not for the more general mechanism of merely clearing the high bits of whatever symbol was generated. It’s probably a bit fallacious to test things in an emulator... perhaps the article is correct for the original systems?


CTRL would clear the 2⁶ bit. If any sufficiently primitive devices (electromechanical like the Teletype 33) had been made in the lower-case era, you'd probably have got CTRL-; → ; and CTRL-{ → ;
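The two readings of CTRL in this subthread are easy to compare side by side. This is only a sketch of the two models as described here (keep the low five bits versus clear just the 2⁶ bit); the function names are invented, and it says nothing about what any particular terminal actually did.

```python
def ctrl_keep_low5(ch: str) -> int:
    """The article's model: zero everything except the last five bits."""
    return ord(ch) & 0x1F


def ctrl_clear_bit6(ch: str) -> int:
    """The alternative model: clear only the 2**6 (0x40) bit."""
    return ord(ch) & ~0x40 & 0x7F


ESC = 0x1B

# Under the five-bit mask, all three keys collapse to ESC.
assert [ctrl_keep_low5(c) for c in "[;{"] == [ESC, ESC, ESC]

# Clearing only 0x40 gives ESC for [ but ; for both ; and {.
assert [ctrl_clear_bit6(c) for c in "[;{"] == [ESC, ord(";"), ord(";")]
```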


Well this made my day. I just noticed esr's original piece was updated to include it in four columns too - so perhaps I'm not nuts.


I guess this is always going to be news to someone, but it's getting quite annoying to read about that every like 5 years or so.


Pish! The number of programmers supposedly doubles every 5 years (according to Bob Martin). So there will _always_ be people who don't know as much as you and every 5 years people will hear it again for the first time. I'm thrilled to hear the young folk are as fascinated with this stuff as I am.


I decided to toss it up again because I came across a small 4-col chart I had while cleaning house, and was reminded not only of the simplicity, but my original surprise when I saw it on HN a couple years ago...I figured at least a few more people hadn't seen it yet and could appreciate it.



