Fun fact: the delete character was designed to be all 1s in order to allow for human errors: "If a character was punched erroneously, punching out all seven bits caused this position to be ignored or deleted" [1]
When you were punching a paper tape, you couldn't "delete" a character, unless you wanted to risk cutting and splicing the tape, and good luck with not jamming the reader.
But by convention, a RUBOUT character (with all holes punched) would be ignored. So if you made a mistake you could backspace the tape and punch RUBOUT.
RUBOUT had another useful purpose. It's well known that the DOS/Windows line ending convention (CR/LF: Carriage Return/Line Feed) comes from the days of Teletypes, but what is less known is that we didn't actually just punch CR and LF. Sometimes on a Teletype that wasn't perfectly maintained, the CR and LF would not give enough time for the carriage to settle on column 1.
So we always punched RETURN, LINE FEED, and RUBOUT. The RUBOUT would be ignored, and it added a little extra time for the carriage to settle into place and not get a blurry character in the first column.
Here's a nice picture (although the discussion isn't completely on the mark):
More trivia: ASCII characters were designed to support overstrike (using BS, the backspace character) to generate more characters. For example, á can be written as a BS '. Acute, grave, and circumflex accents, and umlauts, can be written this way on lower-case letters. ñ is n BS ~. Cedilla (ç) is c BS comma. Underline is character BS underscore; strikethrough is similar, using dash or tilde. Bold is character BS same_character.
Many Compose key sequences are based on this.
So ASCII, in a way, may well be the first (or one of the first) variable-length codesets.
Note that overstriking upper-case letters to add diacritical marks does not work for ASCII (see more below).
Also, overstriking was how one typed diacritical marks back in the days of mechanical typewriters. Spanish used to (and may still? I forget) permit upper-case letters to not carry diacritical marks precisely because it was difficult or costly to get typewriters and printers to print such letters. Adding diacritical marks to upper-case letters requires fonts designed for that, and clearly an overstrike sequence cannot work unless the typewriter/printer holds printing the character until it knows whether the next character is BS.
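The overstrike convention described above is easy to play with (a minimal sketch; on a hardcopy terminal the second glyph really printed on top of the first, whereas modern terminal emulators just move the cursor back):

```python
# Overstrike sequence: base character, BS (backspace, 0x08), then the mark.
# On a hardcopy terminal the mark prints on top of the base character.
BS = "\b"

def overstrike(base: str, mark: str) -> str:
    """Compose an overstrike sequence: base char, backspace, mark."""
    return base + BS + mark

print(repr(overstrike("a", "'")))  # "a\x08'" -- rendered as á on hardcopy
print(repr(overstrike("n", "~")))  # "n\x08~" -- ñ
print(repr(overstrike("c", ",")))  # "c\x08," -- ç
print(repr(overstrike("x", "_")))  # underlined x
print(repr(overstrike("x", "x")))  # bold x (same character struck twice)
```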
This reminds me a little of Baudot, which was 5-bit but whose shift characters were stateful. You had two character sets (nominally "letters" and "figures"). So sending "AA11" would be five symbols (A, A, FIGS, 1, 1), but sending "A1A1" would be seven (A, FIGS, 1, LTRS, A, FIGS, 1).
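That stateful shifting can be sketched in a few lines (a toy model, not the real ITA2 code table; the FIGS/LTRS markers stand in for the actual 5-bit shift codes):

```python
# Toy Baudot-style encoder: two character sets, with stateful shift symbols.
LTRS, FIGS = "<LTRS>", "<FIGS>"

def encode(text):
    symbols = []
    mode = "letters"  # assume the transmitter starts in letters mode
    for ch in text:
        needed = "figures" if ch.isdigit() else "letters"
        if needed != mode:
            # Emit a shift symbol; it changes the set for all following chars.
            symbols.append(FIGS if needed == "figures" else LTRS)
            mode = needed
        symbols.append(ch)
    return symbols

print(len(encode("AA11")))  # 5 symbols: A, A, FIGS, 1, 1
print(len(encode("A1A1")))  # 7 symbols: A, FIGS, 1, LTRS, A, FIGS, 1
```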
It entertains me how many of these issues were addressed long before we had computers.
(Aside; could I argue morse was a variable-length encoding?)
I think some fonts are not great at rendering Morse code :)
There may be a dah dit dah dah dah, but I don't know it either. But then again I don't know the less common punctuation, or Cyrillic Morse, and I think I heard once that there is even Kanji Morse.
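As for whether Morse counts: it clearly does, in the coding-theory sense that more frequent letters get shorter codes. A toy illustration (these five codes are the standard International Morse assignments):

```python
# A small subset of International Morse: code length varies with letter
# frequency, which is the hallmark of a variable-length encoding.
MORSE = {"E": ".", "T": "-", "A": ".-", "Q": "--.-", "0": "-----"}

for symbol, code in MORSE.items():
    print(f"{symbol}: {code} ({len(code)} elements)")
```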
ASCII also includes control characters for delimiting text records. If people used these purpose-designed characters instead of comma or tab characters as delimiters, we could avoid many headaches quoting and escaping CSV data.
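For reference, those delimiters are FS, GS, RS, and US (0x1C through 0x1F). A minimal sketch of a "delimiter-separated values" format using the bottom two (hypothetical usage, not any standard tool's behavior):

```python
# ASCII's information separators, from coarsest to finest:
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"  # File/Group/Record/Unit

# Join fields with US and records with RS; commas and quotes in the data
# then need no escaping at all -- as long as the data itself can never
# contain US or RS.
rows = [["name", "note"], ["Smith, J.", 'said "hi"']]
blob = RS.join(US.join(fields) for fields in rows)

parsed = [record.split(US) for record in blob.split(RS)]
print(parsed == rows)  # True
```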
This is a terrible idea! You're not avoiding anything, you would still have to escape these characters in CSV because it's entirely possible (in fact, almost certain) that someone will want to use the character itself in a CSV file.
Besides, it's arguably a GOOD thing that the delimiter in CSV (the comma) is such a common character, because it forces all parsers to properly support escaping and quoting. If it used some sort of unique character that was almost never used except in CSV, then 99.9% of the time it would work without correct escaping, but the 0.1% of the time when someone entered "you should use the character X to separate fields" as a column it would fail, and those cases are much more likely to slip through the cracks.
Also: the whole point of CSV is to be human readable. There's no obvious way to render "Record separator" on screen, and the format would essentially become a binary format.
> it's entirely possible (in fact, almost certain) that someone will want to use the character itself in a CSV file.
Our business has parsed and emitted significant volumes of data and has never encountered one of these code points in the wild.
My understanding of why not to use these code points is that they don't have printable glyphs and thus don't easily lend themselves to ad hoc data exploration tools.
Unfortunately the comma is the decimal separator in some locales but the thousands separator in others. As the US doesn't really use commas in floating-point numbers, escaping these is often an afterthought, causing issues that only surface abroad. Another reason why CSV is a terrible format.
Quoting/escaping would still be necessary, but the need would be less obvious. It would be deferred from a common annoyance at development time to a common cause of nasty surprises in production.
Typing should be easy. Just type ctrl-], ctrl-/, ctrl-[ or ctrl-^. In the bash shell, you may need to type the quoted-insert character (ctrl-v) first, depending on your tty settings. Copy and paste worked fine for me in lxterminal and xterm.
I didn't know of the `ascii` command-line tool, so thanks. What I've always tended to use when I need to look at an ascii table is `man 7 ascii`.
In retrospect, maybe I did know of the existence of a command called `ascii`. It's the reason why I needed to specify the section to `man`, since it would otherwise take me to ascii(1).
Also note that you get punctuation by making the shift key flip the 0x10 bit of the digits, so that 2 becomes ". And back in the day, " was in fact shift-2 on computer keyboards.
Most modern keyboard layouts no longer map nicely to ASCII, except for one: Japanese. On the JIS keyboard layout, every single key* with an ASCII symbol flips a single bit when shifted (0x10 for codes < 0x40 and 0x20 for codes >= 0x40). It's called a bit-paired keyboard: https://en.wikipedia.org/wiki/Bit-paired_keyboard
Interestingly, shift-0 would logically be a space character, and indeed on the JIS layout there is nothing printed above 0 (actually pressing shift-0 doesn't give you a space on modern OSes, it just gives you a 0, but you could map it to a space if you were so inclined).
* except that ¥ is where \ would be in ASCII, because Japanese charsets replaced \ with ¥. On the JIS layout, the \ label is next to _ (0x5f), which means it's filling the empty spot of 0x7f, which is otherwise a nonprintable DEL.
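The single-bit rule is simple enough to state in code (a toy model of the bit-paired layout, ignoring the ¥/\ exception noted above):

```python
# Bit-paired shift: flip 0x10 for codes below 0x40, 0x20 for 0x40 and up.
def shifted(ch: str) -> str:
    code = ord(ch)
    return chr(code ^ (0x10 if code < 0x40 else 0x20))

print(shifted("2"))  # '"'  (0x32 ^ 0x10 = 0x22)
print(shifted("1"))  # '!'
print(shifted("a"))  # 'A'  (0x61 ^ 0x20 = 0x41)
print(shifted("["))  # '{'
```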
The table is a nice way to view ASCII (maybe better transposed?), but the discussion on CTRL-[ for ESC can't be quite right, since then CTRL-; would work too, no?
The way I picture it: see the caps set (10) as the default. From there you can shift to the lower-case (11) set, the figures (01) set, or the control (00) set. But then there's no such thing as CTRL+;, because you'd be switching to two different sets (figures and control) simultaneously. Your keyboard and OS may (and should) be able to express that, but ASCII can't.
Why's that? In my terminal emulator, CTRL-; just outputs a semicolon. Are there any cases where CTRL plus a character in the first or fourth columns has the same kind of behavior?
It seems to me that it's only the elements in the third column that can be used to generate elements of the first column. This is consistent with the description in the article, and still explains why ESC is represented by ^[.
> Pressing CTRL simply sets all bits but the last 5 to zero in the character that you typed. You can imagine it as a bitwise AND.
If that were the case then CTRL-(any column) would result in the first column.
Edit: I suppose at this point terminal emulators are just hardcoding in CTRL modifiers rather than the truly emulating what a hardware terminal would have done.
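Taking the quoted rule literally makes the problem obvious: masking with 0x1F collapses several distinct keys onto the same control code. A quick sketch:

```python
# The article's rule, taken literally: CTRL keeps only the low 5 bits.
def ctrl(ch: str) -> int:
    return ord(ch) & 0x1F

# All three of these land on 0x1B (ESC) -- yet real terminals treat only
# CTRL-[ that way, so something more than a plain AND must be going on.
print(hex(ctrl("[")))  # 0x1b
print(hex(ctrl(";")))  # 0x1b
print(hex(ctrl("{")))  # 0x1b
```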
I vaguely remember using CTRL-7 to get the ASCII BEL character (0x07). This was a long time ago, and I don't see it on any current systems.
Another special case: On most keyboards (I suppose it's the terminal emulator that implements it), CTRL-SPACE emits the NUL character. (I find that useful because I use C-@ as the prefix key in tmux.)
> Pressing CTRL simply sets all bits but the last 5 to zero in the character that you typed. You can imagine it as a bitwise AND.
But it can't be quite that simple, or Ctrl-[, Ctrl-;, and Ctrl-{ would all have the same effect. There must (as you said, and contrary to the article) be additional logic that only zeroes the leading bits if they are "10".
Ahh, I understand. I must have scanned right over that part.
As the sibling says, it's probably the case that terminal emulators have this historic behavior for specific combinations, rather than the more general mechanism of merely clearing the high bits of whatever symbol was generated. It's probably a bit fallacious to test things in an emulator... perhaps the article is correct for the original systems?
CTRL would clear the 2⁶ bit. If any sufficiently primitive devices (electromechanical like the Teletype 33) had been made in the lower-case era, you'd probably have got CTRL-; → ; and CTRL-{ → ;
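That model (clearing only the 2⁶ bit, i.e. 0x40) can be sketched like this; it reproduces the CTRL-; → ; and CTRL-{ → ; behavior described above:

```python
# Hypothetical electromechanical CTRL: clear bit 0x40, leave the rest alone.
def ctrl(ch: str) -> str:
    return chr(ord(ch) & ~0x40)

print(hex(ord(ctrl("["))))  # 0x1b -- ESC, as expected
print(hex(ord(ctrl("G"))))  # 0x7  -- BEL
print(ctrl(";"))            # ';'  -- bit 0x40 was already clear
print(ctrl("{"))            # ';'  -- 0x7b -> 0x3b
```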
Pish! The number of programmers supposedly doubles every 5 years (according to Bob Martin). So there will _always_ be people who don't know as much as you, and every 5 years people will hear it again for the first time. I'm thrilled to hear the young folk are as fascinated with this stuff as I am.
I decided to toss it up again because I came across a small 4-col chart I had while cleaning house, and was reminded not only of the simplicity, but my original surprise when I saw it on HN a couple years ago...I figured at least a few more people hadn't seen it yet and could appreciate it.
[1] https://en.wikipedia.org/wiki/Delete_character