I think an open-source exemption would be acceptable (if done properly; another comment mentions a possible problem), even though I think it would be preferable not to have such age-verification bills at all. At least, adding the exemption would be second best, which would be better than having the age-verification bills without an open-source exemption.

(As another comment says, it is still not good, but at least it is something.)


This is the "lesser evil" trick politicians use to silence the opposition.

> Unfortunately no, Unicode is not simply a mapping of bytes to characters. It is a mapping of numbers to code points, and in some cases you can even get the same characters with multiple code point sequences (not a very good mapping!).

It is worse than that; you can also get different characters from the same code points, as well as code points and characters that Unicode treats as the same but that should be different in some uses, and different code points and characters that should be treated as the same in some uses, etc.
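
As an illustration of the "same characters with multiple code point sequences" point, here is a minimal Python sketch (the specific character is just one example): a precomposed and a decomposed form of "é" compare unequal as code point sequences and encode to different bytes, even though they are canonically equivalent.

    import unicodedata

    # "é" as one code point (U+00E9) vs. as "e" plus a combining acute
    # accent (U+0065 U+0301); both display as the same character.
    precomposed = "\u00e9"
    decomposed = "e\u0301"

    print(precomposed == decomposed)   # False: different code point sequences
    print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))   # different bytes too
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True: canonically equivalent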


> A file isn't meaningful unless you know how to interpret it; that will always be true.

There are multiple levels of meaning, though; character encoding is just one part of it. For example, a text file might be plain text, or HTML, or JSON, or a C source code, etc; a binary file might be DER, or IFF, or ZIP, etc; and then there will be e.g. what kind of data a JSON or DER or IFF contains and how that level of the data is interpreted, etc.

> Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters.

Whether or not they are identical to ASCII characters depends on the character set and on other things, such as what they are being used for; the definition of "identical" is not as simple as you make it seem. Unicode defines them as not identical, which is appropriate for some uses but is wrong for other uses. (Unicode also defines some characters as identical even though in some uses it would be more appropriate to treat them as not identical, too. So, Unicode gets it wrong in both directions.)
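
To make the Unicode side of that concrete, here is a small Python sketch (the characters are just one illustration): Greek capital Alpha looks like Latin "A", but Unicode gives it a separate code point, and it takes two bytes in UTF-8.

    import unicodedata

    latin_a = "A"           # U+0041 LATIN CAPITAL LETTER A
    greek_alpha = "\u0391"  # U+0391 GREEK CAPITAL LETTER ALPHA

    print(latin_a == greek_alpha)            # False: distinct code points
    print(len(latin_a.encode("utf-8")))      # 1 byte
    print(len(greek_alpha.encode("utf-8")))  # 2 bytes
    print(unicodedata.name(greek_alpha))     # GREEK CAPITAL LETTER ALPHA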

> This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.

I agree with that (although I think UTF-8 should not be used for Japanese either), but it isn't because of which characters are considered "identical" or not. There are problems with Unicode in general regardless of which encoding you use.
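
To put rough numbers on the bloat being discussed, here is a quick Python check (the sample strings are arbitrary): kana and kanji take three bytes each in UTF-8 versus two in Shift-JIS or EUC-JP, and Cyrillic letters take two bytes in UTF-8 versus one in the single-byte encodings.

    japanese = "日本語"
    russian = "привет"

    for enc in ("utf-8", "shift_jis", "euc_jp"):
        print(enc, len(japanese.encode(enc)))   # utf-8: 9, shift_jis: 6, euc_jp: 6

    for enc in ("utf-8", "koi8_r", "iso8859_5"):
        print(enc, len(russian.encode(enc)))    # utf-8: 12, koi8_r: 6, iso8859_5: 6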


> ... (although I think UTF-8 should not be used for Japanese either) ...

The people putting up websites in Japanese disagree with you, it would seem. According to Wikipedia (in the Shift JIS article), as of March 2026 99% of websites in the .jp domain were in UTF-8, with only 1% being in Shift JIS.

Japan used to have two different encodings in common use, Shift JIS (usually used on Windows) and EUC-JP (more common on Unix servers). This resulted in characters being misinterpreted often enough that they coined the word mojibake to describe the phenomenon of text coming out completely garbled. These days, it seems Japanese website makers are more than happy to accept a slight inefficiency in encoding size, because what they gain from that is never having to see mojibake again.


If they are misinterpreted, it is because the character encoding is not declared properly.

I still sometimes see mojibake in Japanese web pages, but sometimes it works; if it works, it is because the character encoding is declared properly.

In my opinion, EUC-JP is generally a better encoding of JIS (especially in e.g. C source code, which should not use Shift-JIS, although EUC-JP is OK), but Shift-JIS does have some benefits in some circumstances (such as making a character grid with one byte per character cell; if using Shift-JIS for Pascal source code, then please use (* *) instead of { } for comments).
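
The reason Shift-JIS is a problem in C source code can be shown with a small Python check (the kanji here is just one well-known example): the trailing byte of a Shift-JIS character can be 0x5C, the ASCII backslash, so a compiler that is not Shift-JIS-aware sees a stray escape character; trailing bytes can also collide with { and }, which is why the (* *) comment syntax is safer for Pascal. EUC-JP keeps all bytes of a multibyte character above 0x7F, so this cannot happen there.

    kanji = "表"  # a character whose Shift-JIS trailing byte is 0x5C

    print(kanji.encode("shift_jis"))  # b'\x95\\'  -- the second byte is a backslash
    print(kanji.encode("euc_jp"))     # b'\xc9\xbd' -- both bytes are above 0x7F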


> If they are misinterpreted, it is because the character encoding is not declared properly.

OR because the software is buggy, or making assumptions about encoding and not checking them (which also counts as "buggy", of course). You can declare the encoding all you like; it won't protect you against the stupid decisions that other people make in writing their software. (See Excel, for example.)

Yes, if you declare your encoding properly, things should work. Most of the time. And if you're using any encoding that is not the worldwide default (which these days is UTF-8), then you definitely should declare the encoding. But you'll still occasionally hit badly-written software that doesn't even think about other encodings and doesn't handle them properly. The only defense against that situation, where you declare your encoding properly and it still doesn't work, is to just use the encoding that the software was written to expect, which is almost certainly the worldwide default.


Another way that the character encoding could be declared is ISO 2022. When using ISO 2022, the declaration of UTF-8 is <1B 25 47>, rather than the <EF BB BF> byte order mark that XML and some other formats use.

However, whether you do it that way or another way, I think that the encoding declaration should not be omitted unless the text is purely ASCII, in which case it should be omitted.
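
Concretely, the two declarations are just short byte prefixes; here is a small Python sketch (the file name and the choice of prefix are only for illustration):

    import codecs

    iso2022_utf8 = b"\x1b%G"    # ESC % G: the ISO 2022 designation of UTF-8
    bom_utf8 = codecs.BOM_UTF8  # b'\xef\xbb\xbf': the prefix XML and others use

    with open("declared.txt", "wb") as f:
        f.write(bom_utf8 + "some text".encode("utf-8"))  # or iso2022_utf8 + ...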


Yes, I thought of what you mentioned too, and in my opinion, DER is a better format, and it is a binary format rather than text.

(In my ideas of an operating system design, there is a structured binary format (similar to DER but different) used for most files and data, so that the tools (and the command shell) would be usable consistently with most of them; and if some need special handling, you can use other programs and functions to convert them and/or handle them in a way that can be interoperable.)
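
To show the tag-length-value idea that makes DER (and similar formats) convenient for this, here is a minimal Python sketch; it only handles short-form lengths, is not a full DER encoder, and the example values are arbitrary.

    # Minimal DER-style tag-length-value encoding (short-form lengths only).
    def tlv(tag, content):
        assert len(content) < 128, "long-form lengths not handled in this sketch"
        return bytes([tag, len(content)]) + content

    integer = tlv(0x02, b"\x05")               # INTEGER 5
    string = tlv(0x0C, "abc".encode("utf-8"))  # UTF8String "abc"
    record = tlv(0x30, integer + string)       # SEQUENCE { 5, "abc" }

    print(record.hex())  # 30080201050c03616263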


I agree with you that Unicode is too complicated and messy, although it also shows that whether or not something is considered "plain" is itself a difficult question.

Unicode has caused many problems (although it was common for m17n and i18n to be not working well before Unicode either). One problem is making some programs no longer 8-bit clean.

Unicode might be considered in two ways: (1) Unicode is an approximation of multiple other character sets, (2) All character sets are an encoding of a subset of Unicode. At best, if Unicode is used at all, it should be used as (1) (as a last resort), but it is too common for Unicode to be used as (2) (as a first resort), which is not good in my opinion.

(I mostly avoid Unicode in my software, although it is also often the case (and, in many but not all programs, should be the case) that a program only cares about ASCII but does not prevent you from using any other character encoding that is compatible with ASCII.)

> ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.

Yes, it does work well (almost) everywhere.

Supersets of ASCII are also common, including UTF-8, the PC character set, ISO 2022 (if ASCII is the initially selected G0 set, which it is in the ASN.1 GraphicString and GeneralString types, as well as in most terminal emulators), EUC-JP, etc. In these cases, ASCII will also usually work well.
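
For instance, a pure-ASCII string encodes to exactly the same bytes under several of these supersets; a quick Python check (the list of encodings is just a sample):

    text = "Hello, world! 123"
    ascii_bytes = text.encode("ascii")

    # The same bytes come out of several ASCII-superset encodings.
    for enc in ("utf-8", "euc_jp", "cp437", "iso8859_1"):
        assert text.encode(enc) == ascii_bytes
    print("all equal")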

However, as another comment mentions (and I agree with them), if you mean "ASCII" then that is what you should say, rather than "plain text", which does not tell you what the character encoding is. That other comment says:

> Plain text is text intended to be interpreted as bytes that map simply to characters.

However, it is not always so clear and simple what a "character" is, depending on the character set and what language you are writing. And then there are also control characters to be considered, so it is again not quite so "plain".

> And yes, ASCII means mostly limiting things to English but for many environments that's almost expected. I would even defend this not being a native English speaker myself.

In my opinion, it depends on the context and usage. One character set (regardless of which one it is) cannot be suitable for all purposes. However, for many purposes, ASCII is suitable (including C source code; you might put l10n in a separate file).

You should have proper m17n (in the contexts where it is appropriate, which is not necessarily all files), but Unicode is not a good way to do it.


Not all character sets encode subsets of Unicode.

Two well-known counterexamples that come immediately to mind:

1. Mac OS Roman includes a non-Unicode Apple logo.

2. The Atari ST character set includes two non-Unicode characters that combine to create an Atari logo, and 4 non-Unicode characters that combine to create a picture of J.R. "Bob" Dobbs [1].

[1] https://en.wikipedia.org/wiki/J._R._%22Bob%22_Dobbs
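
The Mac OS Roman case is even visible from a standard codec table; a small Python check (the mapping to a Private Use Area code point is the conventional Apple one, precisely because there is no real Unicode code point for the logo):

    apple_logo_byte = b"\xf0"  # the Apple logo in Mac OS Roman

    decoded = apple_logo_byte.decode("mac_roman")
    print(hex(ord(decoded)))   # 0xf8ff: a Private Use Area code point, not a real Unicode character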


Yes, I know that (I was aware of both of these cases, as well as others). In those cases, there are characters that do not correspond to any Unicode characters.

Nevertheless, what I was saying is that many programs seem to be designed as though other character sets do encode subsets of Unicode, but actually they are different character sets and are not Unicode.

However, what I meant was, in addition to things like the examples that you gave, other less obvious cases. Even if characters do correspond to Unicode (and sometimes there is more than one way to do it, which is the case for several PC characters), they are not necessarily supposed to work in the same way.
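
One example of a PC character with more than one plausible Unicode mapping, as a small Python check (which mapping is "right" depends on the use): code page 437 byte 0xE1 is used both as German sharp s and as Greek small beta, and the standard codec has to pick one of them.

    pc_byte = b"\xe1"  # in code page 437 this glyph serves as both ß and β

    print(pc_byte.decode("cp437"))  # 'ß' (U+00DF): the one mapping the codec chose
    # Greek small beta (U+03B2) therefore cannot be encoded back to CP437 at all:
    print("\u03b2".encode("cp437", errors="replace"))  # b'?'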

At best, Unicode could be used as an approximation if the other character sets cannot be used, although sometimes there are other ways to do it.


There are several reasons. One possible reason is that, if you do not need the functions of other operating systems, then DOS will be much simpler.

The computer is not very usable without an operating system. I think it would be reasonable for the computer to have Forth or BASIC or something like that in ROM, like many older computers did, so that the computer is usable without an operating system (but you could also install an operating system if you wanted one).

I have been told that ITU specifications are deliberately confusing so that they can sell consulting services.

However, I think DER is good (and is better than BER, PER, etc in my opinion). (I did make up a variant with a few additional types, though.)

OID is also a good idea, although I had thought they should add another arc based on various kinds of other identifiers (telephone numbers, domain names, etc.), together with a date for which that identifier is valid (to avoid issues with reassigned identifiers), as well as the possibility of automatic delegation for some types (so that, e.g., if you register an account on another system, you can get a free OID from it too; there is a bit of difficulty in some cases, but it might be possible). (I have written a file about how to do this, although I have not published it yet.)


Some users disable favicons; I am one of them (although that is mainly because I do not use them, rather than due to that).
