Ad-hoc data formats like JSON and XML are too insecure for the modern world, so ...

minhmeoke · on Jan 2, 2022

An alternative approach might be to use an existing popular serialization format such as Protocol Buffers, Apache Thrift, or Cap'N'Proto and create or improve tools that convert to/from human-readable text formats to the serialized binary format.

For example:

- Protocol buffers have a text format mode: https://medium.com/@nathantnorth/protocol-buffers-text-forma...

- Thrift has readable-thrift which is a human-friendly encoder and decoder: https://github.com/nccgroup/readable-thrift

- Cap'N'Proto has a `capnp` tool for encoding/decoding text representations to binary which seems to be officially supported and documented! https://capnproto.org/capnp-tool.html

These libraries have been battle-tested by major companies in production, some protocols and implementations have gone through security audits, and in addition each of these formats already has many language bindings, for example:

- Protocol Buffer Third-party language bindings: https://github.com/protocolbuffers/protobuf/blob/master/docs...

- Apache Thrift language support: https://github.com/apache/thrift/blob/master/LANGUAGES.md

- Cap'N'Proto in other languages: https://capnproto.org/otherlang.html

kstenerud · on Jan 2, 2022

Yes, I had a look at these formats before embarking on my venture. I listed the things I found important in the comparison matrix: https://github.com/kstenerud/concise-encoding#-compared-to-o...

To your points:

- Protobufs is not an ad-hoc format, which is a big reason why low-friction formats like JSON are popular. There are many use cases where formats like protobufs are clearly the superior choice, but CE doesn't target those. This is a fundamental trade-off so you can't have both.

- readable-thrift is a diagnostic tool. You wouldn't want to be inputting data like that. I want the text format to be fully usable by non-technical people, like JSON is.

- the capn proto tool page doesn't seem to document how the text format works (or at least I couldn't find any examples). It looks more like a diagnostic tool, not a first-class citizen.

I felt that there were enough pain points, missing types, and missing security features (for example versioning) to warrant a fresh start.

kentonv · on Jan 2, 2022

> the capn proto tool page doesn't seem to document how the text format works (or at least I couldn't find any examples). It looks more like a diagnostic tool, not a first-class citizen.

Cap'n Proto's text format works pretty much exactly like Protobuf's. You can use it in all the same ways.

minhmeoke · on Jan 2, 2022

Thank you for the clarification! Perhaps a better example would have been Apache Avro (still has a schema, though): https://en.wikipedia.org/wiki/Apache_Avro

This looks like a very ambitious project, and I can see that you've put a lot of thought, time, and effort into it! You clearly have a lot of interesting ideas (the graph idea is really cool) and significant experience with data formats.

If this is a security-oriented application, then with cyclic data structures there is the risk of blowing out your server's memory using something like a fork bomb when processing untrusted user input (https://en.wikipedia.org/wiki/Fork_bomb).

There are some systems like DHall that guarantee termination by putting upper bounds on computation: https://dhall-lang.org/

I'm also a bit concerned with how the different features can interact, for example it's not super clear how to distinguish between UTC offset (-130, or do these always have to be 4 digits?) and global coordinates (-130/-172). An attacker could specify a comment inside the media type (eg: application/* which would require special logic to filter out).

My concern is that the parser will become extremely complicated and require a lot of special-case logic and validation (eg: there must be at least one digit on each side of a radix point) which is more prone to errors and unexpected behaviors.

Rather than using slash delimiters, I'd recommend splitting the time formats into subfields, eg: { date: "2022-01-01" time: "21:14:10" offset_is_negative: true offset: "10:30" }

This does make the text format more verbose, but it reduces ambiguity and makes the parsing faster as well since you don't need to descend into branches and backtrack when they don't match, and also might permit more code/logic reuse.

It's also not clear how easy it is to add new data types to the grammar. Based on the project description, it seems like you're using ANTLDR parser.

Since you seem to be quite interested in parsing, you might also be interested in parser combinators which are a somewhat different approach with different tradeoffs: - https://softwareengineering.stackexchange.com/questions/3386... - https://fsharpforfunandprofit.com/posts/understanding-parser...

kstenerud · on Jan 2, 2022

Yes, I had a look at avro as well. I've been following all of the established and nascant formats over a number of years, hoping for one that addresses my concerns, but unfortunately nothing emerged. My ambitions are actually at a much higher level; this is just to set a solid foundation for them.

Cyclic bombs are but one security concern... There are actually a LOT of them, which I try to cover cover in the security section ( https://github.com/kstenerud/concise-encoding/blob/master/ce... ). The security space is of course wider and more nuanced than this, but I didn't want to turn it into an entire tome so I tried to cover the basic philosophical problems. At the end of the day, you must treat data crossing boundaries as hostile, and build your ingestors with that in mind. Sane defaults can avoid the worst of them (and CE actually REQUIRES sane defaults for a lot of things in order to be compliant), but no format can protect you completely. A "fork bomb" using cyclic data is unlikely, unless your application code is really naive (if you're using cyclic data, you need to have a well-reasoned purpose for it, and are likely just using pointers internally - which won't blow out your memory unless you're doing something foolish when processing the resulting structs). Actually, this does give me an idea... make cyclic data disallowed by default, just to cover the common case where people don't use it and don't even want to think about it.

Re time formats: global coordinates will always start with a slash, so 12:00:00/-130/-172. UTC offsets will always start with + or -, and be 4 digits long, so 12:00:00+0130 or 12:00:00-0130.

The validation rules are very specific, and that does complicate the text format a bit, but this drives to the central purpose of it: The text format is for the USER, and is not what you send to other machines or foreign systems. It's for a user to edit or inspect or otherwise interact with the data on the RARE occasions where that is necessary. So the text format doesn't need to be fast or efficient, only unambiguous and easy for a human to read. You certainly shouldn't open an internet connected service that accepts the text format as input (except maybe during development and debugging...) In fact, I would expect a number of CE implementations (such as for embedded systems) to only include CBE support, since you could just use a standalone command-line tool or the like to analyze the data in most cases.

Re: subfields. That would make it harder for a human to read. The text format sacrifices some efficiency for human friendliness and better UX. Parser logic re-use isn't really a priority (other than making sure it's not OBVIOUSLY bad for the parser), because text parsing/encoding is supposed to be the 0.0001% use case.

It's not super easy to add new types to the text format grammar, but that's fine because human friendliness trumps almost all, and adding new types should be done with EXTREME CARE. I've lost count of all the types I've added and then scrapped over the years. It's really hard to come up with these AND justify them!

The ANTLR grammar is actually more of a documentation thing. I've verified it in a toy parser but it's not actually tied to the reference implementation (yet). The reference implementation currently is similar to a parser combinator, with a lot of inspiration from the golang team's JSON parser (I watched a talk by the guy some time ago and was impressed). But at the same time I'm starting to wonder if it might have been better to implement the reference implementation as just an ANTLR parser after all... leave the optimizations and ensuing complications to other implementations and keep the reference implementation readable and understandable. The binary format code is super simple, and about 1/3 the size of the text format code. The major downside of ANTLR of course is the terrible error reporting.

minhmeoke · on Jan 3, 2022

Thank you for the detailed and comprehensive explanations!

> There are actually a LOT of [security concerns], which I try to cover in the security section

If you'd like to eventually harden the binary implementations, you might also be interested in coverage-guided fuzz testing which feeds random garbage data to a method to try and find errors in it: https://llvm.org/docs/LibFuzzer.html

as well as maybe some kind of optional checksum or digital signature to ensure that the payload has not been tampered with (although perhaps this should be performed in another higher layer of the stack).

> make cyclic data disallowed by default, just to cover the common case where people don't use it and don't even want to think about it.

Yes, I think that making it an option which is restrictive (safe) by default would be a great idea. Or perhaps separating out the more dynamic types (eg: graphs, markup, binary data) to be loadable modules could also reduce the default attack surface area.

> You certainly shouldn't open an internet connected service that accepts the text format as input (except maybe during development and debugging...)

Yes, I fully agree with this! I initially assumed that the text format could be sent from an untrusted client similar to JSON and XML, but this makes more sense.

> because text parsing/encoding is supposed to be the 0.0001% use case

I see, so the main use case of the CTE text format is rapid prototyping, and then the user should convert to the CBE binary format in production?

> It's not super easy to add new types to the text format grammar

Customizable types could be a really great way to differentiate from other serialization protocols. I did notice that the system allows the user to define custom structs which is quite useful.

Another approach would be to embed the grammar and parser into an existing language like Python, Rust, or Haskell, and let the user define their own custom types in that language. In my experience, custom types help prevent a lot of errors (eg: for a fitness tracker IoT application, you could define separate types for ip_v4 address, duration_milliseconds, temperature_celsius, heart_rate_beats_per_minute, blood_pressure_mm_hg for systolic and diastolic blood pressure rather than using just floating point or fixed-point numbers, and this could prevent many potential unit conversion and incorrect variable use errors at compile-time). Or you could better model your domain with custom types (eg: reuse the global coordinate datastructure from the timezones implementation to create path or polygon types using repeated coordinates).

> adding new types should be done with EXTREME CARE

maybe it would make sense to create a small set of core types (kind of like a standard library), and then permit extensions via user-defined types which must be whitelisted? But pursuing that route could end up addressing a very different niche (favoring a stricter schema) in the design space.

> The major downside of ANTLR of course is the terrible error reporting.

This is a major advantage of the parser combinator approach, in that it is possible to design them to emit very helpful and context-aware error messages, for example look at the examples at the end of: https://www.quanttec.com/fparsec/users-guide/customizing-err...

Anyway, hope this was useful and I wish you good luck with your project!

kstenerud · on Jan 4, 2022

> If you'd like to eventually harden the binary implementations, you might also be interested in coverage-guided fuzz testing which feeds random garbage data to a method to try and find errors in it: https://llvm.org/docs/LibFuzzer.html

Yes, I plan to fuzz the hell out of the reference implementation once it's done. So much to do, so little time...

> I see, so the main use case of the CTE text format is rapid prototyping, and then the user should convert to the CBE binary format in production?

CTE would be for prototyping, initial data loads, debugging, auditing, logging, visualizing, possibly even for configuration (since the config would be local and not sourced from unknown origin). Basically: CBE when data passes from machine to machine, and CTE only where a human needs to get involved.

> Another approach would be to embed the grammar and parser into an existing language like Python, Rust, or Haskell, and let the user define their own custom types in that language.

I demonstrate this in the reference implementation by adding cplx() type support for go as a custom type. Then people are free to come up with their own encodings for their custom needs (one could specify in the schema how to decode them). I think there's enough there as-is to support most custom needs.

> maybe it would make sense to create a small set of core types (kind of like a standard library), and then permit extensions via user-defined types which must be whitelisted?

I thought about that, but the complexity grows fast, and then you have a constellation of "conformant" codecs that have different levels of support, which means you can now only count on the minimal set of required types and the rest are useless. The fewer optional parts, the better.

travisjungroth · on Jan 1, 2022

EDN has some really good ideas in it. Here's the main spec: https://github.com/edn-format/edn

The Learn X in Y Minutes: https://learnxinyminutes.com/docs/edn/

A related talk by Rich Hickey that I think you'd find interesting: https://www.youtube.com/watch?v=ROor6_NGIWU

For a schema, I'd start with what CUE has done. The idea of types that constrain down as a lattice + a separate default path really resonates with me. https://cuelang.org/

goodpoint · on Jan 1, 2022

Does it support IDL and zero-copy access? That's a must for safe and fast parsing and general ease of use.

kstenerud · on Jan 2, 2022

Zero-copy access is supported for primitive and array types (int & float arrays, string types) provided the array was sent as a single chunk (multi-chunk is an exceptional case). "structs" cannot be zero-copy in an ad-hoc format (if you need that, something like protobufs is a better choice).

IDL would be a level higher than the encoding layer, so yes you could use this as the encoding layer for an IDL construct.

lmilcin · on Jan 1, 2022

Have you seen ASN.1?

kstenerud · on Jan 1, 2022

Yes, it's included in the comparison matrix: https://github.com/kstenerud/concise-encoding#-compared-to-o...

no_circuit · on Jan 2, 2022

Local protobuf user here. Appreciate seeing a comparison chart. :-) It's unfortunate that it isn't documented very well, but Protobuf does have a text format [1] which I've used a lot, usually when writing test cases, but also when inspecting logs. Similar to the CBE encoder spec [2], it does use variable length encoding for ints [3] and preserves the type information. Another efficiency item to compare against different message types is the implementation itself, e.g. memory arenas out of the box. [4]

Regarding CE, what would be the use case? APIs, data at rest, inter-service communications? If data at rest meant for analysis, then there probably are a handful more formats to compare against.

If one doesn't wish to decode the whole message into memory to read it, FlatBuffers [5] can be checked out which is also supported as a message type in gRPC. It is similar to what is used in some trading systems. There is also a FlexBuffers variation if you'd want something closer to JSON/BSON.

Must say however, I found it cool that you have some Mac/iOS GitHub repos. Definitely going take some time to check them out -- I used to develop iOS apps.

[1] https://developers.google.com/protocol-buffers/docs/referenc...

[2] https://github.com/kstenerud/concise-encoding/blob/master/cb...

[3] https://developers.google.com/protocol-buffers/docs/proto3#s...

[4] https://developers.google.com/protocol-buffers/docs/referenc...

[5] https://google.github.io/flatbuffers/flatbuffers_white_paper...

kstenerud · on Jan 2, 2022

CE's primary focuses beyond security are ease-of-use and low-friction, which is what made JSON ubiquitous:

- Simple to understand and use, even by non-technical people (the text format, I mean).

- Low friction: no extra compilation / code generation steps or special tools or descriptor files needed.

- Ad-hoc: no requirement to fully define your data types up front. Schema or schemaless is your choice (people often avoid schemas until they become absolutely necessary).

Other formats support features like partial reads, zero-copy structs, random access, finite-time decoding/encoding, etc. And those are awesome, but I'd consider them specialized applications with trade-offs that only an experienced person can evaluate (and absolutely SHOULD evaluate).

CE is more of a general purpose tool that can be added to a project to solve the majority of data storage or transmission issues quickly and efficiently with low friction, and then possibly swapped out for a more specialized tool later if the need arises. "First, reach for CE. Then, reach for XYZ once you actually need it."

This is a partially-solved problem, but the existing solutions are security holes due to under-specification (causing codec behavior variance), missing types (requiring custom secondary - and usually buggy - codecs), and lack of versioning (so the formats can't be updated). And security is fast becoming the dominant issue nowadays.

memling · on Jan 3, 2022

An interesting project!

Regarding some of the ASN.1 comparison characteristics, I'm not quite sure if I understand--there's a lot to read here, and it's likely I've missed something by a lack of acquaintance with your documents/specifications. But a couple comments:

- Cyclic data: ASN.1 supports recursive data structures.[0]

- Time zones: ASN.1 supports ISO 8601 time types, including specification of local or UTC time.[1] I'm not sure how else you might manage this, but perhaps it's not what you mean?

- Bin + txt: Again, I'm unclear on what you mean here, but ASN.1 has both binary and text-based encodings (X.693 for XML encoding rules[2], X.697 for JSON[3], and an RFC for generic string encoding rules[4]; compilers support input and output).

- Versioned: Also a little unclear to me--it seems like the intent is to capture the version of data sent across the wire relative to the schema used in its creation or else that it ties the encoding to the notation/encoding specification. ASN.1 supports extensibility (the ellipsis marker, ...[5]) and versioning,[6] but AFAIK there's nothing that forces a DER-encoded document to describe whether it's from the first release or the newest. Relative to security, it also supports various canonical encodings.

[0]: https://www.obj-sys.com/asn1tutorial/node19.html and X.680 3.8.61.

[1]: https://www.itu.int/rec/T-REC-X.680-X.693-202102-I/en (see X.680 §38 and Annex J.2.11)

[2]: https://www.itu.int/rec/T-REC-X.693/en -- X.694 governs interoperability between XSD and ASN.1 schema.

[3]: https://www.itu.int/rec/T-REC-X.697/en

[4]: https://datatracker.ietf.org/doc/rfc3641/

[5]: See X.680 3.8.41

[6]: See X.680 §3.8.95

kstenerud · on Jan 4, 2022

> - Cyclic data: ASN.1 supports recursive data structures.

Not sure if I missed something, but the link was talking about self-referential types, not self-referential data. For example (in CTE):

    &a:{
        "recursive link" = $a
    }

In the above example, `&a:` means mark the next object and give it symbolic identifier "a". `$a` means look up the reference to symbolic identifier "a". So this is a map whose "recusive link" key is a pointer to the map itself. How this data is represented internally by the receiver of such a document (a table, a dictionary, a struct, etc) is up to the implementation, but the intent is for a structure whose data points to itself.

> - Time zones: ASN.1 supports ISO 8601 time types, including specification of local or UTC time.

Yes, this is the major failing of ISO 8601: They don't have true time zones. It only uses UTC offsets, which are a bad idea for so many reasons. https://github.com/kstenerud/concise-encoding/blob/master/ce...

> - Bin + txt: Again, I'm unclear on what you mean here, but ASN.1 has both binary and text-based encodings

Ah cool, didn't know about those.

> - Versioned: Also a little unclear to me

The intent is to specify the exact document formatting that the decoder can expect. For example we could in theory decide to make CBE version 2 a bit-oriented format instead of byte-oriented in order to save space at the cost of processing time. It would be completely unreadable to a CBE 1 decoder, but since the document starts with 0x83 0x02 instead of 0x83 0x01, a CBE 1 decoder would say "I can't decode this" and a CBE 2 decoder would say "I can decode this".

With documents versioned to the spec, we can change even the fundamental structure of the format to deal with ANYTHING that might come up in future. Maybe a new security flaw in CBE 1 is discovered. Maybe a new data type becomes so popular that it would be crazy not to include it, etc. This avoids polluting the simpler encodings with deprecated types (see BSON) and bloating the format.