Thank you for the detailed and comprehensive explanations!
> There are actually a LOT of [security concerns], which I try to cover in the security section
If you'd like to eventually harden the binary implementations, you might also be interested in coverage-guided fuzz testing which feeds random garbage data to a method to try and find errors in it: https://llvm.org/docs/LibFuzzer.html
An optional checksum or digital signature could also help ensure that the payload has not been tampered with (although perhaps this belongs in a higher layer of the stack).
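To make the tamper-check idea concrete, here is a minimal sketch in Python. It uses an HMAC-SHA256 tag, which is a shared-key message authentication code rather than a true digital signature, and it simply appends the tag to the payload. The names `seal`, `unseal`, and `TAG_LEN` are made up for illustration and are not part of CBE or CTE.

```python
import hashlib
import hmac

TAG_LEN = 32  # size of an HMAC-SHA256 tag in bytes


def seal(payload: bytes, key: bytes) -> bytes:
    """Return the payload with an HMAC-SHA256 tag appended (hypothetical framing)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def unseal(message: bytes, key: bytes) -> bytes:
    """Return the payload if the trailing tag verifies, otherwise raise ValueError."""
    if len(message) < TAG_LEN:
        raise ValueError("message too short to contain a tag")
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return payload
```

Doing this in a wrapper like the one above, outside the codec itself, is one way to keep it in a higher layer of the stack.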
> make cyclic data disallowed by default, just to cover the common case where people don't use it and don't even want to think about it.
Yes, I think that making it an option which is restrictive (safe) by default would be a great idea. Separating out the more dynamic types (e.g. graphs, markup, binary data) into loadable modules could also reduce the default attack surface.
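As a rough sketch of what "restrictive by default" could look like, here is a Python check over already-decoded nested lists and dicts. It rejects any container that contains itself unless the caller opts in. The names `check_document`, `allow_cycles`, and `CyclicDataError` are hypothetical and not taken from any real implementation.

```python
from typing import Any


class CyclicDataError(ValueError):
    """Raised when cyclic data is found while cycles are disabled."""


def check_document(value: Any, allow_cycles: bool = False) -> None:
    """Raise CyclicDataError if value contains a cycle and cycles are not allowed."""
    if allow_cycles:
        return  # the caller explicitly opted in
    _walk(value, set())


def _walk(value: Any, ancestors: set) -> None:
    # Only containers can take part in a cycle.
    if not isinstance(value, (list, dict)):
        return
    if id(value) in ancestors:
        raise CyclicDataError("cyclic reference found but cycles are disabled")
    ancestors.add(id(value))
    children = value.values() if isinstance(value, dict) else value
    for child in children:
        _walk(child, ancestors)
    # Remove on the way out so shared (non-cyclic) references stay legal.
    ancestors.remove(id(value))
```

Note that a value referenced from two places is still accepted; only a container that is its own ancestor is rejected. The recursive walk is kept for brevity, so very deep documents would also need a depth limit.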
> You certainly shouldn't open an internet connected service that accepts the text format as input (except maybe during development and debugging...)
Yes, I fully agree with this! I initially assumed that the text format could be sent from an untrusted client similar to JSON and XML, but this makes more sense.
> because text parsing/encoding is supposed to be the 0.0001% use case
I see, so the main use case of the CTE text format is rapid prototyping, and then the user should convert to the CBE binary format in production?
> It's not super easy to add new types to the text format grammar
Customizable types could be a really great way to differentiate from other serialization protocols. I did notice that the system allows the user to define custom structs which is quite useful.
Another approach would be to embed the grammar and parser into an existing language like Python, Rust, or Haskell, and let the user define their own custom types in that language. In my experience, custom types help prevent a lot of errors. For example, in a fitness tracker IoT application you could define separate types for ip_v4 address, duration_milliseconds, temperature_celsius, heart_rate_beats_per_minute, and blood_pressure_mm_hg (for systolic and diastolic blood pressure) rather than using bare floating-point or fixed-point numbers, which could catch many unit-conversion and wrong-variable errors at compile time. You could also model your domain more closely with custom types, e.g. by reusing the global coordinate data structure from the timezones implementation to create path or polygon types from repeated coordinates.
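Here is a small Python sketch of the unit-type idea. The class and function names are made up for illustration. In Python the "compile-time" part would come from a static checker such as mypy, so the sketch also includes a runtime guard to show the same mistake being caught when the code runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemperatureCelsius:
    value: float


@dataclass(frozen=True)
class HeartRateBpm:
    value: float


def exceeds_limit(rate: HeartRateBpm, limit: HeartRateBpm) -> bool:
    """Compare two heart rates; refuse anything that is not a heart rate."""
    for arg in (rate, limit):
        # Runtime guard; a static checker would flag the same misuse earlier.
        if not isinstance(arg, HeartRateBpm):
            raise TypeError(f"expected HeartRateBpm, got {type(arg).__name__}")
    return rate.value > limit.value
```

Passing a `TemperatureCelsius` where a `HeartRateBpm` is expected fails loudly here, whereas with two bare floats the mix-up would go unnoticed.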
> adding new types should be done with EXTREME CARE
Maybe it would make sense to create a small set of core types (kind of like a standard library), and then permit extensions via user-defined types which must be whitelisted? But pursuing that route could end up addressing a very different niche in the design space (one favoring a stricter schema).
> The major downside of ANTLR of course is the terrible error reporting.
This is a major advantage of the parser combinator approach: the combinators can be designed to emit very helpful, context-aware error messages. See, for example, the examples at the end of: https://www.quanttec.com/fparsec/users-guide/customizing-err...
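To illustrate how combinators can carry context into their error messages, here is a tiny hand-rolled sketch in Python. It is not FParsec's API; `literal`, `integer`, `seq`, `label`, and the `endpoint` grammar are all made up for this example. Each parser is a function taking `(text, pos)` and returning `(value, new_pos)`, and `label` attaches a human-readable name to any failure that happens inside it.

```python
class ParseError(Exception):
    def __init__(self, pos, expected, context=()):
        self.pos = pos
        self.expected = expected
        self.context = tuple(context)
        where = " in " + " > ".join(self.context) if self.context else ""
        super().__init__(f"column {pos + 1}: expected {expected}{where}")


def literal(s):
    """Parser that matches the exact string s."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        raise ParseError(pos, repr(s))
    return parse


def integer(text, pos):
    """Parser that matches one or more ASCII digits."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == pos:
        raise ParseError(pos, "a digit")
    return int(text[pos:end]), end


def seq(*parsers):
    """Run parsers one after another and collect their values."""
    def parse(text, pos):
        values = []
        for p in parsers:
            value, pos = p(text, pos)
            values.append(value)
        return values, pos
    return parse


def label(name, parser):
    """Prefix any error raised inside parser with a context name."""
    def parse(text, pos):
        try:
            return parser(text, pos)
        except ParseError as err:
            raise ParseError(err.pos, err.expected, (name,) + err.context) from None
    return parse


# Hypothetical grammar: <number>:<number>
endpoint = label("endpoint", seq(label("host id", integer), literal(":"), label("port", integer)))
```

Parsing `"12:x"` with `endpoint` fails with `column 4: expected a digit in endpoint > port`, which tells the user both where the problem is and which part of the grammar was being parsed.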
Anyway, hope this was useful and I wish you good luck with your project!
> If you'd like to eventually harden the binary implementations, you might also be interested in coverage-guided fuzz testing which feeds random garbage data to a method to try and find errors in it: https://llvm.org/docs/LibFuzzer.html
Yes, I plan to fuzz the hell out of the reference implementation once it's done. So much to do, so little time...
> I see, so the main use case of the CTE text format is rapid prototyping, and then the user should convert to the CBE binary format in production?
CTE would be for prototyping, initial data loads, debugging, auditing, logging, visualizing, possibly even for configuration (since the config would be local and not sourced from an unknown origin). Basically: CBE when data passes from machine to machine, and CTE only where a human needs to get involved.
> Another approach would be to embed the grammar and parser into an existing language like Python, Rust, or Haskell, and let the user define their own custom types in that language.
I demonstrate this in the reference implementation by adding cplx() type support for Go as a custom type. Then people are free to come up with their own encodings for their custom needs (one could specify in the schema how to decode them). I think there's enough there as-is to support most custom needs.
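For readers who want a feel for the idea, here is a rough sketch in Python rather than Go. The byte layout (a one-byte type code followed by two little-endian float64 values) and all names are assumptions made up for this example; this is not the reference implementation's actual encoding.

```python
import struct

CPLX_TYPE_CODE = 1  # hypothetical application-chosen code for complex numbers


def encode_cplx(z: complex) -> bytes:
    """Encode a complex number as: type code, real part, imaginary part."""
    return struct.pack("<Bdd", CPLX_TYPE_CODE, z.real, z.imag)


def _decode_cplx(body: bytes) -> complex:
    if len(body) != 16:
        raise ValueError("cplx payload must be exactly 16 bytes")
    real, imag = struct.unpack("<dd", body)
    return complex(real, imag)


# Registry of decoders keyed by type code; an application would add its own.
DECODERS = {CPLX_TYPE_CODE: _decode_cplx}


def decode_custom(data: bytes):
    """Dispatch on the leading type code and decode the rest of the payload."""
    if not data:
        raise ValueError("empty custom payload")
    decoder = DECODERS.get(data[0])
    if decoder is None:
        raise ValueError(f"unknown custom type code {data[0]}")
    return decoder(data[1:])
```

Unknown type codes are rejected rather than skipped, which fits the restrictive-by-default stance discussed above.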
> maybe it would make sense to create a small set of core types (kind of like a standard library), and then permit extensions via user-defined types which must be whitelisted?
I thought about that, but the complexity grows fast, and then you have a constellation of "conformant" codecs that have different levels of support, which means you can now only count on the minimal set of required types and the rest are useless. The fewer optional parts, the better.