If the author is here, thanks for writing this up. I did something similar years ago, but I neglected to write it up. My process was essentially the same as yours, although I used some thrown-together python code to analyze the protobuf data.
After I finished, I found out that the raw .proto files were embedded in the web app, which would have saved a bit of time, but also taken away some of the fun. So if you want the official schema, you can look there, but it might not be legal to redistribute as-is.
Also, if anyone is poking around doing protobuf forensics on Objective-C code, I once threw together a script to reverse engineer protobuf definitions from Objective-C executables. I haven't tested it recently, but it should work for Objective-C code generated by protoc (i.e. if _PBDataWriterWriteInt shows up in nm output):
https://gist.github.com/dunhamsteve/224e26a7f56689c33cea4f0f...
I was not here, but wanted to create an account to thank you for the work you did years ago. Your work was really critical to my understanding of tables and I wish I had found it sooner!
I keep meaning to go back and add in the drawing code, but since I can export the fallback image directly, I just use that for now in the parser.
Glad you liked the article and really glad to see you specifically comment on it. Thank you!
That project sent me down a rabbit hole learning about CRDTs. I'd love to learn more about the history of the Notes implementation. Who did it? What background did they have? Their "topotext" construct is similar to RGA, but different from any paper that I can find. I think it was invented by the developers.
The drawing thing I did because it was interesting - the kind of data that would be fun to have to play around with character / shape recognition. I export it as SVG for my "notes2html". But since my output doesn't take into account the pen angle, pressure, and velocity, it isn't going to be as good as Apple's PNG.
The actual raw data for the stroke is a series of PKCompressedStrokePoint structures that contain timestamp, position, radius, aspectRatio, edgeWidth, force, azimuth (scaled radians), altitude, and opacity.
I had to look at disassembled libraries to figure that part out. Peeking at PencilKit again with Hopper makes me think that the structure layout may have changed since I wrote that code.
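To make the field list above concrete, here is a hedged sketch of a parser for that structure. The field names come from the comment above, but the ordering, the float widths, and the `'<d9f'` packing are my guesses for illustration, not the real PKCompressedStrokePoint layout (which, as noted, may well have changed between OS versions):

```python
import struct
from typing import NamedTuple

# Hypothetical layout: timestamp as a double, everything else as float32.
# The real structure's ordering and widths are assumptions here.
FMT = '<d9f'

class StrokePoint(NamedTuple):
    timestamp: float
    x: float              # position, assumed to be two floats
    y: float
    radius: float
    aspect_ratio: float
    edge_width: float
    force: float
    azimuth: float        # scaled radians
    altitude: float
    opacity: float

def parse_points(data: bytes):
    """Iterate fixed-size records out of a blob of concatenated points."""
    size = struct.calcsize(FMT)
    for off in range(0, len(data) - size + 1, size):
        yield StrokePoint(*struct.unpack_from(FMT, data, off))
```

A disassembler pass like the one described above would be needed to pin down the actual offsets before trusting any of this.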
The author says "field 2 is clearly a message type" but does not support this deduction. Why did they conclude the second field was a message type rather than just a string? Strings and messages are isomorphic on the wire.
> Why did they conclude the second field was a message type rather than just a string? Strings and messages are isomorphic on the wire.
Presumably its contents. The included protobuf-inspector output shows it's a valid submessage, including length-delimited fields having valid lengths. A two-deep string matches the expected title. The tag numbers are also in order, which adds plausibility. (Tags aren't required to be in a particular order but IIRC the stock implementations in most languages output them in ascending tag number order for known fields.)
This is pretty strong evidence. If it's not strong enough for you, reverse engineering might not be your thing. That's okay.
Nit: message is isomorphic with bytes, not string. string has the additional restriction of being valid UTF-8. I didn't check, but I wouldn't be surprised if field 2 is not a valid string.
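To make the ambiguity concrete, here's a minimal sketch of a protobuf wire-format reader in Python. The bytes are made up for illustration (they are not from the article): the payload of field 2 decodes equally well as a string and as a nested message with a string field, which is exactly why the distinction has to come from evidence like valid nested lengths and plausible tag order.

```python
def read_varint(buf, i):
    """Decode a base-128 varint starting at index i; return (value, next_index)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def parse_fields(buf):
    """Parse a protobuf wire-format buffer into (field_number, wire_type, value) tuples."""
    i, fields = 0, []
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wtype = key >> 3, key & 7
        if wtype == 0:          # varint
            value, i = read_varint(buf, i)
        elif wtype == 2:        # length-delimited: bytes, string, or submessage
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError(f"wire type {wtype} not handled in this sketch")
        fields.append((field, wtype, value))
    return fields

# Field 2, length-delimited. On the wire there is no way to tell whether the
# payload is a string or a nested message; both are just bytes.
outer = b'\x12\x07\x0a\x05Title'
(field, wtype, payload), = parse_fields(outer)
as_string = payload.decode('utf-8')                       # '\n\x05Title'
(inner_field, inner_wtype, inner_value), = parse_fields(payload)  # field 1 = b'Title'
```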
Right but the fact that a bytes can be decoded as a message doesn't mean that's how it's declared in the proto. Sometimes a message will have `bytes opaquemsg = 42;` when an application wants to proxy through a nested protocol message, but without caring about the contents. What I'm getting at is the article has arrived at one of many possible proto files that could be the one the app actually uses.
They're reverse engineering the program from its effects (the files it reads and writes). It's not possible to exactly duplicate the original program's source code in that manner, nor is it necessary. When there are two indistinguishable approaches—in this case representing something as a message in the .proto vs representing it as bytes and later decoding it separately as a message—it makes more sense to choose the simpler way.
You're absolutely right that this is not supposed to be a 100% correct representation of the protobuf, which I tried to capture in a few places, including the conclusion.
There's some hand-wavy magic in this article, which was written months after doing the actual work. One aspect of that is I made a lot of test data, and while yes, it could be any number of things, applying Occam's razor meant that when I saw what looked like a message type (and not just a byte sequence that could represent many other potential protobufs) I went with that as the conclusion. That conclusion seemed to hold as I tested many other, more complicated notes generated from different types of devices. Hence why I present it here as a message.
But, yes, you're absolutely right that this is not the protobuf definition used by Apple (which is copyrighted) and there is certainly information missing from it. My goal was simply to be able to recover what I see on the Apple device's screen.
It's been interesting to see how many places protocol buffers are now used. I'm guessing a large fraction of companies use them internally without being too public about it, which is unfortunate, because I've often shied away from them as a Google-only thing.
It's remarkable that nothing better has replaced them, since they're a little old and crufty, but I guess most of the web startup community was happy enough with JSON. Plus, it's still hard to tell if proto2 or proto3 is the right version to build on long term.
The biggest issue in proto3 is the missing "has_<field>" functions (i.e. no distinction between a field's default value and a special <missing> state).
There are suggested workarounds for this, but I'd just stick with proto2 for the time being as proto3 is still WIP.
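The root of the missing "has_" problem is that proto3 serializers simply omit fields that equal their default value. A toy sketch (this is my own illustration of the encoding behavior, not code from any protobuf library) shows why the receiver can't tell "set to 0" from "never set":

```python
def encode_int_field(field_number, value):
    """Encode one varint field the way a proto3 serializer would:
    fields equal to their default (0) are omitted from the output entirely."""
    if value == 0:
        return b''                            # default value -> nothing on the wire
    key = (field_number << 3) | 0             # wire type 0 = varint
    out = bytes([key])
    while value > 0x7F:                       # base-128 varint, low groups first
        out += bytes([(value & 0x7F) | 0x80])
        value >>= 7
    return out + bytes([value])

# An explicitly-set 0 and an unset field produce identical bytes,
# so a has_<field> accessor is unimplementable on the receiving side:
assert encode_int_field(1, 0) == b''
```

In proto2, an optional field gets an explicit presence bit tracked by the generated code, which is what the workarounds (e.g. wrapper messages) try to recover.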
You may be thinking of the Compound File Binary Format[1], which is used by OLE applications.
OLE itself is an API mechanism, not a serialization format. So when an application that speaks OLE finds a chunk in a compound file that needs to be handled by another application, it would use OLE to go ask that application for help.
Compound formats, in and of themselves, aren't really all that special. One could make a case that this is a common use case for fairly prosaic technologies like XML and MIME, too, for example.
MS-RPC NDR IDL, ONC/RPC XDR, ASN.1, and many, many others. There aren't too many variations.
You can have "self-describing" formats, which is generally short for tag-length-value (TLV) formats where the tag identifies the type of the encoded value, and the value is an octet string of the given length encoding a value of the given type tag. This includes ASN.1's BER/DER/CER. Protobufs is a TLV encoding.
Or you can have a more "binary" format with no tags and lengths elided for fields with fixed-sized encodings or fields whose variable-length value encodings are self-describing as to length (e.g., various varint formats). This includes NDR, XDR, ASN.1's PER and OER. Flatbuffers is a variation on this theme.
Or you can have textual encodings like XML and JSON.
There's not a lot of other options.
Each has its pros and cons, but, really, the packed encoding type (NDR, XDR, PER, OER, flatbuffers) is generally superior. That's because TLV schemes are redundant and therefore bloated, and because you really need to have the schema at hand (or have code generated from the schema, which is a different way of having the schema at hand) no matter what, so "self-describing" just isn't terribly useful. If you've ever seen the output of "asn1print" type programs you'll understand that you still need to know the schema in your head to understand the output, so might as well have a dumper/pretty-printer that can use a schema to really pretty-print...
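The bloat claim is easy to quantify with a toy comparison (my own illustration, using a simplified one-byte-tag TLV rather than any particular standard's framing):

```python
import struct

# Three uint32 fields with a known schema.
values = (7, 1_000_000, 42)

# XDR/PER-style "packed": no tags, fixed 4-byte big-endian slots; the schema
# alone tells the decoder what each slot means.
packed = struct.pack('>3I', *values)          # 12 bytes

# TLV-style: every field pays for a tag byte and a length byte on top of the value.
def tlv(tag, payload):
    return bytes([tag, len(payload)]) + payload

tlv_encoded = b''.join(tlv(t, struct.pack('>I', v)) for t, v in enumerate(values, 1))
# 18 bytes: the same information, 50% bigger, and the tags/lengths still don't
# tell you what the fields mean without the schema.
```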
Textual encodings generally end up getting binary / optimized variants. E.g., XML has FastInfoSet (which is essentially a transcoding to ASN.1 with PER!). JSON has many binary formats, including several "JSONB" schemes (e.g., PostgreSQL's), and CBOR. This is because at the end of the day, textual formats are hard to parse -- they really suck. No one needs to build a SIMD parser for NDR/XDR/PER/OER/flatbuffers/CBOR/JSONB, but JSON is so hard to parse quickly that SIMD parsers for it exist.
Besides the many standards, I've seen quite a few ad-hoc schemes, such as Lustre's RPC format (basically flatbuffers, but without an IDL compiler and code generator), many Perl5 DataDumper-type modules, Lisp S-expressions (the grand-daddy of data encoding formats!), and many others.
In a few years we might see a new round of bad reinventions of this wheel. This happens because inexperienced people find that scheme X "sucks" or there's no tooling for scheme X in the hot new programming language yet, and rather than read the specs for X and implement their own tooling they go off and make up a new scheme as they implement it.
ASN.1 gets a lot of hate, but mostly the issue is with the TLV-type encoding rules; there are many more encoding rules than just BER (and its DER and CER variants): there are binary-style non-TLV ERs like PER and OER, there's XER (XML!), JER (JSON!), and even GSER (a textual encoding just for LDAP), and anyone can create new encoding rules if they really want to. Another reason ASN.1 gets hated on is lack of tooling, but every time anyone invents a new scheme they have to write new tooling anyway -- what a waste.
Note: I'm not saying we should only be using ASN.1 or anything of the sort. I'm saying we need to really understand what came before so we don't make the same mistakes again. Most importantly we need to understand that the T and often the L in TLV are just a lame crutch that bloats data on the wire and makes it easy to hand-roll bug-riddled codecs.
It has unlimited extensibility and doesn't need a schema/protocol to decode. It can include typed values, not just strings as in XML, which makes serializers/deserializers much faster compared to text/XML: no need to print or parse numbers. It has an easy conversion to text/XML, which helps with debugging and diagnostics. All protocol elements are byte-aligned, and with some experience it's possible to understand a dump even without decoding it into text.
The only downside: it's hard to implement outside .NET. Still, in 2020 all editions of the .NET runtime support it, and the implementations are compatible across platforms.
Don't miss some of the commentary below about zero-copy encodings.
Also, the ITU-T specs for ASN.1 are some of the best-written specs I've ever had to read and use. ASN.1 may seem dated and ugly, but its world-class specs are a real advantage, IMO.
I've written zero-copy implementations (as much as was possible) for DNS, MIME, RTSP/RTP, SMTP, and others. Locality is far more important to performance. Generic zero-copy APIs often have too much indirection, and for short messages or messages with many distinct fields, generating all those pointers can cause more data churn than simple copies. And while zero-copy makes the most sense for messages with large blobs, if you have to buffer the entire message you're again losing data locality, whereas a streaming architecture can be even more performant even with copies (and even because of the copying) when you can keep your working set small and hot.
IME, there's simply no getting around the fact that if you want better performance, efficiency, and/or latency, you have to design your entire stack holistically. A library promising a generic API with a supposedly generic "optimization" is often already suboptimal, and architecting an application with that mindset (e.g. trying to push all the hard parts behind a generic getter/setter API) makes the suboptimality grow non-linearly.
I agree about ASN.1. The ASN.1 specifications are well written and everything is very well thought out (even though it's not always obvious why without the benefit of hindsight). Though, for a long while they weren't freely available, officially or unofficially. Thankfully that's changed (not sure precisely when that happened), but it cost them a lot in terms of mindshare and was a big reason, I suspect, for the horrible open source tooling both then and now.
There's tradeoffs. "Packed encoding type is generally superior" for some workflows (write-once serialization across the wire), but certainly not across the board.
There are tradeoffs, but I think binary, non-TLV encodings are superior. TLV encodings are notoriously difficult to encode but add no real value on the decode side. With TLV encodings you get a measure of implied extensibility that requires explicit support in binary, non-TLV encodings. But it turns out that you want to have explicit extensibility even in TLV cases anyways, since it helps the code generator, and for security reasons.
Even ignoring the zero-copy advantages of flatbufs and Cap'n Proto, TLV is bloaty and redundant, and redundancy is one mother of serious bugs in hand-rolled codecs.
Perhaps the best example I have of trade-offs is PostgreSQL's JSONB, which is optimized to be compressible and fast to traverse. To my knowledge that's the only binary JSON encoding that does that. (Specifically, it encodes lengths of array and object elements, and of object keys, but every N it encodes an offset, and the lengths are of the value encodings which are themselves concatenated. This allows similar lengths to be compressible, while the never-repeating offsets are not but also they're few and far between, and it also allows for fast indexing.)
> TLV is bloaty and redundant, and redundancy is one mother of serious bugs in hand-rolled codecs.
This is an argument against hand-generating TLV binary packets, not an argument against using generated code (as in protobuf).
> Even ignoring the zero-copy advantages of flatbufs and Cap'n Proto
See kenton's comment I linked: zero-copy can be a disadvantage in write-many applications, since if you have a variable-length field (a string or submessage), you can end up with super inefficient binary encodings.
> zero-copy can be a disadvantage in write-many applications
To clarify, the disadvantage stems from the fact that to get zero-copy, you really need all the parts of a particular message to be in contiguous memory (or at least a small number of contiguous segments), which creates a memory allocator design challenge: now instead of one huge heap you have many small heaps and fragmentation potentially wastes space on the wire. So far I've made no attempt to solve this problem in Cap'n Proto; new objects are simply always added to the end of the message, which works fine for write-once and append-only scenarios but not for general mutable structures.
I think this is an orthogonal issue to the TLV vs. fixed offsets debate.
Right, this is a problem, but it's orthogonal to TLV, but it's TLV I was arguing against, not necessarily for cap'n proto.
You can use OER without having to allocate all sub-parts of the structure from one flat buffer, with no "edits" that fragment the buffer, and no memmoves, and you won't get zero-copy, but you'll still get a more efficient codec than any TLV encoding. I just don't see any case where TLV is appealing, but TLV is appalling :)
I think we're on the same page.
EDIT: BTW, thanks for the clarification -- I thought that was the issue you meant, but what you linked to was lengthy and I didn't want to have to wade through all of it.
EDIT: Also, even if zero-copy schemes are inefficient for some uses on the encoder side, they can still be very efficient on the decode side, as you need essentially no memory allocations to decode (though you still need to traverse the encoded data).
> I think this is an orthogonal issue to the TLV vs. fixed offsets debate
I agree, I don't think they're quite related. But I think some of the original claim that TLV is worse is based on the potential for zero copy. That is advantageous in some common situations, but it's not always the case.
> This is an argument against hand-generating TLV binary packets, not an argument against using generated code (as in protobuf).
Or maybe it's an argument against the kind of designs that tempt devs to hand-code them.
> See kenton's comment I linked: zero-copy can be a disadvantage in write-many applications, since if you have a variable-length field (a string or submessage), you can end up with super inefficient binary encodings.
I don't buy that as a counter argument to the point about TLV's disadvantages relative to PER/OER and flatbuffers/cap'n proto. You might prefer OER to cap'n proto in some cases, or vice versa, but not TLV to either.
What I do appreciate about TLV-style formats is that they admit general-purpose debugging tools. When all you have is a bitstream where field lengths are not multiples of 8 bits and are variable based on the actual value being encoded (looking at you, MPEG), the toolset to debug streams ends up being non-trivial.
If the IDL compiler is more like a library then you can have an "asn1print" that takes a schema and dumps the data correctly with all the metadata you need and want and none of the crappy TLV.
Haven’t used protobufs for years, but as I remember it, you can use an old schema to read a newer-version object, discarding the new fields. I don’t think that would be possible without the TLV.
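The mechanism behind that is worth spelling out: the wire type in each TLV key tells an old-schema decoder how many bytes to skip even for field numbers it has never heard of. A minimal sketch (my own illustration, not library code):

```python
def read_varint(buf, i):
    """Decode a base-128 varint starting at index i; return (value, next_index)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

KNOWN_FIELDS = {1, 2}   # the "old" schema only knows fields 1 and 2

def decode_known(buf):
    """Decode known fields; unknown ones are skipped using only the wire type,
    which is what makes old-schema readers forward compatible."""
    i, out = 0, {}
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wtype = key >> 3, key & 7
        if wtype == 0:          # varint: length implicit in the encoding
            value, i = read_varint(buf, i)
        elif wtype == 2:        # length-delimited: the L tells us how far to skip
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if field in KNOWN_FIELDS:
            out[field] = value
    return out

# A "newer" message: field 1 = 5, plus field 3 = b'new!' unknown to the old schema.
msg = b'\x08\x05' + b'\x1a\x04new!'
assert decode_known(msg) == {1: 5}
```

As the sibling comments note, non-TLV schemes can get the same effect, but only if extensibility points are declared explicitly in the schema.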
In ASN.1 people were doing the same thing unofficially in the mid-80s, then explicit support was added in the form of "extensibility markers".
There's nothing new in protobufs.
EDIT: I should add that the non-TLV encoding rules for ASN.1 all support extensibility. Also, there are good engineering practices and security reasons why one should want explicit documentation of what structures/enums/choices/typeranges are expected to be extensible. Relying on decoders to ignore unexpected bits is not a sound practice.
One question I’ve had about Protobufs is, what about function definitions? If Protobufs handle my objects, do I have to use a separate tool for my API?
If I understand the idea of protobufs correctly, the fundamental point is to start with a serialization specification and then generate reader/writer code for that spec in the programming language of your choice.
The fundamental idea is that when you have a function definition like
f(<pattern>) <- .....
instead of a runtime binary match(<pattern>,<expression>), the compiler generates a function (let's call it fmp) such that
fmp(<expression>) = match(<pattern>,<expression>) for any <expression>
Ever since I first used pattern matching compilation when I worked on the Hope compiler, I always thought that the idea of precompiling static function parameters (or slowly changing ones) was a very under-used idea.
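A toy sketch of the idea in Python (the names WILD and compile_pattern are made up for illustration): the checks implied by a pattern are decided once, when the pattern is compiled, rather than re-interpreted on every call.

```python
WILD = object()   # wildcard slot in a pattern

def compile_pattern(pattern):
    """Precompile a pattern into a specialized matcher function (the 'fmp'
    above): the structural checks are fixed here, once, at compile time."""
    checks = [(i, v) for i, v in enumerate(pattern) if v is not WILD]
    n = len(pattern)
    def matcher(expr):
        return len(expr) == n and all(expr[i] == v for i, v in checks)
    return matcher

# f(('point', x, y)) <- ...  becomes a dispatch on a precompiled matcher:
is_point = compile_pattern(('point', WILD, WILD))
assert is_point(('point', 3, 4))
assert not is_point(('line', 0, 0))
```

A real pattern-match compiler goes further (decision trees over nested constructors), but the payoff is the same: the per-call work shrinks to the checks the pattern actually requires.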
Protobufs is just yet another structured data serialization format. There's nothing special to it other than "it's not one of the others". It's a bad reinvention of the wheel though. That's why there's new and better things like flatbuffers. The most important thing to take away from protobufs is that in a field full of literature and prior art, one should not ignore it.
For the curious, here's my hot take on the details:
Protocol buffers, being older, are far and away the more widely supported format. If you're looking for something like a wire format for RPC, or usability from basically any common language is important to you, protobuf will be the path of least resistance. However, there are some design warts to work around. The most notable one (IMO) is that varints make dealing with numbers a rather delicate subject.
Flatbuffers are similar, but made some design decisions that allow for a zero-copy implementation. This can be good for performance. Overall, flatbuffers have always struck me as being promising, but language and tooling support isn't nearly as comprehensive, which has always kept me from giving them a serious try at work.
A much later thought, that perhaps speaks to the idea that one size does not fit all:
The Apache Arrow project, which is the most ambitious project I know of for sharing data among systems in a performant way, ended up going with a mix of options. Actual data is stored in a project-specific format. When the data are serialized (e.g., to disk or for transport over the wire), headers, schemata, and suchlike are formatted using FlatBuffers. The RPC protocol itself uses Protobuf for the messages.
Protobufs aren't perfect, but flatbuffers aren't the holy grail, either. I've found the protobuf ecosystem way better documented (and yet still lacking), and it has way more market penetration (due to gRPC, no doubt). I've found flatbuffers' API ergonomics absolutely dreadful. The Python flatbuffers library re-parses bytes on each access, which kind of defeats the purpose of zero copy. Maybe it's better in other langs, but Python was where I was hoping to reap performance gains and didn't see them.