If the author is here, thanks for writing this up. I did something similar years ago, but I neglected to write it up. My process was essentially the same as yours, although I used some thrown-together python code to analyze the protobuf data.
After I finished, I found out that the raw .proto files were embedded in the web app, which would have saved a bit of time, but also taken away some of the fun. So if you want the official schema, you can look there, but it might not be legal to redistribute as-is.
Also, if anyone is poking around doing protobuf forensics on Objective-C code, I once threw together a script to reverse engineer protobuf definitions from Objective-C executables. I haven't tested it recently, but it should work for Objective-C code generated by protoc (i.e. if _PBDataWriterWriteInt shows up in nm output):
https://gist.github.com/dunhamsteve/224e26a7f56689c33cea4f0f...
I was not here, but wanted to create an account to thank you for the work you did years ago. Your work was really critical to my understanding of tables and I wish I had found it sooner!
I keep meaning to go back and add in the drawing code, but since I can export the fallback image directly, I just use that for now in the parser.
Glad you liked the article and really glad to see you specifically comment on it. Thank you!
That project sent me down a rabbit hole learning about CRDTs. I'd love to learn more about the history of the Notes implementation. Who did it? What background did they have? Their "topotext" construct is similar to RGA, but different from any paper that I can find. I think it was invented by the developers.
The drawing thing I did because it was interesting - the kind of data that would be fun to have to play around with character / shape recognition. I export it as SVG for my "notes2html". But since my output doesn't take into account the pen angle, pressure, and velocity, it isn't going to be as good as Apple's PNG.
The actual raw data for the stroke is a series of PKCompressedStrokePoint structures that contain timestamp, position, radius, aspectRatio, edgeWidth, force, azimuth (scaled radians), altitude, and opacity.
I had to look at disassembled libraries to figure that part out. Peeking at PencilKit again with Hopper makes me think that the structure layout may have changed since I wrote that code.
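To make the field list above concrete, here is a hedged sketch of a parser for that structure. The field names come from the comment above, but the ordering, the float widths, and the `'<d9f'` packing are my guesses for illustration, not the real PKCompressedStrokePoint layout (which, as noted, may well have changed between OS versions):

```python
import struct
from typing import NamedTuple

# Hypothetical layout: timestamp as a double, everything else as float32.
# The real structure's ordering and widths are assumptions here.
FMT = '<d9f'

class StrokePoint(NamedTuple):
    timestamp: float
    x: float              # position, assumed to be two floats
    y: float
    radius: float
    aspect_ratio: float
    edge_width: float
    force: float
    azimuth: float        # scaled radians
    altitude: float
    opacity: float

def parse_points(data: bytes):
    """Iterate fixed-size records out of a blob of concatenated points."""
    size = struct.calcsize(FMT)
    for off in range(0, len(data) - size + 1, size):
        yield StrokePoint(*struct.unpack_from(FMT, data, off))
```

A disassembler pass like the one described above would be needed to pin down the actual offsets before trusting any of this.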
The author says "field 2 is clearly a message type" but does not support this deduction. Why did they conclude the second field was a message type rather than just a string? Strings and messages are isomorphic on the wire.
> Why did they conclude the second field was a message type rather than just a string? Strings and messages are isomorphic on the wire.
Presumably its contents. The included protobuf-inspector output shows it's a valid submessage, including length-delimited fields having valid lengths. A two-deep string matches the expected title. The tag numbers are also in order, which adds plausibility. (Tags aren't required to be in a particular order but IIRC the stock implementations in most languages output them in ascending tag number order for known fields.)
This is pretty strong evidence. If it's not strong enough for you, reverse engineering might not be your thing. That's okay.
Nit: message is isomorphic with bytes, not string. string has the additional restriction of being valid UTF-8. I didn't check, but I wouldn't be surprised if field 2 is not a valid string.
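To make the ambiguity concrete, here's a minimal sketch of a protobuf wire-format reader in Python. The bytes are made up for illustration (they are not from the article): the payload of field 2 decodes equally well as a string and as a nested message with a string field, which is exactly why the distinction has to come from evidence like valid nested lengths and plausible tag order.

```python
def read_varint(buf, i):
    """Decode a base-128 varint starting at index i; return (value, next_index)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def parse_fields(buf):
    """Parse a protobuf wire-format buffer into (field_number, wire_type, value) tuples."""
    i, fields = 0, []
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wtype = key >> 3, key & 7
        if wtype == 0:          # varint
            value, i = read_varint(buf, i)
        elif wtype == 2:        # length-delimited: bytes, string, or submessage
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError(f"wire type {wtype} not handled in this sketch")
        fields.append((field, wtype, value))
    return fields

# Field 2, length-delimited. On the wire there is no way to tell whether the
# payload is a string or a nested message; both are just bytes.
outer = b'\x12\x07\x0a\x05Title'
(field, wtype, payload), = parse_fields(outer)
as_string = payload.decode('utf-8')                       # '\n\x05Title'
(inner_field, inner_wtype, inner_value), = parse_fields(payload)  # field 1 = b'Title'
```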
Right but the fact that a bytes can be decoded as a message doesn't mean that's how it's declared in the proto. Sometimes a message will have `bytes opaquemsg = 42;` when an application wants to proxy through a nested protocol message, but without caring about the contents. What I'm getting at is the article has arrived at one of many possible proto files that could be the one the app actually uses.
They're reverse engineering the program from its effects (the files it reads and writes). It's not possible to exactly duplicate the original program's source code in that manner, nor is it necessary. When there are two indistinguishable approaches—in this case representing something as a message in the .proto vs representing it as bytes and later decoding it separately as a message—it makes more sense to choose the simpler way.
You're absolutely right that this is not supposed to be a 100% correct representation of the protobuf, which I tried to capture in a few places, including the conclusion.
There's some hand-wavy magic in this article, which was written months after doing the actual work. One aspect of that is I made a lot of test data, and while yes, it could be any number of things, applying Occam's razor meant that when I saw what looked like a message type (and not just a byte sequence that could represent many other potential protobufs) I went with that as the conclusion. That conclusion seemed to hold as I tested many other, more complicated notes generated from different types of devices. Hence why I present it here as a message.
But, yes, you're absolutely right that this is not the protobuf definition used by Apple (which is copyrighted) and there is certainly information missing from it. My goal was simply to be able to recover what I see on the Apple device's screen.
It's been interesting to see how many places protocol buffers are now used. I'm guessing a large fraction of companies use them internally without being too public about it, which is unfortunate, because I've often shied away from them as a Google-only thing.
It's remarkable that nothing better has replaced them, since they're a little old and crufty, but I guess most of the web startup community was happy enough with JSON. Plus, it's still hard to tell if proto2 or proto3 is the right version to build on long term.
The biggest issue in proto3 is the missing "has_<field>" functions (i.e. no distinction between a field's default value and a special <missing> state).
There are suggested workarounds for this, but I'd just stick with proto2 for the time being as proto3 is still WIP.
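The root of the missing "has_" problem is that proto3 serializers simply omit fields that equal their default value. A toy sketch (this is my own illustration of the encoding behavior, not code from any protobuf library) shows why the receiver can't tell "set to 0" from "never set":

```python
def encode_int_field(field_number, value):
    """Encode one varint field the way a proto3 serializer would:
    fields equal to their default (0) are omitted from the output entirely."""
    if value == 0:
        return b''                            # default value -> nothing on the wire
    key = (field_number << 3) | 0             # wire type 0 = varint
    out = bytes([key])
    while value > 0x7F:                       # base-128 varint, low groups first
        out += bytes([(value & 0x7F) | 0x80])
        value >>= 7
    return out + bytes([value])

# An explicitly-set 0 and an unset field produce identical bytes,
# so a has_<field> accessor is unimplementable on the receiving side:
assert encode_int_field(1, 0) == b''
```

In proto2, an optional field gets an explicit presence bit tracked by the generated code, which is what the workarounds (e.g. wrapper messages) try to recover.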
You may be thinking of the Compound File Binary Format[1], which is used by OLE applications.
OLE itself is an API mechanism, not a serialization format. So when an application that speaks OLE finds a chunk in a compound file that needs to be handled by another application, it would use OLE to go ask that application for help.
Compound formats, in and of themselves, aren't really all that special. One could make a case that this is a common use case for fairly prosaic technologies like XML and MIME, too, for example.
MS-RPC NDR IDL, ONC/RPC XDR, ASN.1, and many, many others. There aren't too many variations.
You can have "self-describing" formats, which is generally short for tag-length-value (TLV) formats where the tag identifies the type of the encoded value, and the value is an octet string of the given length encoding a value of the given type tag. This includes ASN.1's BER/DER/CER. Protobufs is a TLV encoding.
Or you can have a more "binary" format with no tags and lengths elided for fields with fixed-sized encodings or fields whose variable-length value encodings are self-describing as to length (e.g., various varint formats). This includes NDR, XDR, ASN.1's PER and OER. Flatbuffers is a variation on this theme.
Or you can have textual encodings like XML and JSON.
There's not a lot of other options.
Each has its pros and cons, but, really, the packed encoding type (NDR, XDR, PER, OER, flatbuffers) is generally superior. That's because TLV schemes are redundant and therefore bloated, and because you really need to have the schema at hand (or have code generated from the schema, which is a different way of having the schema at hand) no matter what, so "self-describing" just isn't terribly useful. If you've ever seen the output of "asn1print" type programs you'll understand that you still need to know the schema in your head to understand the output, so might as well have a dumper/pretty-printer that can use a schema to really pretty-print...
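The bloat claim is easy to quantify with a toy comparison (my own illustration, using a simplified one-byte-tag TLV rather than any particular standard's framing):

```python
import struct

# Three uint32 fields with a known schema.
values = (7, 1_000_000, 42)

# XDR/PER-style "packed": no tags, fixed 4-byte big-endian slots; the schema
# alone tells the decoder what each slot means.
packed = struct.pack('>3I', *values)          # 12 bytes

# TLV-style: every field pays for a tag byte and a length byte on top of the value.
def tlv(tag, payload):
    return bytes([tag, len(payload)]) + payload

tlv_encoded = b''.join(tlv(t, struct.pack('>I', v)) for t, v in enumerate(values, 1))
# 18 bytes: the same information, 50% bigger, and the tags/lengths still don't
# tell you what the fields mean without the schema.
```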
Textual encodings generally end up getting binary / optimized variants. E.g., XML has FastInfoSet (which is essentially a transcoding to ASN.1 with PER!). JSON has many binary formats, including several "JSONB" schemes (e.g., PostgreSQL's), and CBOR. This is because at the end of the day, textual formats are hard to parse -- they really suck. No one needs to build a SIMD parser for NDR/XDR/PER/OER/flatbuffers/CBOR/JSONB, but JSON is so hard to parse quickly that SIMD parsers for it exist.
Besides the many standards, I've seen quite a few ad-hoc schemes, such as Lustre's RPC format (basically flatbuffers, but without an IDL compiler and code generator), many Perl5 DataDumper-type modules, Lisp S-expressions (the grand-daddy of data encoding formats!), and many others.
In a few years we might see a new round of bad reinventions of this wheel. This happens because inexperienced people find that scheme X "sucks" or there's no tooling for scheme X in the hot new programming language yet, and rather than read the specs for X and implement their own tooling they go off and make up a new scheme as they implement it.
ASN.1 gets a lot of hate, but mostly the issue is with the TLV-type encoding rules; there are many more encoding rules than just BER (and its DER and CER variants): there are binary-style non-TLV ERs like PER and OER, there's XER (XML!), JER (JSON!), and even GSER (a textual encoding just for LDAP), and anyone can create new encoding rules if they really want to. Another reason ASN.1 gets hated on is lack of tooling, but every time anyone invents a new scheme they have to write new tooling anyway -- what a waste.
Note: I'm not saying we should only be using ASN.1 or anything of the sort. I'm saying we need to really understand what came before so we don't make the same mistakes again. Most importantly we need to understand that the T and often the L in TLV are just a lame crutch that bloats data on the wire and makes it easy to hand-roll bug-riddled codecs.
It has unlimited extensibility and doesn't need a schema/protocol to decode. It can include typed values, not just strings as in XML, which makes serializers/deserializers much faster compared to text/XML: no need to print or parse numbers. It has an easy conversion to text/XML, which helps with debugging and diagnostics. All protocol elements are byte-aligned, and with some experience it's possible to understand a dump even without decoding it into text.
The only downside: it's hard to implement outside .NET. Still, in 2020 all editions of the .NET runtime support it, and the implementations are compatible across platforms.
Don't miss some of the commentary below about zero-copy encodings.
Also, the ITU-T specs for ASN.1 are some of the best-written specs I've ever had to read and use. ASN.1 may seem dated and ugly, but its world-class specs are a real advantage, IMO.
I've written zero-copy implementations (as much as was possible) for DNS, MIME, RTSP/RTP, SMTP, and others. Locality is far more important to performance. Generic zero-copy APIs often have too much indirection, and for short messages or messages with many distinct fields, generating all those pointers can cause more data churn than simple copies. And while zero-copy makes the most sense for messages with large blobs, if you have to buffer the entire message you're again losing data locality, whereas a streaming architecture can be even more performant even with copies (and even because of the copying) when you can keep your working set small and hot.
IME, there's simply no getting around the fact that if you want better performance, efficiency, and/or latency, you have to design your entire stack holistically. A library promising a generic API with a supposedly generic "optimization" is often already suboptimal, and architecting an application with that mindset (e.g. trying to push all the hard parts behind a generic getter/setter API) makes the suboptimality grow non-linearly.
I agree about ASN.1. The ASN.1 specifications are well written and everything is very well thought out (even though it's not always obvious why without the benefit of hindsight). Though, for a long while they weren't freely available, officially or unofficially. Thankfully that's changed (not sure precisely when that happened), but it cost them a lot in terms of mindshare and was a big reason, I suspect, for the horrible open source tooling both then and now.
There's tradeoffs. "Packed encoding type is generally superior" for some workflows (write-once serialization across the wire), but certainly not across the board.
There are tradeoffs, but I think binary, non-TLV encodings are superior. TLV encodings are notoriously difficult to encode but add no real value on the decode side. With TLV encodings you get a measure of implied extensibility that requires explicit support in binary, non-TLV encodings. But it turns out that you want to have explicit extensibility even in TLV cases anyways, since it helps the code generator, and for security reasons.
Even ignoring the zero-copy advantages of flatbufs and Cap'n Proto, TLV is bloaty and redundant, and redundancy is one mother of serious bugs in hand-rolled codecs.
Perhaps the best example I have of trade-offs is PostgreSQL's JSONB, which is optimized to be compressible and fast to traverse. To my knowledge that's the only binary JSON encoding that does that. (Specifically, it encodes lengths of array and object elements, and of object keys, but every N it encodes an offset, and the lengths are of the value encodings which are themselves concatenated. This allows similar lengths to be compressible, while the never-repeating offsets are not but also they're few and far between, and it also allows for fast indexing.)
> TLV is bloaty and redundant, and redundancy is one mother of serious bugs in hand-rolled codecs.
This is an argument against hand-generating TLV binary packets, not an argument against using generated code (as in protobuf).
> Even ignoring the zero-copy advantages of flatbufs and Cap'n Proto
See kenton's comment I linked: zero-copy can be a disadvantage in write-many applications, since if you have a variable-length field (a string or submessage), you can end up with super inefficient binary encodings.
> zero-copy can be a disadvantage in write-many applications
To clarify, the disadvantage stems from the fact that to get zero-copy, you really need all the parts of a particular message to be in contiguous memory (or at least a small number of contiguous segments), which creates a memory allocator design challenge: now instead of one huge heap you have many small heaps and fragmentation potentially wastes space on the wire. So far I've made no attempt to solve this problem in Cap'n Proto; new objects are simply always added to the end of the message, which works fine for write-once and append-only scenarios but not for general mutable structures.
I think this is an orthogonal issue to the TLV vs. fixed offsets debate.
Right, this is a problem, but it's orthogonal to TLV, but it's TLV I was arguing against, not necessarily for cap'n proto.
You can use OER without having to allocate all sub-parts of the structure from one flat buffer, with no "edits" that fragment the buffer, and no memmoves, and you won't get zero-copy, but you'll still get a more efficient codec than any TLV encoding. I just don't see any case where TLV is appealing, but TLV is appalling :)
I think we're on the same page.
EDIT: BTW, thanks for the clarification -- I thought that was the issue you meant, but what you linked to was lengthy and I didn't want to have to wade through all of it.
EDIT: Also, even if zero-copy schemes are inefficient for some uses on the encoder side, they can still be very efficient on the decode side, as you need essentially no memory allocations to decode (though you still need to traverse the encoded data).
> I think this is an orthogonal issue to the TLV vs. fixed offsets debate
I agree, I don't think they're quite related. But I think some of the original claim that TLV is worse is based on the potential for zero copy. That is advantageous in some common situations, but it's not always the case.
> This is an argument against hand-generating TLV binary packets, not an argument against using generated code (as in protobuf).
Or maybe it's an argument against the kind of designs that tempt devs to hand-code them.
> See kenton's comment I linked: zero-copy can be a disadvantage in write-many applications, since if you have a variable-length field (a string or submessage), you can end up with super inefficient binary encodings.
I don't buy that as a counter argument to the point about TLV's disadvantages relative to PER/OER and flatbuffers/cap'n proto. You might prefer OER to cap'n proto in some cases, or vice versa, but not TLV to either.
What I do appreciate about TLV-style formats is that they admit general-purpose debugging tools. When all you have is a bitstream where field lengths are not multiples of 8 bits and are variable based on the actual value being encoded (looking at you, MPEG), the toolset to debug streams ends up being non-trivial.
If the IDL compiler is more like a library then you can have an "asn1print" that takes a schema and dumps the data correctly with all the metadata you need and want and none of the crappy TLV.
Haven’t used protobufs for years, but as I remember it, you can use an old schema to read a newer-version object, discarding the new fields. I don’t think that would be possible without the TLV.
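The mechanism behind that is worth spelling out: the wire type in each TLV key tells an old-schema decoder how many bytes to skip even for field numbers it has never heard of. A minimal sketch (my own illustration, not library code):

```python
def read_varint(buf, i):
    """Decode a base-128 varint starting at index i; return (value, next_index)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

KNOWN_FIELDS = {1, 2}   # the "old" schema only knows fields 1 and 2

def decode_known(buf):
    """Decode known fields; unknown ones are skipped using only the wire type,
    which is what makes old-schema readers forward compatible."""
    i, out = 0, {}
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wtype = key >> 3, key & 7
        if wtype == 0:          # varint: length implicit in the encoding
            value, i = read_varint(buf, i)
        elif wtype == 2:        # length-delimited: the L tells us how far to skip
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if field in KNOWN_FIELDS:
            out[field] = value
    return out

# A "newer" message: field 1 = 5, plus field 3 = b'new!' unknown to the old schema.
msg = b'\x08\x05' + b'\x1a\x04new!'
assert decode_known(msg) == {1: 5}
```

As the sibling comments note, non-TLV schemes can get the same effect, but only if extensibility points are declared explicitly in the schema.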
In ASN.1 people were doing the same thing unofficially in the mid-80s, then explicit support was added in the form of "extensibility markers".
There's nothing new in protobufs.
EDIT: I should add that the non-TLV encoding rules for ASN.1 all support extensibility. Also, there are good engineering practices and security reasons why one should want explicit documentation of what structures/enums/choices/typeranges are expected to be extensible. Relying on decoders to ignore unexpected bits is not a sound practice.
One question I’ve had about Protobufs is, what about function definitions? If Protobufs handle my objects, do I have to use a separate tool for my API?
If I understand the idea of protobufs correctly, the fundamental point is to start with a serialization specification and then generate reader/writer code for that spec in the programming language of your choice.
The fundamental idea is that when you have a function definition like
f(<pattern>) <- .....
instead of a runtime binary match(<pattern>,<expression>), the compiler generates a function (let's call it fmp) such that
fmp(<expression>) = match(<pattern>,<expression>) for any <expression>
Ever since I first used pattern matching compilation when I worked on the Hope compiler, I always thought that the idea of precompiling static function parameters (or slowly changing ones) was a very under-used idea.
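A toy sketch of the idea in Python (the names WILD and compile_pattern are made up for illustration): the checks implied by a pattern are decided once, when the pattern is compiled, rather than re-interpreted on every call.

```python
WILD = object()   # wildcard slot in a pattern

def compile_pattern(pattern):
    """Precompile a pattern into a specialized matcher function (the 'fmp'
    above): the structural checks are fixed here, once, at compile time."""
    checks = [(i, v) for i, v in enumerate(pattern) if v is not WILD]
    n = len(pattern)
    def matcher(expr):
        return len(expr) == n and all(expr[i] == v for i, v in checks)
    return matcher

# f(('point', x, y)) <- ...  becomes a dispatch on a precompiled matcher:
is_point = compile_pattern(('point', WILD, WILD))
assert is_point(('point', 3, 4))
assert not is_point(('line', 0, 0))
```

A real pattern-match compiler goes further (decision trees over nested constructors), but the payoff is the same: the per-call work shrinks to the checks the pattern actually requires.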
Protobufs is just yet another structured data serialization format. There's nothing special to it other than "it's not one of the others". It's a bad reinvention of the wheel though. That's why there's new and better things like flatbuffers. The most important thing to take away from protobufs is that in a field full of literature and prior art, one should not ignore it.
For the curious, here's my hot take on the details:
Protocol buffers, being older, are far and away the more widely supported format. If you're looking for something like a wire format for RPC, or usability from basically any common language is important to you, protobuf will be the path of least resistance. However, there are some design warts to work around. The most notable one (IMO) is that varints make dealing with numbers a rather delicate subject.
Flatbuffers are similar, but made some design decisions that allow for a zero-copy implementation. This can be good for performance. Overall, flatbuffers have always struck me as being promising, but language and tooling support isn't nearly as comprehensive, which has always kept me from giving them a serious try at work.
A much later thought, that perhaps speaks to the idea that one size does not fit all:
The Apache Arrow project, which is the most ambitious project I know of for sharing data among systems in a performant way, ended up going with a mix of options. Actual data is stored in a project-specific format. When the data are serialized (e.g., to disk or for transport over the wire), headers, schemata, and suchlike are formatted using FlatBuffers. The RPC protocol itself uses Protobuf for the messages.
Protobufs aren't perfect, but flatbuffers aren't the holy grail, either. I've found the protobuf ecosystem way better documented (and yet still lacking), and it has way more market penetration (due to gRPC, no doubt). I've found flatbuffers' API ergonomics absolutely dreadful. The Python flatbuffers library re-parses bytes on each access, which kind of defeats the purpose of zero copy. Maybe it's better in other langs, but Python was where I was hoping to reap performance gains and didn't see them.