Parsing Binary Data with Node.js

I'll start by highlighting some of the pillars of binary data, hopefully in a breeze. If you find yourself drawn to these topics, I recommend this book (you can skip the HLA/assembly parts). Note that it's a bit old-school (I read it more than 10 years ago, but it left quite an impression), so there may be newer and better resources to learn from.

Words

A computer "word" is a unit for grouping bits. For example, a word can be 8, 16, 32 or 64 bits wide. Typically a word's width is tied to the width of the CPU architecture (e.g. a 64-bit CPU), but here we'll take "word" to mean "a fixed-size group of N bits", where N is the number of bits.
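To make the idea of word width concrete, here is a minimal sketch using Node's built-in Buffer. The byte values are made up for illustration, and the multi-byte reads use big-endian order, which the Endianness section covers.

```javascript
// Four bytes as they might appear in a file (values made up for illustration).
const bytes = Buffer.from([0x01, 0x23, 0x45, 0x67]);

// The same starting offset, read as words of different widths.
const word8 = bytes.readUInt8(0);     // 1 byte  -> 0x01
const word16 = bytes.readUInt16BE(0); // 2 bytes -> 0x0123
const word32 = bytes.readUInt32BE(0); // 4 bytes -> 0x01234567

console.log(word8.toString(16), word16.toString(16), word32.toString(16));
// prints "1 123 1234567"
```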

Endianness

The term "endian" comes from "end". When you look at a sequence of bytes and want to convert a group of them to a plain old number, endianness denotes which end of the number comes first: in big endian the first byte is the most significant ("bigger") one, and in little endian the first byte is the least significant ("little") one.

For example, there are two ways to look at the couple of bytes appearing in a binary file: 01 23.

| byte order    | bytes in file | read as (hex) | value (unsigned) |
| big endian    | 01 23         | 0123          | 291              |
| little endian | 01 23         | 2301          | 8961             |


Obviously, there is a big difference. So which order does the program or file you are inspecting use? That is something you must get from the documentation; typically there is nothing in the file itself to tell you (with some exceptions that are out of scope here).

The Wikipedia article on endianness is a great reference if you want to know more.
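The two readings of 01 23 can be reproduced with Node's built-in Buffer methods. This is a small sketch; the variable names are only illustrative.

```javascript
// The two bytes from the example, exactly as they sit in the file.
const pair = Buffer.from([0x01, 0x23]);

// Same bytes, two interpretations.
const asBigEndian = pair.readUInt16BE(0);    // first byte is the most significant: 0x0123
const asLittleEndian = pair.readUInt16LE(0); // first byte is the least significant: 0x2301

console.log(asBigEndian, asLittleEndian);
// prints "291 8961"
```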

Numbers (signed, unsigned)

To complete the picture: when a number is serialized to binary form (file, packet, etc.) it is represented by N words, and with M bytes per word these are stored as N*M bytes. So if a number maps onto one word, and a word is 2 bytes in our file layout, then we need to evaluate those 2 bytes in the context of endianness.

Finally, we also need to decide whether the number is signed or unsigned. I'll kindly leave it to you as personal research to see the various ways there are to represent signed numbers in binary data, and which ones are the most common.
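As a starting point for that research, here is a small sketch of how the same two bytes yield different numbers depending on signedness. It relies on Node's Buffer, whose signed reads interpret the bytes as two's complement (one representation among several, and a very common one). The byte values are made up.

```javascript
// Two bytes with the top bit set (values made up for illustration).
const raw = Buffer.from([0xff, 0xfe]);

// Unsigned: all 16 bits contribute to the magnitude.
const unsignedValue = raw.readUInt16BE(0);

// Signed (two's complement): the top bit makes the value negative.
const signedValue = raw.readInt16BE(0);

console.log(unsignedValue, signedValue);
// prints "65534 -2"
```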

Patterns in binary data

Obviously, files or packets that go through the network can represent anything when looked at through our binary goggles. When you glance over some of the protocols and file formats out there, you'll find some common patterns and terminology:

  • Header - It is common to lay out the first few hundred bytes of a binary file so that you have a 'map' of what the file holds. Typically that's called a 'header'.
  • Segments - typically a file is split into multiple segments to indicate categorization or hierarchy. A segment can be aligned to a convenient number of bytes (e.g. 1 KB), and the same type of segment can repeat.
  • Padding - when a block of consecutive bytes ends at an odd length (for example, 7), it is most convenient from both the serializer's and the deserializer's point of view to give it a little nudge and make it an 8-byte block (the added bytes will typically be 0). This goes to show how important word alignment is (a word is typically 16, 32 or 64 bits, which is an even number of bytes).
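The padding arithmetic can be sketched in a few lines. The helper names paddedLength and padBuffer are hypothetical, made up for this illustration, and zero is assumed as the fill byte.

```javascript
// Round a length up to the next multiple of `alignment` (both in bytes).
function paddedLength(length, alignment) {
  return Math.ceil(length / alignment) * alignment;
}

// Copy `data` into a zero-filled buffer whose size is a multiple of `alignment`.
function padBuffer(data, alignment) {
  const out = Buffer.alloc(paddedLength(data.length, alignment)); // Buffer.alloc zero-fills
  data.copy(out);
  return out;
}

// A 7-byte block nudged up to 8 bytes; the extra byte is 0.
const block = padBuffer(Buffer.from([1, 2, 3, 4, 5, 6, 7]), 8);
console.log(block.length, block[7]);
// prints "8 0"
```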

A word about Node.js

First, I'd like to get out of the way the fact that I'm not married to Node.js. In other words, I won't go to ridiculous lengths to make it work for me, or to make you like it.

In this particular case I had some requirements, and Node.js fulfilled them more than happily in terms of performance and async behaviour.

Practical binary parsing with Node

You can either do it by hand and risk your sanity given a complex enough language (in our case a "language" is a binary format or protocol), or use one of these kinds of tools, listed with their Node-ish counterparts:

  • Traditional parser generators (bison, etc) - jison
  • Newer, arguably easier parser generators (PEG parsers) - Canopy
  • Binary data parser (for lack of a better name) - node-binary, Dissolve
  • Nifty bit-strings (Erlang) - bitsyntax-js

In this case I'll be using Substack's binary data parser, node-binary. At the time of writing I wasn't aware of Dissolve, which looks very similar in terms of API but claims to be newer and simpler in implementation; hopefully I'll get to evaluate it and give a good comparison soon.

Unfortunately the format I was parsing is proprietary, so I can't just paste the format and the solution here and be done :(.

Let's instead make our own binary format, but use common patterns that I'm sure you'll bump into (check out a PE header for a more elaborate example).

Here's our imaginary file format (size of fields in brackets):

HEADER

| magic [2] | size of header [2] | meta1 [2] | meta2 [2] | start of segment [2] | meta3 [2] |


FIRST SEGMENT

| segment type:content [2] | segment length [2] | segment binary data |


NEXT SEGMENT

| segment type:content [2] | segment length [2] | segment binary data |


LAST SEGMENT (FOOTER)

| segment type:footer [2] | footer-meta1 [2] |


Some important points here:

  • Segments repeat.
  • Last segment is special.
  • Each segment tells you how long it is; implicitly, that is the number of bytes you need to skip in order to get to the next segment.
  • The header looks like a convenient fixed map of meta data.
  • Magic is something that exists in many files to indicate their type; most importantly, if it doesn't match there's no point in continuing to parse.
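To have something concrete to parse, here is a hedged sketch that builds a small sample file in this format with Node's Buffer. Several things are assumptions rather than part of the format description: the magic value 0xCAFE and the payload bytes are invented, all fields are written big endian, the segment type is written as a 2-byte field (which is how the parser further down reads it), and the segment length counts its own 4 bytes of type and length. The type values 0xFF01 and 0xFFFF are the ones the parser checks for.

```javascript
const SEGMENT_CONTENT = 0xff01; // type value the parser below treats as a regular segment
const SEGMENT_FOOTER = 0xffff;  // type value the parser below treats as the last segment

function makeHeader() {
  const header = Buffer.alloc(12);
  header.writeUInt16BE(0xcafe, 0);  // magic (hypothetical value)
  header.writeUInt16BE(12, 2);      // size of header
  header.writeUInt16BE(67, 4);      // meta1
  header.writeUInt16BE(18, 6);      // meta2
  header.writeUInt16BE(12, 8);      // start of segment
  header.writeUInt16BE(0x1111, 10); // meta3
  return header;
}

function makeSegment(payload) {
  const head = Buffer.alloc(4);
  head.writeUInt16BE(SEGMENT_CONTENT, 0);
  head.writeUInt16BE(payload.length + 4, 2); // assumed: length includes these 4 bytes
  return Buffer.concat([head, payload]);
}

function makeFooter(meta) {
  const footer = Buffer.alloc(4);
  footer.writeUInt16BE(SEGMENT_FOOTER, 0);
  footer.writeUInt16BE(meta, 2); // footer-meta1
  return footer;
}

const sampleFile = Buffer.concat([
  makeHeader(),
  makeSegment(Buffer.from([1, 2, 3, 4])),
  makeSegment(Buffer.from([1, 2, 3, 4, 5, 6])),
  makeFooter(0xdead),
]);

console.log('sample file is', sampleFile.length, 'bytes');
// prints "sample file is 34 bytes"
```

Writing the result out with fs.writeFileSync gives you a file to experiment on.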

Using node-binary

Working with node-binary should be straightforward; you can just declare your structure, and only need to work a little harder for recursive or looping structures.

A couple of pitfalls though:

  • Don't nest too much. If you find yourself too nested with loops, taps and the like, try to find another way - perhaps separate into different functions or logical blocks.
  • I really, really, recommend working in small iterations if you're doing it manually. An auto-running test suite should do perfectly if you're serious about it :).

Header is linear and easy

  binary.parse(buf)
        .skip(4) //magic+size of header
        .word16bu('meta1')
        .word16bu('meta2')
        .word16bu('segStart')
        .word16bu('meta3')

Here, word16bu is "a word of 16 bits, (b)ig endian (u)nsigned". Next, the segments are read in a loop:

  .loop(function (endSeg, vars) {
    // stop once every byte has been consumed
    if (this.eof()) {
      return endSeg()
    }

    this.word16bu('frameType')
        .word16bu('frameLen')
        .tap(function (vars) {
          if (vars.frameType == 0xFF01) { // regular segment
            // frameLen includes the 4 bytes of type and length already read
            this.skip(vars.frameLen - 4)
          } else if (vars.frameType == 0xFFFF) { // last segment
            console.log("Done.", vars.frameLen);
            endSeg();
          }
        })
  })

I used tap to get into the pipeline and inspect the currently populated variables, which helps with making decisions based on what we've seen so far.

Byte calculations like vars.frameLen - 4 are very typical of such work. In this instance we had to subtract the bytes we had already read: 2 bytes for the segment type and 2 for the length.
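For comparison, the same offset arithmetic can be done by hand with nothing but Node's Buffer. This is a rough sketch, not the node-binary approach: walkSegments is a hypothetical helper, the magic value 0xCAFE and the sample bytes are made up, and the segment length is assumed to include its own 4 bytes of type and length.

```javascript
const MAGIC = 0xcafe;        // hypothetical magic value
const HEADER_SIZE = 12;      // six 2-byte header fields
const SEGMENT_HEAD_SIZE = 4; // 2 bytes of type + 2 bytes of length
const TYPE_CONTENT = 0xff01;
const TYPE_FOOTER = 0xffff;

function walkSegments(file) {
  // No point in going on if the magic doesn't match.
  if (file.length < HEADER_SIZE || file.readUInt16BE(0) !== MAGIC) {
    throw new Error('bad magic, not our format');
  }
  const lengths = [];
  let offset = HEADER_SIZE; // segments start right after the header
  while (offset + SEGMENT_HEAD_SIZE <= file.length) {
    const type = file.readUInt16BE(offset);
    const length = file.readUInt16BE(offset + 2);
    if (type === TYPE_FOOTER) {
      // In the footer the second field is footer-meta1, not a length.
      return { lengths: lengths, footerMeta: length };
    }
    if (type !== TYPE_CONTENT || length < SEGMENT_HEAD_SIZE) {
      throw new Error('unexpected segment at offset ' + offset);
    }
    lengths.push(length);
    // Jumping `length` from the segment start equals reading the 4 head
    // bytes and then skipping (length - 4).
    offset += length;
  }
  throw new Error('ran out of bytes before the footer');
}

// Header, an 8-byte segment, a 10-byte segment, then the footer.
const sample = Buffer.from(
  'cafe000c00430012000c1111' + 'ff01000801020304' +
  'ff01000a010203040506' + 'ffffdead', 'hex');

const walked = walkSegments(sample);
console.log(walked.lengths, walked.footerMeta.toString(16));
// prints "[ 8, 10 ] dead"
```

The guard on length matters in hand-written loops like this one: a corrupt length of 0 would otherwise never advance the offset.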

And finally

  var binary = require('binary')

  // buf is a Buffer holding the whole file
  binary.parse(buf)
    .skip(4) // magic + size of header
    .word16bu('meta1')
    .word16bu('meta2')
    .word16bu('segStart')
    .word16bu('meta3')
    .loop(function (endSeg, vars) {
      // stop once every byte has been consumed
      if (this.eof()) {
        return endSeg()
      }

      this.word16bu('frameType')
          .word16bu('frameLen')
          .tap(function (vars) {
            if (vars.frameType == 0xFF01) { // regular segment
              console.log("Found segment, length:", vars.frameLen)
              this.skip(vars.frameLen - 4)
            } else if (vars.frameType == 0xFFFF) { // last segment
              console.log("Done:", vars.frameLen.toString(16));
              endSeg();
            }
          })
    })
    .tap(function (vars) {
      console.dir(vars)
    })

Happily, here's the output:

parsing 34 bytes
Found segment, length: 8
Found segment, length: 10
Done: dead
{ meta1: 67,
  meta2: 18,
  segStart: 12,
  meta3: 4369,
  frameType: 65535,
  frameLen: 57005 }

As a closing tip, a very nice thing you get for free from node-binary is a stream for your parser, meaning you can pipe data into it instead of reading an entire file into memory and then parsing it.

Learning More

To see a complete example of node-binary in use, check out node-rfb. In it, @substack implements the RFB protocol used by VNC, using the tools I've outlined above. Reading the code for the handshake, for example, is very interesting.