Qstruct


Doug Hoyte

Binary Serialisation Scheme

Alternatives

There are many other similar formats, all have their strengths and weaknesses:

Design objectives of Qstruct

Design anti-objectives of Qstruct

Primitive Types

Schema/Perl Usage

Schema:
    qstruct Account {
      id @0 uint64;
      balance @1 double;
    }

    qstruct User {
      id @0 uint64;
      username @1 string;
      email @3 string;
      accounts @2 Account[];
    }
  
Perl usage:
my $msg = User->encode({
            id => 389,
            username => 'jimmy',
            email => 'jimmy@example.com',
            accounts => [
              { id => 811, balance => 23.3 },
              { id => 131, balance => -900 },
            ]
          });

my $user = User->decode($msg);

say "User id is " . $user->id;

foreach (@{ $user->accounts }) {
  $balance += $_->balance;
}
  

Message Layout

Example message: Header and Body

my $msg = User->encode({ name => "hello world!", id => 100,
                         is_admin => 1, is_locked => 1, });
  
    00000000  00 00 00 00 00 00 00 00  20 00 00 00 01 00 00 00  |........ .......|
              |---------------------header-------------------|

    00000010  64 00 00 00 00 00 00 00  03 00 00 00 00 00 00 00  |d...............|
              |--------id (@0)------|  || |----free space----|
                                       ||
                                       |->is_admin|is_locked (@1|@3)

    00000020  0c 68 65 6c 6c 6f 20 77  6f 72 6c 64 21 00 00 00  |.hello world!...|
              || |--------inline string data--------| |--pad-|
              ||
              |-> name (@2) tag byte indicating length of inline string
  

The Heap

If a string is too large to fit in a tagged-size, it must be stored on the heap (same for blobs/dynamic-size arrays):
    HDR:  00000000  00 00 00 00 00 00 00 00  20 00 00 00 01 00 00 00  |........ .......|
    CONT: 00000010  64 00 00 00 00 00 00 00  03 00 00 00 00 00 00 00  |d...............|
    CONT: 00000020  00 18 00 00 00 00 00 00  30 00 00 00 00 00 00 00  |........0.......|
                    |-------size << 8-----|  |----start offset-----|
    HEAP: 00000030  74 6f 6f 20 6c 6f 6e 67  20 66 6f 72 20 74 61 67  |too long for tag|
    HEAP: 00000040  67 65 64 20 73 69 7a 65                           |ged size|
  

Arrays

Ragel


      /* Here is the format we are parsing: */
      qstruct User {
        id @0 uint64;
        active @4 bool;
        name @2 string;
        email_addrs @3 string[]; # dynamic array (pointer-based)
        sha256_checksum @1 uint8[32]; # fixed array (inlined in body)
        accounts @5 Account[]; # array of nested Account qstructs
      }
  

Zero-Copy

Due to perl's return-value semantics, this copies the field data:
my $jpg = $user->profile_image;
  
With "decoded" qstruct objects there is an alternative zero-copy accessor:
$user->profile_image(my $jpg);
  
Especially with larger data, copying is bad because it takes time and it puts often needless pressure on your memory system especially if you are only interested in a portion of the data. Example: the profile image is interlaced/progressive.

Perl 5.20 adds copy-on-write string copies/slices which can accomplish roughly the same thing

Qstruct::ArrayRef

When you extract an array from a qstruct, it actually returns a magic Qstruct::ArrayRef. This acts like an array but is often more efficient. For example, this is lazy because it only accesses the first element:
    say "First acct id: " . $user->accounts->[0]->id;
  
And even though this loop accesses each element, it avoids copying unnecessary data out of each one:
    foreach my $p (@{ $user->profiles }) {
      push @profiles, $p->id; ## Not touching $p->image_data
    }
  

Zero-Copy Arrays

Just as field accessors have zero-copy versions, so do array references:
    $user->emails->get(0, my $first_email);
  
And zero-copy iteration:
    $user->emails->foreach(sub {
      say "Email: ", $_[0];
    });
  

Perl Implementation

Qstruct::Compiler

An example accessor

static QSTRUCT_INLINE int qstruct_get_uint64(char *buf, size_t buf_size,
                                             uint32_t body_index, uint32_t byte_offset,
                                             uint64_t *output) {
  QSTRUCT_GETTER_PREAMBLE(8)

  if (exceeds_bounds) {
    *output = 0; // default value
  } else {
    QSTRUCT_LOAD_8BYTE_LE(buf + actual_offset, output);
  }

  return 0;
}
  
And used like so (64-bit unsigned int, 0th body, 24 byte offset):
int ret = qstruct_get_uint64(buf, buf_size, 0, 24, &my_uint64);
  

A compiled accessor

   0x0000000000400590 <+0>:     cmp    $0xf,%rsi
   0x0000000000400594 <+4>:     jbe    0x4005b1
   0x0000000000400596 <+6>:     mov    0xc(%rdi),%ecx
   0x0000000000400599 <+9>:     mov    0x8(%rdi),%edx
   0x000000000040059c <+12>:    test   %ecx,%ecx
   0x000000000040059e <+14>:    je     0x4005b1
   0x00000000004005a0 <+16>:    xor    %eax,%eax
   0x00000000004005a2 <+18>:    cmp    $0x1f,%edx
   0x00000000004005a5 <+21>:    jbe    0x4005b1
   0x00000000004005a7 <+23>:    cmp    $0x17,%rsi
   0x00000000004005ab <+27>:    jbe    0x4005b1
   0x00000000004005ad <+29>:    mov    0x28(%rdi),%rax
   0x00000000004005b1 <+33>:    repz retq
  
QSTRUCT_GETTER_PREAMBLE: Bounds checking and default values — I'm planning on optimising this. Also, if you access many values, you will only need to do preamble once.

QSTRUCT_LOAD_8BYTE_LE: The actual value loading is one instruction:
0x28 == 40 == 16 (header) + 24 (field offset)

Testing

Conclusion



Questions?