Writing To A Binary Stream In C/C++: Harder Than It Should Be

Suppose you want to write some code to communicate using a binary protocol. You would think C (or C++) was a natural language for this. After all, low level data manipulation is one of it’s strengths. But it’s surprisingly hard to get this right — and surprisingly easy to write code that works now, but might stop working in the future.

Here’s a simplified version of the header:

struct request_header {
    uint8_t magic;  // Always 0x80
    uint8_t opcode; // e.g. get, set, add, etc.
    uint16_t key_length;
    uint32_t body_length;

How do you format a packet to send? Here’s the most straightforward way:

void write_get_header(uint8_t *buffer, uint16_t key_length)
    request_header *header = reinterpret_cast(buffer);
    header->magic = 0x80;
    header->opcode = 0x00; // GET
    header->key_length = key_length;
    header->body_length = key_length;

This is exactly what memcached itself does. The problem is, it’s not guaranteed to do what we want, because it uses undefined behavior. The reinterpret_cast is valid (if buffer happens to be aligned to the alignment of request_header, more on this later), but dereferencing the pointer leads to undefined behavior. Note that casting the other direction — casting an existing request_header to uint8_t * — lets you dereference, which is handy when you’re reading from the wire. But when formatting data to send, you’re out of luck.

On x86 with common compilers (g++, clang++, I assume MSVC) this compiles, runs, and does the right thing. So what’s the problem? If you change optimization settings, or upgrade you compiler, it could stop working, because according to the C++ language spec, the compiler is allowed to generate code that does just about anything, including going into an infinite loop or reformatting your hard drive. And this stuff can actually happen. There are contests for Craziest Compiler Output due to Undefined Behavior, some of the source files look quite reasonable, yet don’t do what you expect. Here’s a good introduction, and What Every C Programmer Should Know About Undefined Behavior.

If the compiler won’t warn you, even with -Wall -Wextra, and the code runs just fine, how can you detect undefined behavior? In general there’s no reliable way. Valgrind finds some of it, but because it only works on the compiled output, it won’t find this. Clang has an “undefined behavior sanitizer,” which doesn’t catch the reinterpret_cast, although it does catch the unaligned writes (see below).

But low level data manipulation like this are strengths for C and C++. So there must be a standard idiom for this, right?

The memcached server itself has this problem when it prepares to send replies. It uses the “union” of the struct with a char array of the same size:

union {
    struct {
        uint8_t magic;  // Always 0x80
        uint8_t opcode; // e.g. get, set, add, etc.
        uint16_t key_length;
        uint32_t body_length;
    } request;
    uint8_t bytes[8];
} request_header;

The idea is to take your uint8_t *, cast it to a request_header *, then write to it through the field. That’s supported in C99, but unfortunately not in C++.

And even in C, it’s only supported with caveats. The struct can have padding, and the amount of padding is up to the compiler and the target architecture.  This doesn’t lead to undefined behavior, but it does lead to incorrect behavior: putting data in the wrong place.  If you restrict yourself to processors with byte addressing and power-of-two word size you’re probably ok, because of the way the fields are laid out: the total size of all fields before the uint16_t, assuming no padding, are a multiple of 16 bits, and before the uint32_t, 32 bits.

To tell the compiler you don’t want padding, you can use #pragma pack But now you’re not using ISO standard C, meaning your code isn’t portable to other architectures. And if #pragma pack actually changes the layout of your struct, accesses to it are unaligned, pretty much by definition. What’s the problem with that?

Even on x86_64, uint16_t has an alignment of 2, which means it’s address needs to be even. If we start our header at an odd address, because the previous request had an odd-lengthed key, we’ll be writing through an odd address which is — you guessed it — undefined behavior, and on some popular architectures can be hundreds of times slower, or can simply seg fault. And even on x86, a compiler is free to assume that the lower order bit of the pointer is zero and store other data there if it likes.

In practice, on x86 such writes turn into movl assembly instructions, which will do the right thing with un-aligned addresses. But again, we’d rather not have to rely in intimate knowledge of the compiler and processor. We’d like to say what we mean in valid C/C++, so the compiler can exploit all that. C/C++ is designed for low level stuff like creating network messages, right? So what’s the right idiom?

The official way is to use memcpy() and trust the compiler to optimize it away:

void write_get_header_with_memcpy(uint8_t *buffer, uint16_t key_length)
    uint8_t magic = 0x80;
    memcpy(buffer, &magic, sizeof(uint8_t));  // Not needed for uint8_t, but I like the consistency.

    uint8_t opcode = 0x00;
    memcpy(buffer+1, &opcode, sizeof(uint8_t));  // Not needed for uint8_t, but I like the consistency.

    uint16_t key_length_network_order = htons(key_length);
    memcpy(buffer+2, &key_length_network_order, sizeof(uint16_t));

    uint32_t body_length_network_order = htonl(key_length);
    memcpy(buffer+4, &body_length_network_order, sizeof(uint32_t));

Because memcpy needs an address, we need to create temporaries, so every line turns into two. If we’re not worried about padding problems we can use offsetof(), otherwise we need to compute the offsets ourselves. This is pretty verbose and error prone.

Helper functions can make things a little better, e.g.:

void write_uint16_t(uint8_t *buffer, uint16_t value)
    int16_t value_network_order = htons(value);
    memcpy(buffer, &value_network_order, sizeof(uint16_t);

And you can introduce constants for the offsets, to make things a little more readable. But it’s still harder to read than the struct version.

You could also write the entire header in a local variable, then memcpy() the entire thing to the buffer, and hope the compiler optimizes out the copy. I tried this, and in simple example code, it worked! But in production code, with argument checks and returning the address of the next byte to write to, neither gcc nor clang were able to optimize away the memcpy().

So in the end, people mostly use the struct, assume there won’t be any extra padding as long as the total size of all fields before a given one is a multiple of the field’s size, with the union (most often, because it works in C) or the reinterpret_cast, and that writing to unalinged memory won’t cause any problems. This is exactly what memcached itself does.

This entry was posted in Uncategorized. Bookmark the permalink.

3 Responses to Writing To A Binary Stream In C/C++: Harder Than It Should Be

  1. Alex says:

    The extremely interesting thing is that both gcc and clang are indeed capable of optimizing the htons/memcpy soup!

    gcc for example produces this:

    write_get_header_with_memcpy(unsigned char*, unsigned short):
    movl %esi, %eax
    movb $-128, (%rdi)
    movzwl %si, %esi
    rorw $8, %ax
    bswap %esi
    movb $0, 1(%rdi)
    movw %ax, 2(%rdi)
    movl %esi, 4(%rdi)

    And if you trim it down and remove the htons/htonl to have the same logic – the versions emit the same code. Amazing.

  2. Kat Marsen says:

    Why on earth would you need to reinterpret_cast when writing??? To write, you could simply do this:

    request_header h;
    h.magic = 0x80;
    h.opcode = 0x0;
    h.key_length = key_length;
    h.body_length = body_length;
    file.write(&h, sizeof(h));

    If, for some reason, you’ve been handed a buffer which is typed for some bizarro reason as uint8* and you want to fill it in, you still don’t need a cast!

    void* place = uint8_buffer_ptr;
    request_header* ph = new (place) request_header;
    request_header& h = *ph;
    … fill it in & write it …

    This is placement-new, and it is the way that most standard library collections initialize existing allocations on-demand. Technically the uint8_buffer_ptr should just be a default-signed char*, and then there is no undefined behavior. Also the placement-new is silly of course in this case because the constructor is trivial.

    Note that in the former example, the stack scoped use, everything is aligned properly. If some poorly-designed architecture gives you a uint8* to fill in, and hasn’t bothered to handle alignment for you, then yes, you should fill in the stack-scoped version and then

    request_header h;
    //fill it in
    memcpy(stupid_buffer_from_stupid_framework, &h, sizeof(h));

    • Thanks for commenting Kat! Calling file.write() involves an extra copy of the data to an internal buffer, which can significantly affect performance. Also, if you call the write(2) system call, that’s a context switch from user to kernel mode, which takes a really long time, so you want to do as few of those as possible. So you want to fill in your own buffers as much as possible.

      As for placement new, that’s interesting, I hadn’t thought of that. But if the pointer doesn’t have the same alignment as the struct, then it’s an unaligned write and undefined behavior, as far as I know. I think you acknowledge as much when you say “and hasn’t bothered to handle alignment for you, then yes, you should fill in the stack-scoped version and then [memcpy].”

      The extra memcpy can take significant time, one of the goals of work like this is to minimize the number of copies, hopefully just the one copy from user to kernel space.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s