Adventures in Binary Serialisation

The other day I came across an interesting question¹ - instead of faffing around with serde and dealing with the compile-time hits, can I just write the binary data from RAM to a file? This is just an exploration of that problem. I thought this would be simple, but I ended up finding some interesting UB and digging into how Vec allocates just a little bit.

If you find any factual errors with this article - I’ll happily take responsibility, edit it and provide appropriate credit. I’ve tried my best and that’s kinda all I can give. See contact details here.

Because, at least for now, this is just an exercise in having fun, I’ll be using whatever structs please me² without any real focus on real-world usage. I’ll also be expecting some intermediate to advanced knowledge of programming because I’ll be skipping over some basic explanations. This will also be specific to Rust and other manual-memory management languages.

Plain-Old-Data⌗

Firstly, I have to work out some data to store, so I’ll start with a simple struct:

struct ExampleData {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

fn example_data () -> ExampleData {
    ExampleData {
        a: 0,
        b: u64::MAX,
        c: [10, 11, 12, 13, 14, 15, 16],
        d: 123456789
    }
}

I’ve carefully chosen the types and their positions (but I’ll get more into that later), but all that matters for now is that they’re all stored on the stack³.

Serialising⌗

Guess what? We’re all done here, because the compiler does all the work! The struct must be somewhere in memory and we just need to work out how to trick the compiler into letting us see the struct as just a series of bytes.

After some real soul-searching⁴, I managed to find this:

unsafe fn any_as_u8_slice<T: Sized>(p: &T) -> &[u8] {
    ::core::slice::from_raw_parts(
        (p as *const T) as *const u8,
        ::core::mem::size_of::<T>(),
    )
}

This takes a reference to a Rust object (which has a finite and known-at-compile-time size), and creates a slice of bytes starting at the address of the reference with the length being the length of the struct. This looks sound to me as we know that the slice will only be valid for as long as the struct (because of lifetimes which are elided here), and we’re only accessing memory that we know belongs to the struct.

If we just print this to stdout, we see something really interesting that Rust does that lots of other languages don’t:

[255, 255, 255, 255, 255, 255, 255, 255, 21, 205, 91, 7, 0, 10, 11, 12, 13, 14, 15, 16, 0, 0, 0, 0]

I deliberately picked some quite distinctive values for the example data - we can see the 8 255s for the u64::MAX, the [21, 205, 91, 7] for our u32, the 0 for the u8, the [10, 11, 12, 13, 14, 15, 16] for the u8 array. However, there are two main problems here - firstly, the struct’s been reordered and there’s also an extra 4 0s on the end?

The first is a consequence of Rust not having a stable ABI - that is to say that the compiler reserves every right to screw around with the internal representation of anything and everything if it works out it can pack it more efficiently. The second is a consequence of padding - unless we tell it otherwise, the compiler always rounds the size of the struct up to a multiple of ~~the word length of the PC running the code (here, 8 bytes for a 64-bit word)~~ correction: its alignment, which is usually the largest alignment of any field inside - here, a u64, so 8 bytes. This just allows reads to be far more efficient for reasons I’m not 100% sure on - I just chalk it up to one of those things that we deal with in exchange for lightning rock magic. correction: some CPU instructions also require alignment to work at all, so it’s not just for performance reasons.

We can fix the first by adding a little #[repr(C)] to just before our struct declaration which tells the compiler to follow the C style which preserves ordering. However, this then balloons our struct to 32 bytes rather than just 24. The only reason you’d do so would be if you were trying to do some form of FFI, but there are far better solutions.
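To make the size and alignment differences concrete, here is a minimal sketch that prints both for the same fields under three representations. The struct names (DefaultLayout, CLayout, PackedLayout) are made up for this illustration. The default and #[repr(C)] numbers depend on the compiler and target, so I’ve only noted the packed result, which follows directly from the field sizes (1 + 8 + 7 + 4).

```rust
use std::mem::{align_of, size_of};

// The same fields three times, differing only in the `repr` attribute.
// These type names are hypothetical and exist only for this comparison.
#[allow(dead_code)]
struct DefaultLayout {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

#[allow(dead_code)]
#[repr(C)]
struct CLayout {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

#[allow(dead_code)]
#[repr(packed)]
struct PackedLayout {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

fn main() {
    // The compiler is free to reorder the fields here.
    println!(
        "default: size {}, align {}",
        size_of::<DefaultLayout>(),
        align_of::<DefaultLayout>()
    );
    // Declaration order is kept, so padding appears after `a` and after `c`.
    println!(
        "repr(C): size {}, align {}",
        size_of::<CLayout>(),
        align_of::<CLayout>()
    );
    // No padding at all: size 20, align 1.
    println!(
        "packed:  size {}, align {}",
        size_of::<PackedLayout>(),
        align_of::<PackedLayout>()
    );
}
```

On a typical 64-bit target I’d expect this to match the numbers quoted above (24 bytes by default, 32 with #[repr(C)]), but only the packed figures are something I’d rely on.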

However, if you run this through miri, you’ll see that this function exhibits undefined behaviour. This is because we’re accessing uninitialised memory. At first, you might be confused as to why - it’s the padding. Padding is uninitialised memory and so this function is UB (I’ll discuss UB more later, don’t worry). To fix this, we can instead #[repr(packed)], which removes all padding at a possible cost to performance. That having been said, for future learning opportunities⁵, we’ll pretend that the struct isn’t packed for the rest of this article.
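If you’d rather avoid the padding question entirely, one alternative (not what the rest of this article does) is to write each field out by hand. The sketch below is a minimal illustration under that assumption: encode, decode and ENCODED_LEN are names I’ve invented, and the little-endian field order is an arbitrary choice for the example.

```rust
#[derive(Debug, PartialEq)]
struct ExampleData {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

// Hypothetical on-disk size: the sum of the field sizes, with no padding.
const ENCODED_LEN: usize = 1 + 8 + 7 + 4;

// Writes every field in declaration order, integers as little-endian.
fn encode(data: &ExampleData) -> [u8; ENCODED_LEN] {
    let mut out = [0u8; ENCODED_LEN];
    out[0] = data.a;
    out[1..9].copy_from_slice(&data.b.to_le_bytes());
    out[9..16].copy_from_slice(&data.c);
    out[16..20].copy_from_slice(&data.d.to_le_bytes());
    out
}

// Reads the fields back from the same offsets.
fn decode(bytes: &[u8; ENCODED_LEN]) -> ExampleData {
    let mut b = [0u8; 8];
    b.copy_from_slice(&bytes[1..9]);
    let mut c = [0u8; 7];
    c.copy_from_slice(&bytes[9..16]);
    let mut d = [0u8; 4];
    d.copy_from_slice(&bytes[16..20]);
    ExampleData {
        a: bytes[0],
        b: u64::from_le_bytes(b),
        c,
        d: u32::from_le_bytes(d),
    }
}

fn main() {
    let original = ExampleData {
        a: 0,
        b: u64::MAX,
        c: [10, 11, 12, 13, 14, 15, 16],
        d: 123456789,
    };
    let bytes = encode(&original);
    println!("{:?}", bytes);
    assert_eq!(decode(&bytes), original);
}
```

This needs no unsafe and the output doesn’t depend on how the compiler lays the struct out, at the cost of writing the boring bits yourself.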

For now, I’ll write those bytes to a file.

use std::{io::Write, fs::File};

fn main () -> std::io::Result<()> {
    let eg = example_data();
    let bytes = unsafe { any_as_u8_slice(&eg) };
    File::create("out.bin")?.write_all(bytes)?;

    Ok(())
}

Deserialising⌗

Then, on this end, I can read those bytes back in to prove I’m using different bytes.

use std::io::Read;

fn read_in () -> std::io::Result<ExampleData> {
    let mut bytes = vec![];

    const SIZE: usize = 16;
    let mut reader = File::open("out.bin")?;
    let mut tmp = [0; SIZE];
    loop {
        match reader.read(&mut tmp)? {
            0 => break,
            n => {
                bytes.extend(&tmp[0..n]);
            }
        }
    }

    todo!("What now?")
}

I now have a vector full of data, which consists of a pointer to the beginning of the data, the length of that data⁵ and the capacity of the allocation. I now need to do the reverse of what I did before but this is not nearly as easy.

Raw pointer dereference⌗

My first instinct here is just to dereference the pointer like this:

let ptr = bytes.as_ptr() as *const ExampleData;
let example = unsafe { *ptr };

We just have to cast it from a *const u8 to a *const ExampleData, which is easy.

The first problem comes from the fact that we can’t dereference the pointer because we’d need to copy out the data. We can fix that with a quick #[derive(Copy, Clone)]. This then appears to work, until we run it through miri and see that the alignment is incorrect and this is undefined behaviour - we’ll have the same problem with the next method so I’ll dive deeper into it there.
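For what it’s worth, the standard library has std::ptr::read_unaligned, which copies a value out of a pointer without requiring it to be aligned (and without needing Copy). The sketch below swaps that in for the plain dereference. read_example is a name I’ve made up, and it leans on the assumption that every field is a plain integer (or array of integers), so any bit pattern is a valid value.

```rust
#[derive(Debug, PartialEq)]
struct ExampleData {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

// Hypothetical helper: copies an ExampleData out of the front of a byte slice.
fn read_example(bytes: &[u8]) -> Option<ExampleData> {
    if bytes.len() < std::mem::size_of::<ExampleData>() {
        return None; // not enough bytes to fill the struct
    }
    // SAFETY: we've checked there are enough initialised bytes, every field
    // accepts any bit pattern, and read_unaligned has no alignment requirement.
    Some(unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const ExampleData) })
}

fn main() {
    // One spare byte at the front so the slice we read from is very likely
    // to start at an address that is not 8-byte aligned.
    let bytes = vec![0xFFu8; std::mem::size_of::<ExampleData>() + 1];
    let value = read_example(&bytes[1..]).expect("buffer is large enough");
    println!("{:?}", value);
}
```

This is still a copy, just a cheap one onto the stack, so it doesn’t answer the no-copy question that comes up later.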

Boxing⌗

Let’s delete the derive and try again by thinking about how we deal with pointers in Rust.

Well, we typically do that with a Box, which holds a pointer to a struct on the heap. We usually make one of these with Box::new to get it from the stack to the heap, but if the data is already on the heap, we can use Box::from_raw:

let boxed: Box<ExampleData> = unsafe { Box::from_raw(bytes.as_mut_ptr() as *mut ExampleData) };
return Ok(*boxed);

This is very similar to the way we made a slice earlier, but here we don’t need to specify the size as ExampleData has a constant size. This then also works just fine, because we can dereference the box without needing copy and we’ve got the struct back out from the heap. 🥳🥳

However, if you run this just with cargo run, you’ll see a problem. You won’t see a lovely 0 exit code, and instead we get this: error: process didn't exit successfully: 'target\debug\ser.exe' (exit code: 0xc0000374, STATUS_HEAP_CORRUPTION). This implies that we’ve somehow corrupted the heap, so let’s have a gander at the documentation.

The Rust standard library documentation is incredible for having thorough safety discussions and help for unsafe functions. If we look back at the function documentation, we can see that it has this to say:

Constructs a box from a raw pointer.
After calling this function, the raw pointer is owned by the resulting Box. Specifically, the Box destructor will call the destructor of T and free the allocated memory. For this to be safe, the memory must have been allocated in accordance with the memory layout used by Box.
Safety
This function is unsafe because improper use may lead to memory problems. For example, a double-free may occur if the function is called twice on the same raw pointer.

Here, we can now see our error - a classic double-free. Both bytes and boxed point at the same allocation, and each thinks it owns it. When the function exits, the Box frees the memory (we’ve already moved the value out of it with *boxed) and then bytes frees it again, which corrupts the heap. Luckily, the fix here is relatively simple - we just need to add a strategic std::mem::forget:

bytes.shrink_to_fit();
let boxed: Box<ExampleData> = unsafe { Box::from_raw(bytes.as_mut_ptr() as *mut ExampleData) };
std::mem::forget(bytes);
return Ok(*boxed);

std::mem::forget basically just tells Rust specifically not to run the destructor of the object you give it, therefore not freeing the memory. Here, we use it to make sure that bytes never gets freed. We don’t need to worry about the memory getting leaked because it then gets freed when boxed gets dropped. I’ve also added a Vec::shrink_to_fit call to ensure that the spare memory we likely allocated for the Vec doesn’t get lost⁶.

correction: the Vec’s allocation may be reused which would then be UB. Instead, use bytes.leak().

UB?⌗

This all seems to work, but if we run it through miri, we can see that this is UB. Specifically: error: Undefined Behavior: constructing invalid value: encountered an unaligned box (required 8 byte alignment but found 2), which happens when we create the Box with Box::from_raw. Here, we get to alignment issues. I briefly wrote about alignment earlier, and the part that matters here is that alignment is inherent to a specific struct. In fact, if we look at the alloc and dealloc methods they take a Layout which stores size and alignment, unlike C’s malloc and free which just take a size.

If we look at what behaviour is considered undefined (helpfully linked to me by miri), we can see (in the second item) that accessing a misaligned pointer is undefined behaviour. I’ll quickly explain for people coming from C the danger of undefined behaviour in Rust. In C, there are lots of behaviours that people use that aren’t specifically designated by the specification (like integer over/under-flow) which are undefined behaviour. They’re mostly fine there, but Rust UB is a different beast entirely because of how heavily the compiler tries to optimise code. Generally in Rust, no UB is good UB, and if you wrote UB then the compiler reserves the right (and often will, especially in release builds) to do whatever the hell it wants regardless of your intentions. Here, it kinda works but in a larger program nobody can know what the compiler will do.

correction: C UB is also very very bad, and I think I just got confused between undefined behaviour and specification-undefined behaviour. the example I mentioned is also wrong - you can’t rely on signed integer over/under-flow, especially in release builds

Alignment⌗

To actually explain the issue, we need to look at the Layouts of Vec and of ExampleData, which is easy because Layout implements Debug which means we can print it out to get all of its fields.

use std::alloc::Layout;

fn print_layouts () {
    println!("Vec<u8>:     {:?}", Layout::new::<Vec<u8>>());
    println!("ExampleData: {:?}", Layout::new::<ExampleData>());
}

Which then outputs that they’re the same?

Vec<u8>:     Layout { size: 24, align: 8 (1 << 3) }
ExampleData: Layout { size: 24, align: 8 (1 << 3) }

The Same?⌗

This definitely stumped me for a while until I decided to have a look at the documentation & source for std::vec::Vec, and then by clicking on Source I saw that it holds a RawVec buffer and the current length. I’ve added the definition below, simplified to remove the standard library shenanigans:

pub struct Vec<T, A: Allocator = Global> {
    buf: RawVec<T, A>,
    len: usize,
}

RawVec is a private type that Vec uses to wrap lots of the unsafe behaviours, and you can’t actually see it in any documentation. However, you can just click around in the source to find it here:

pub(crate) struct RawVec<T, A: Allocator = Global> {
    ptr: Unique<T>,
    cap: usize,
    alloc: A,
}

alloc is a zero-sized type, and then cap, ptr and the Vec’s own len are each the size of a usize, which is 8 bytes here. If we add up all of the bytes and check our alignments we can then see that yeah it’s 24 bytes with 8 byte-alignment.
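A quick way to convince yourself that this layout describes the Vec handle rather than the bytes it points at is to compare the handle’s size with its capacity. This is only a sketch: the three-usize size is what current standard library versions give, not something the documentation promises.

```rust
use std::mem::size_of;

fn main() {
    // Ask for a heap buffer far larger than the handle itself.
    let v: Vec<u8> = Vec::with_capacity(100);

    // The handle is just pointer + capacity + length, whatever the buffer size.
    println!("size_of::<Vec<u8>>() = {}", size_of::<Vec<u8>>());
    println!("size_of::<usize>()   = {}", size_of::<usize>());
    println!("len = {}, capacity = {}", v.len(), v.capacity());
}
```

So printing Layout::new::<Vec<u8>>() tells us about those three words on the stack, and nothing about the allocation we actually care about.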

To see where our alignment problem from earlier comes from, we have to dig into Vec::push:

pub fn push(&mut self, value: T) {
    if self.len == self.buf.capacity() {
        self.buf.reserve_for_push(self.len);
    }
    unsafe {
        let end = self.as_mut_ptr().add(self.len);
        ptr::write(end, value);
        self.len += 1;
    }
}

We can see that it checks for capacity (and reserves space if we don’t have enough which I’ll get to in just a second), and it then writes to a pointer with the new value. That call for reserving is on self.buf, which is the RawVec from earlier which handles all of the allocation so it’s probably where we want to be looking.

pub fn reserve_for_push(&mut self, len: usize) {
    handle_reserve(self.grow_amortized(len, 1));
}

fn grow_amortized(&mut self, len: usize, additional: usize) -> Result<(), TryReserveError> {
    debug_assert!(additional > 0);
    if T::IS_ZST {
        return Err(CapacityOverflow.into());
    }
    let required_cap = len.checked_add(additional).ok_or(CapacityOverflow)?;

    let cap = cmp::max(self.cap * 2, required_cap);
    let cap = cmp::max(Self::MIN_NON_ZERO_CAP, cap);
    let new_layout = Layout::array::<T>(cap);

    let ptr = finish_grow(new_layout, self.current_memory(), &mut self.alloc)?;
    self.set_ptr_and_cap(ptr, cap);
    Ok(())
}

handle_reserve just deals with errors, and if we peer through grow_amortized which actually allocates (and strategically ignore the amortisation parts and just look for a Layout), we can see that it actually allocates using Layout::array::<T>(cap). If we then print out that layout (using the number of bytes we need for storing an ExampleData for the array size):

fn print_layouts () {
    let size = std::mem::size_of::<ExampleData>();
    println!("Vec<u8>:     {:?}", Layout::array::<u8>(size).unwrap());
    println!("ExampleData: {:?}", Layout::new::<ExampleData>());
}

We get this out the other end - success! The alignment is different which then explains why our Box creation earlier was undefined behaviour.

Vec<u8>:     Layout { size: 24, align: 1 (1 << 0) }
ExampleData: Layout { size: 24, align: 8 (1 << 3) }
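An allocation made with 1-byte alignment can still happen to land on an 8-byte boundary, which is part of why this kind of bug can hide. Here is a small sketch of checking an address at runtime; is_aligned_for is a helper name I’ve made up for the illustration.

```rust
use std::mem::align_of;

// Hypothetical helper: does this address satisfy T's alignment requirement?
fn is_aligned_for<T>(ptr: *const u8) -> bool {
    (ptr as usize) % align_of::<T>() == 0
}

fn main() {
    // A u64 array is guaranteed to start at a u64-aligned address.
    let words = [0u64; 2];
    let start = words.as_ptr() as *const u8;
    println!("start aligned for u64:     {}", is_aligned_for::<u64>(start));
    println!(
        "start + 1 aligned for u64: {}",
        is_aligned_for::<u64>(start.wrapping_add(1))
    );

    // A Vec<u8> buffer only promises 1-byte alignment, so this answer
    // depends on whatever the allocator happened to hand back.
    let bytes = vec![0u8; 24];
    println!(
        "Vec<u8> buffer aligned for u64: {}",
        is_aligned_for::<u64>(bytes.as_ptr())
    );
}
```

Even when the last line prints true, the allocation was still requested with the wrong layout, so passing that pointer to Box::from_raw stays unsound.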

The obvious next question is whether there’s any way to convert between these two without copying (since that’s why we’ve gotten into this whole mess). That having been said, I’ll take a brief aside to explain how to do this if you’re OK with allocating another 24 bytes.

Copying again (but without `Copy` needed nor any UB)⌗

The following code (according to miri and my own gut feeling⁷) is unsafe, but sound (meaning that it doesn’t cause UB):

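// Writing more bytes than we allocated would be UB, and writing fewer would leave part of the struct uninitialised, so check the length first.
if bytes.len() != std::mem::size_of::<ExampleData>() {
    return Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "unexpected length"));
}

return Ok(unsafe {
    let layout = Layout::new::<ExampleData>();
    let mut ptr = std::alloc::alloc(layout);
    if ptr.is_null() {
        std::alloc::handle_alloc_error(layout); //the allocation failed, so bail out
    }

    for b in bytes {
        ptr.write(b);
        ptr = ptr.add(1);
    }

    ptr = ptr.sub(std::mem::size_of::<ExampleData>());
    *Box::from_raw(ptr as *mut ExampleData)
});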

We create a pointer using alloc with the correct layout. We then write each byte and increment the address of the pointer each time. We then take the pointer back to the start and give it to a Box to be managed. That Box is then responsible for freeing the heap memory we allocated and we also use it to get the data out from the heap back to the stack. Don’t ask me how it does that - I can’t tell for the life of me. There’s probably a better way to do it, but I can at least say that this won’t cause memory leaks, memory corruption or compiler mayhem.
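On the “probably a better way” front, one option that skips the heap allocation is to copy the bytes into a MaybeUninit on the stack with std::ptr::copy_nonoverlapping. This is a sketch rather than a drop-in replacement: from_bytes is a made-up name, and it assumes (as before) that every field accepts any bit pattern.

```rust
use std::mem::{size_of, MaybeUninit};

#[derive(Debug, PartialEq)]
struct ExampleData {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

// Hypothetical helper: builds an ExampleData from exactly size_of bytes.
fn from_bytes(bytes: &[u8]) -> Option<ExampleData> {
    if bytes.len() != size_of::<ExampleData>() {
        return None; // too short would leave bytes uninitialised, too long would overflow
    }
    // Correctly aligned stack storage whose contents start out uninitialised.
    let mut slot = MaybeUninit::<ExampleData>::uninit();
    unsafe {
        // SAFETY: source and destination are both exactly size_of bytes long
        // and cannot overlap; afterwards every byte of `slot` is initialised,
        // and integer fields are valid for any bit pattern.
        std::ptr::copy_nonoverlapping(bytes.as_ptr(), slot.as_mut_ptr() as *mut u8, bytes.len());
        Some(slot.assume_init())
    }
}

fn main() {
    let bytes = vec![0xFFu8; size_of::<ExampleData>()];
    println!("{:?}", from_bytes(&bytes));
}
```

There’s no Box to free here because nothing was put on the heap, which also removes the pointer arithmetic for getting back to the start.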

Is there a no-copy solution?⌗

As far as I can tell, there isn’t a way to tell the Rust compiler that we want to change the alignment of an existing allocation, which then means that (unless the struct has 1-byte alignment) there isn’t a way to do a no-copy solution which annoys me. If you work out a way of doing it - I’d love to hear it.

Unlike the first copying method which required the struct to be Copy and was UB unless the struct was packed, this doesn’t need that and as far as I can tell, it’s sound.
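For the 1-byte alignment case, here is a sketch of what taking over the Vec’s allocation could look like, as far as I can tell. PackedData and from_vec are names invented for this example. I’ve used #[repr(C, packed)] so the field order is pinned down, and Vec::into_boxed_slice so the allocation is exactly as long as the struct. Note that into_boxed_slice may itself reallocate if the Vec has spare capacity, so it is only copy-free when the capacity already matches.

```rust
#[allow(dead_code)]
#[repr(C, packed)]
struct PackedData {
    a: u8,
    b: u64,
    c: [u8; 7],
    d: u32,
}

// Hypothetical helper: reuses the Vec's heap allocation as a Box<PackedData>.
fn from_vec(bytes: Vec<u8>) -> Option<Box<PackedData>> {
    if bytes.len() != std::mem::size_of::<PackedData>() {
        return None;
    }
    // Drops any spare capacity, leaving an allocation of exactly `len` bytes
    // with alignment 1, then hands ownership of it to us as a raw pointer.
    let raw: *mut [u8] = Box::into_raw(bytes.into_boxed_slice());
    // SAFETY: the allocation's size (20) and alignment (1) match
    // Layout::new::<PackedData>(), so Box will free it with the layout it was
    // allocated with, and every field is valid for any bit pattern.
    Some(unsafe { Box::from_raw(raw as *mut PackedData) })
}

fn main() {
    let data = from_vec(vec![0xFF; 20]).expect("exactly 20 bytes");
    // Packed fields must be copied out rather than borrowed.
    let (b, d) = (data.b, data.d);
    println!("b = {}, d = {}", b, d);
}
```

I wouldn’t treat this as proof that the approach is sound in general - it only works because the packed struct and the byte buffer agree on both size and alignment.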

Heap Data⌗

That’s all well and good, but what about Heap data like a Vector or a String?

By itself, you’d have to know at compile-time the type, but you could probably get away with just writing the contents of the pointer, and carefully reading them back. However, you’d have to be careful around cases where more space was allocated than used which would be annoying.
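As a sketch of what “writing the contents and carefully reading them back” might look like for a String, the usual trick is to write the length first and then only the bytes in use, so spare capacity never reaches the file. The function names and the 8-byte little-endian length prefix are my own choices for the example, not any standard format.

```rust
// Hypothetical format: an 8-byte little-endian length, then the UTF-8 bytes.
fn encode_string(s: &str, out: &mut Vec<u8>) {
    out.extend_from_slice(&(s.len() as u64).to_le_bytes());
    out.extend_from_slice(s.as_bytes()); // only `len` bytes, never the spare capacity
}

// Returns the decoded string plus whatever input is left over.
fn decode_string(input: &[u8]) -> Option<(String, &[u8])> {
    if input.len() < 8 {
        return None; // not even a complete length prefix
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&input[..8]);
    let len = u64::from_le_bytes(len_bytes);
    let rest = &input[8..];
    if len > rest.len() as u64 {
        return None; // the prefix claims more bytes than we actually have
    }
    let (body, remaining) = rest.split_at(len as usize);
    // from_utf8 rejects invalid data instead of creating a broken String.
    let s = String::from_utf8(body.to_vec()).ok()?;
    Some((s, remaining))
}

fn main() {
    let mut buf = Vec::new();
    encode_string("hello", &mut buf);
    encode_string("world", &mut buf);

    let (first, rest) = decode_string(&buf).expect("first string");
    let (second, rest) = decode_string(rest).expect("second string");
    println!("{} {} ({} bytes left)", first, second, rest.len());
}
```

Nothing here writes a pointer to disk, which is the whole point: the data is copied out of the heap and rebuilt in a fresh allocation on the way back in.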

The main complications would come from if that heap data was stored inside a struct. You’d no longer be able to just pretend it was all just a series of bytes, and you’d have to put in lots of manual work to not accidentally just write a memory address that would become invalid the second the program finished⁸, and to read back the correct number of bytes. At that point, you’re better off just using something like protobuf or serde.

Closing Thoughts⌗

Whilst yes, this does work I would never recommend it for any production use where you couldn’t control it very finely. The main problem comes from what I briefly mentioned earlier about the Rust ABI not being stable. If the compiler updates, even if you don’t change your code, the in-memory representation of your struct could change and then you’d never have any way of recovering that written data other than downgrading your compiler version to ‘before it broke’. I’d also be careful about manually writing bytes by hand instead of outputting an existing struct - Rust does some really interesting optimisations which could lead to bit patterns being something completely different to what you expected - like Option<bool> which only takes 1 byte. It’s also undefined behaviour for certain types to be in invalid bit states (eg. 0x0 and 0x1 are fine for a bool, but 0x3 is invalid) because they’re often used for those optimisations.
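To illustrate the invalid-bit-pattern point, here is a minimal sketch of checking a byte before treating it as a bool, rather than transmuting and hoping. bool_from_byte is a made-up helper, and the Option<bool> size is what current compilers produce rather than a documented promise.

```rust
use std::mem::size_of;

// Hypothetical helper: only 0 and 1 are valid bit patterns for a bool.
fn bool_from_byte(byte: u8) -> Option<bool> {
    match byte {
        0 => Some(false),
        1 => Some(true),
        _ => None, // anything else must be rejected, not reinterpreted
    }
}

fn main() {
    // The "missing" case of Option<bool> is stored in one of bool's
    // otherwise invalid bit patterns, so no extra byte is needed.
    println!("size_of::<bool>()         = {}", size_of::<bool>());
    println!("size_of::<Option<bool>>() = {}", size_of::<Option<bool>>());

    for byte in [0u8, 1, 3] {
        println!("{:#04x} -> {:?}", byte, bool_from_byte(byte));
    }
}
```

Any struct containing a bool, a char, an enum or a reference needs this kind of validation on the way back in, which is a large part of what crates in this space do for you.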

As to why the UB never actually caused any issues, I’ve not a clue. There’s probably someone who could explain more but I’m not that person. Reading padding is undefined behaviour, and I couldn’t say why it didn’t cause any issues here.

Whilst you could make arguments around the use of #[repr(C)] to safeguard against this, there’s another problem - there’s no description of what this data is if anyone’s poking around the filesystem. Most file formats start with something to mark what format they are, but there’s nothing here if you’re trying to read data later. Even if you don’t go as far as to use a self-describing format like JSON, most serialisation methods (as far as I can tell) at least make it clear that they’re storing data. With this, unless you’re a rare programmer who actually writes thorough documentation, I can’t even tell if it’s a binary or a file.
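A cheap way to get some of that back is to put a few marker bytes and a version number in front of the payload. The sketch below shows the idea: the magic value "EXD1", the version byte and both function names are invented for this example, not part of any existing format.

```rust
use std::io::{self, Read, Write};

// Hypothetical marker bytes and version for this made-up format.
const MAGIC: [u8; 4] = *b"EXD1";
const VERSION: u8 = 1;

// Writes the header followed by the raw payload bytes.
fn write_with_header<W: Write>(mut out: W, payload: &[u8]) -> io::Result<()> {
    out.write_all(&MAGIC)?;
    out.write_all(&[VERSION])?;
    out.write_all(payload)
}

// Checks the header and returns the payload, or an error for foreign data.
fn read_with_header<R: Read>(mut input: R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 5];
    input.read_exact(&mut header)?;
    if header[..4] != MAGIC || header[4] != VERSION {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "not a recognised ExampleData file",
        ));
    }
    let mut payload = Vec::new();
    input.read_to_end(&mut payload)?;
    Ok(payload)
}

fn main() -> io::Result<()> {
    // An in-memory buffer stands in for the file here.
    let mut file = Vec::new();
    write_with_header(&mut file, &[1, 2, 3])?;

    let payload = read_with_header(file.as_slice())?;
    println!("payload: {:?}", payload);
    Ok(())
}
```

Bumping the version whenever the struct (or the compiler) changes at least turns silent garbage into a clear error, even if it doesn’t help you read the old data.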

I’ll finish off by saying that this article was also a bit NIH and if you want to actually do things like this, I’d have a gander at bytemuck.

Anyways, that was fun to explore and try and get a better knowledge of UB and binary formats. Enjoy the new year!

Full final code can be viewed here.

errata⌗

I posted a link to this on the subreddit here and received some comments - here are my responses.

u/dkopgerpgdolfg⌗

This user pointed out one thing which makes me feel like a fool - if we know the length of our struct beforehand (which we do - it’s std::mem::size_of::<ExampleData>()), we can just read straight into the pointer:

fn read_in_straight_to_pointer (file_name: &str) -> std::io::Result<Option<ExampleData>> {
    unsafe {
        const STRUCT_SIZE: usize = std::mem::size_of::<ExampleData>();
        const LAYOUT: Layout = Layout::new::<ExampleData>();

        let mut ptr = std::alloc::alloc(LAYOUT); //create a pointer to the heap struct
        let mut bytes_read_total = 0; //create a counter for how many bytes we've written - this is like the `len` of a Vec, where `STRUCT_SIZE` is the `cap`

        let mut reader = File::open(file_name)?; //open a file
        let mut tmp = [0; 32]; //create a stack buffer for temporary reads - very little performance impact

        loop {
            match reader.read(&mut tmp)? {
                0 => break,
                n => {
                    let n = n.min(STRUCT_SIZE - bytes_read_total); //ensure we don't write past the memory we allocated
                    for i in 0..n {
                        ptr.write(tmp[i]);
                        ptr = ptr.add(1);
                    }

                    bytes_read_total += n;

                    if bytes_read_total == STRUCT_SIZE { //if there's more to read, we don't care
                        break;
                    }
                }
            }
        }

        if bytes_read_total < STRUCT_SIZE {
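            std::alloc::dealloc(ptr.sub(bytes_read_total), LAYOUT); //if we didn't read enough, then we need to remember to deallocate the block, as Rust won't do it for us - and we have to pass the original start pointer, not the one we've been advancing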
            Ok(None)
        } else {
            ptr = ptr.sub(STRUCT_SIZE);
            Ok(Some(*Box::from_raw(ptr as *mut ExampleData)))
        }
    }
}

I’ve also added corrections from them in the rest of the article relating to alignment, leaking the vec and C undefined behaviour.

u/hniksic⌗

I’ve added corrections relating to C undefined behaviour.

At least, interesting to me ;) ↩︎
and some semblance of a structure for you delightful readers. ↩︎
That is, they’re all in memory in one place without any pointers or memory addresses. ↩︎
That is, duckduckgo-ing for StackOverflow. ↩︎
That is, I didn’t realise this properly until I’d written far more of this article (the future bits are also super interesting). If you’ve read more of the article, this is why: if you pack the struct the alignment becomes 1 byte (packing forces the alignment down to 1 regardless of the fields), which then means that it has the same alignment as the vec’s buffer, so the initial solution (using Box::from_raw with the bytes.as_mut_ptr() as *mut ExampleData) is no longer unsound on alignment grounds. ↩︎ ↩︎
When the Rust vector reallocates, it doesn’t just add one to the capacity, it goes up in a pattern which is designed to amortise the allocation cost to be as low as possible. Because the Box only knows about the length, it won’t know to free the memory that’s in the capacity but not the length. ↩︎
100% correct 60% of the time ↩︎
Technically, you could modify the ExampleData from this post with any number of heap-based structures and it might still work. If the memory allocator didn’t de-allocate between you writing out the data (and dropping the old struct), and reading the memory address back I think it’d work? The problem would come from if you wrote on one run and read on another because that memory that it’s pointing to would no longer exist. ↩︎