Building 2 'Simple' Allocators

I’ve always wondered how allocators work, so over the last week I’ve been working on this.

NB: This is a pretty true-to-life recreation of my journey, so there are twists, turns and unexpected last-minute betrayals but I hope that showing my process of fixing stuff helps more than if I’d just cleaned up the whole article to make it look like I’d got everything perfect on the first try.

Where do I even start?#

All Rust allocators must start with the trait std::alloc::GlobalAlloc:

pub unsafe trait GlobalAlloc {
    // Required methods
    unsafe fn alloc(&self, layout: Layout) -> *mut u8;
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout);
}

There are two methods that must be provided - one to allocate memory and one to deallocate it. Unlike some other languages, the Rust allocator takes in an alignment and a length for the allocation and the deallocation. Fun fact - it’s also UB to use a memory allocation with one alignment for a purpose with another alignment (like you might if you were, say, trying to reinterpret a vector’s contents as a non byte aligned struct).

Then, once you’ve got an allocator, there are unstable methods to use it piecemeal (like Box::new_in for example), but these aren’t stable so I won’t really be looking at them here. Instead, we’ll be trying to build an allocator that gets used globally for our whole program. To do that, you have to do something like this:

pub struct Allocator {/* contents */ }; 

#[global_allocator]
static ALLOCATOR: Allocator = Allocator {/* contents */};

Interestingly enough, GlobalAlloc is one of a few unsafe traits, which means that there are extra invariants you must follow. For GlobalAlloc, you must ensure the following:

Allocation and Deallocation cannot panic.
You must completely follow the Layouts.
You must not rely on allocations not being optimised away.

Something simple to start?#

How about we start with just making an ‘allocator’ that just serves 4 byte alignment and size sections of memory. Why - so I can test it easily and make sure it actually works using miri. Then, I can slowly expand the functionality and improve it whilst keeping it correct.

My first best guess for an allocator would just be to pretend to be the heap, and use the stack 😉. I’ve taken some inspiration on that front from the example arena allocator from the GlobalAlloc docs.

We’ll start here:

use std::sync::atomic::AtomicBool;
use std::cell::UnsafeCell;
use std::alloc::{GlobalAlloc, Layout};

pub struct FourAllocator {
    is_allocating: AtomicBool,
    tracker: UnsafeCell<usize>,
    spaces: UnsafeCell<[u8; usize::BITS as usize * 4]>
}

unsafe impl Sync for FourAllocator {}

unsafe impl GlobalAlloc for FourAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        todo!()
    }

    unsafe fn dealloc (&self, ptr: *mut u8, layout: Layout) {
        todo!()
    }
}

#[global_allocator]
static ALLOCATOR: FourAllocator = FourAllocator {
    tracker: UnsafeCell::new(0),
    is_allocating: AtomicBool::new(false),
    spaces: UnsafeCell::new([0; usize::BITS as usize * 4])
};


fn main () {
    let test_ptr = unsafe { ALLOCATOR.alloc(Layout::new::<i32>()) };
}

We first define our allocator struct, consisting of an atomic boolean to act as a lock (I’ll come back to this), then a tracker (I’ll also come back to this later), and then an array of spaces held inside an UnsafeCell. UnsafeCell grants us the magical power of interior mutability - the ability to mutate something without needing &mut. You’ve probably used this before, but it would’ve been shielded by something like Mutex. Here, we’re using it by itself to handle our ‘allocations’. The difference here is that you just get a *mut T, and you are responsible for ensuring that it doesn’t get concurrently read or wrote to in different threads. That’s why we’ve got a lock.

Then Sync is the marker trait for allowing things to be shared between threads, and it is unsafe to implement because it can break lots of external guarantees. We’re safe here though because we ensure that the same spaces aren’t accessed concurrently and mutably in different threads. More info on Send and Sync can be found here.

We then create our allocator using #[global_allocator], and create an i32 on the Heap using Box. Anyone want to guess what happens if we run the program? We stack overflow as the allocator repeatedly calls itself when it allocates a string to print for the todo!. This would seem bad, but it confirms that we are indeed using the custom allocator. This limitation of not being able to allocate when allocating can make things difficult later, but I’ve got some plans cooking to help with that.

Maybe not so simple?#

The first step should be actually checking for our 4 size and alignment, which can be done with a quick if, returning a null pointer if not - which is the expected behaviour for out-of-memory and other errors in the docs. If we keep our todo!() underneath for now, but take it out from dealloc, then we can also test this!

If we try to allocate something with the correct size and alignment, we should see a stack overflow (as it reaches the todo!()), and if not then nothing should happen:

//Allocator:
unsafe impl GlobalAlloc for FourAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if layout.size() != 4 || layout.align() != 4 {
            return std::ptr::null_mut();
        }

        todo!()
    }

    unsafe fn dealloc (&self, ptr: *mut u8, layout: Layout) {}
}

//Main:
fn main () {
    //run 1:
    let layout = Layout::new::<u32>(); //u32 has layout 4, 4
    //run 2:
    let layout = Layout::new::<u8>(); //u8 has layout 1, 1

    let ptr = unsafe { std::alloc::alloc(layout) };
}

And we then run it?

cargo r --quiet

memory allocation of 10 bytes failed
error: process didn't exit successfully: `target\debug\i32-allocator.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)

# change to run two
cargo r --quiet

memory allocation of 10 bytes failed
error: process didn't exit successfully: `target\debug\i32-allocator.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)

10 bytes?#

Right, so something’s gone very wrong here - we only tried to allocate either 1 or 4 bytes, not 10. Ideally, the two runs should’ve also looked different.

error: unsupported operation: can't call (diverging) foreign function: __rust_alloc_error_handler
   --> C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\alloc.rs:383:13
    |
383 |             __rust_alloc_error_handler(layout.size(), layout.align());
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can't call (diverging) foreign function: __rust_alloc_error_handler
    |
    = help: this is likely not a bug in the program; it indicates that the program performed an operation that the interpreter does not support
    = note: BACKTRACE:
    = note: inside `std::alloc::handle_alloc_error::rt_error` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\alloc.rs:383:13: 383:70
    = note: inside `std::alloc::handle_alloc_error` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\alloc.rs:389:9: 389:75
    = note: inside `alloc::raw_vec::RawVec::<u16>::allocate_in` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\raw_vec.rs:204:27: 204:53
    = note: inside `alloc::raw_vec::RawVec::<u16>::with_capacity_in` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\raw_vec.rs:145:9: 145:69
    = note: inside `std::vec::Vec::<u16>::with_capacity_in` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\vec\mod.rs:672:20: 672:61
    = note: inside `std::vec::Vec::<u16>::with_capacity` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\vec\mod.rs:481:9: 481:49
    = note: inside `std::sys::windows::to_u16s::inner` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\sys\windows\mod.rs:186:32: 186:63
    = note: inside `std::sys::windows::to_u16s::<&str>` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\sys\windows\mod.rs:198:5: 198:22
    = note: inside `std::sys::windows::thread::Thread::set_name` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\sys\windows\thread.rs:66:32: 66:45
    = note: inside `std::sys::windows::init` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\sys\windows\mod.rs:66:5: 66:39
    = note: inside `std::rt::init` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\rt.rs:97:9: 97:39
    = note: inside closure at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\rt.rs:147:42: 147:67
    = note: inside `std::panicking::r#try::do_call::<{closure@std::rt::lang_start_internal::{closure#1}}, ()>` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\panicking.rs:552:40: 552:43
    = note: inside `std::panicking::r#try::<(), {closure@std::rt::lang_start_internal::{closure#1}}>` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\panicking.rs:516:19: 516:81
    = note: inside `std::panic::catch_unwind::<{closure@std::rt::lang_start_internal::{closure#1}}, ()>` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\panic.rs:142:14: 142:33
    = note: inside `std::rt::lang_start_internal` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\rt.rs:147:5: 147:70
    = note: inside `std::rt::lang_start::<()>` at C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\std\src\rt.rs:165:17: 170:6

I can’t really make heads or tails of most of that, but if you skim around carefully, you can see lots of references to the rust runtime startup. That then explains our issues - something else is allocating. This also makes sense because the runtime startup probably allocates to do things like setting up the environment variables and command-line arguments in a nice place.

If this is happening, it might be easier to just allocate pointers as and when we need them, rather than trying to create a global allocator from the start. This’ll also let us do other useful things like allocating during our alloc function. For now though, we’ll keep using the skeleton of GlobalAlloc, if only for it being good practice.

Local Allocator#

We’ll start by commenting out #[global_allocator]. Then, we’ll replace std::alloc::alloc with ALLOCATOR.alloc.

If we re-run our test from earlier (just to remind you - it should panic on unimplemented the first time, and not do anything on the second time):

cargo r --quiet

thread 'main' panicked at src\main.rs:18:9:
not yet implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\debug\i32-allocator.exe` (exit code: 101)

# change to run two
cargo r --quiet

# program exits successfully!

Success!

The next step would be to actually store allocations. We need some way to track when a slot is allocated and when it isn’t, and that’s why we’ve got a usize tracker and are storing usize::BITS * 4 u8s. We can use each bit in the usize to mark whether or not the relevant slot has been handed over yet.

Actually Storing Allocations?#

One of the invariants of UnsafeCell that we need to uphold is to ensure that it isn’t accessed mutably by more than one thread at a time. We also need to hold up our own invariant, which is that the tracker is always either accurate or overzealous. In the Rust where we’re allowed to allocate, the solution is clear cut and obvious - Arc<Mutex<usize>>, but that doesn’t hold here because that allocates :/ .¹

Instead, we’ll have to use an AtomicBool for whether we’re ‘holding the lock’, and then an UnsafeCell to store the actual usize. Before you point out that we could just use an AtomicUsize, we need to iterate through the usize a lot to find a space and I’m not sure how to ensure that an AtomicUsize doesn’t get accessed for a portion of time. Because of that, we need to emulate a Mutex and luckily, AtomicBool has a method called compare_exchange where we can use Ordering::SeqCst to keep it all thread-safe and ensure that only one thread will not be waiting at any time. compare_exchange takes two arguments at the start - a current variable and a target variable. It’ll only switch the AtomicBool to be the target, if it is currently the current. Using that, we can then set it to only set it to be allocating when it isn’t already allocating. The other two arguments are about ordering - the first is for what ordering to use when comparing and setting on success, and the second is for how to load current on failure. Here, I’ll just use SeqCst because I’m never 100% sure on Atomic Orderings, and I know this one will work.

while self.is_allocating.compare_exchange(false, true, SeqCst, SeqCst).is_err() {std::thread::yield_now()}

Since it returns Result::Err whenever it fails to store the new value, we can just wait for it to not do that. To give a small performance boost, we can also tell it to yield_now which just tells the CPU scheduler to go to another thread. This makes sense because we can’t do anything until we’ve allocated so we may as well go elsewhere.

Then, we can get the pointer to our tracker, and copy it to the stack for convenience:

let tracker_ptr = self.tracker.get(); 
let tracker_copied = *tracker_ptr;

UnsafeCell::get is unsafe, but we know that our use is sound because we’re the only thread accessing this right now (because of our AtomicBool lock).

Then, we need to find the next slot that’s free in tracked_copied.

let mut chosen_index = None;
for i in 0..usize::BITS {
    if tracker_copied & (1 << i) == 0 {
        *tracker_ptr = tracker_copied | (1 << i);
        chosen_index = Some(i as usize);
        break;
    }
}

So, we have a counter for iterating through the bits of the tracker. We’re counting 0 as free and 1 as occupied, as we start our tracker as all zeroes. We check each bit by shifting 0b1 the relevant number of bits, and if we logical and that with the tracker_copied and get 0 we must know that that slot is free. We then update tracker_copied using a logical or, and set chosen_index to be the index we ended up using. We then break the loop.

Now that we’re done with parts that care about the tracker, we can ‘hand back’ our ’lock’ by storing false. We then also need to upkeep our invariants by ensuring not to use the pointer we hold for the rest of the function.

self.is_allocating.store(false, SeqCst);

Then, we need to extract the our chosen index. We aren’t 100% sure that we actually had a free space, but if we didn’t then that variable never got set to Some, and so we can extract it using let-else:

let Some(chosen_index) = chosen_index else {
    return std::ptr::null_mut();
};

That way, if we didn’t find a index, we can just return a null pointer.

Finally, we can actually return a valid index into our array with:

(self.spaces.get() as *mut u8).add(chosen_index * 4)

Since self.spaces is an array, we know it is all in one space in our stack memory, so we can just cast it to the element in that array and add on chosen_index * 4 to give each allocation 4 bytes.

So, overall:

unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    if layout.size() != 4 || layout.align() != 4 {
        return std::ptr::null_mut();
    }

    //wait until we can start allocating
    while self.is_allocating.compare_exchange(false, true, SeqCst, SeqCst).is_err() {std::thread::yield_now()}
    //theoretically, now only this thread should be able to allocate
    
    let tracker_ptr = self.tracker.get(); 
    let tracker_copied: usize = *tracker_ptr;
    //Sound, as we're the only thread accessing this pointer
    let mut chosen_index = None;
    for i in 0..usize::BITS {
        if tracker_copied & (1 << i) == 0 {
            *tracker_ptr = tracker_copied | (1 << i);
            chosen_index = Some(i as usize);
            break;
        }
    } 

    //we're done with the tracker, so we can give back our lock
    self.is_allocating.store(false, SeqCst);

    let Some(chosen_index) = chosen_index else {
        return std::ptr::null_mut();
    };

    (self.spaces.get() as *mut u8).add(chosen_index * 4)
}

But does it actually work?#

If we go back into our main function, we can write a quick function to keep on allocating until we get a null pointer (which here, means that we ran out of memory):

let layout = Layout::new::<u32>();
let mut i = 0;
loop {
    let ptr = unsafe { ALLOCATOR.alloc(layout) };
    println!("Got {ptr:?}");
    
    if ptr.is_null() {
        println!("Ran out @ attempt {}", i + 1);
        break;
    }

    i += 1;
}

Then, if we run that we can see that it successfully manages 64 allocations before getting a null pointer. That makes sense, because on a 64-bit platform (like my machine), addresses are 64-bit and so a usize has 64 bits. We can even run it using cargo +nightly miri run and still have success which is a good sign.

There’s still another test though - what about multi-threading? Since we know we’ll have a maximum of 64 allocations, we can spin up 5 threads which will each attempt to make 15 allocations so we can see that multiple threads can access our allocations and all will work:

for t in 0..5 {
    let cloned_layout = layout.clone();
    std::thread::spawn(move || {
        for r in 0..15 {
            let ptr = unsafe {ALLOCATOR.alloc(cloned_layout)};
            println!("T{t}R{r} got {ptr:?}");
        }
    });
}

And then (with miri for safety):

cargo +nightly miri run

# warnings from the compiler about not using variables in `dealloc`
T0R0 got 0x2da10
T1R0 got 0x2da14
T2R0 got 0x2da18
T3R0 got 0x2da1c
T4R0 got 0x2da20
error: the main thread terminated without waiting for all remaining threads

It looks like each thread managed to get 1 allocation done before stopping without waiting?

Right, so it turns out that I just forgot to tell the main thread to actually wait for the threads to finish, so let’s quickly fix that:

let mut threads = vec![];

for t in 0..5 {
    let cloned_layout = layout.clone();
    let handle = std::thread::spawn(move || {
        //do thread stuff
    });
    threads.push(handle);
} 

for mut handle in threads {
    handle.join().expect("unable to join handle");
}

I’ve deliberately picked 5 threads - too many could lead to nobody actually waiting for the allocator because of the time it takes to start a thread, and too few wouldn’t have a high enough chance of causing interesting race conditions (if there were any).

And if we then run it we see a bunch of different memory addresses and 11 null pointers. 5 * 15 - 11 = 64, which means that we managed to allocate the right number of addresses, and so theoretically didn’t have race conditions. Yay!

What about de-allocating?#

So, right now we’ve got 64 allocations for the entire span of our program. That’s not very much, so it’d be nice to be able to de-allocate, so we become limited to 64 concurrent allocations, with theoretically infinite allocations over the span of the program. The following code samples will be to start filling out fn dealloc (...).

When deallocating, an invariant that is held for us (how nice!) is that that pointer was obtained from this allocator and so we can make a few very handy assumptions. Also - since memory from alloc is regarded as uninitialised (and it is in fact UB to access it without writing to it first), we don’t have to worry about zeroing the memory².

This means that we really only have to wait to use the tracker, and then zero the relevant bit. We can work out the relevant bit because we know how long each section is and where the memory block we have starts.

let offset = (ptr as usize - self.spaces.get() as usize) / 4;
while self.is_allocating.compare_exchange(false, true, SeqCst, SeqCst).is_err() {std::thread::yield_now()}

let ptr = self.tracker.get();
*ptr = *ptr - (1 << offset);

self.is_allocating.store(false, SeqCst);

We can then miri a quick single-threaded test:

for _ in 0..1000 {
    let ptr = unsafe {ALLOCATOR.alloc(layout)};
    if ptr.is_null() {
        eprintln!("What?");
    }
    unsafe { ALLOCATOR.dealloc(ptr, layout) };
}

And it exits successfully, which implies that everything worked - yay!

If we also tell it to print out the pointers, we see that we only actually used one of our slots because each time it would search for places and just find the first place every time!

If we modify it to only deallocate every 5 runs, and to only to do 16 total runs:

let mut to_be_freed = Vec::with_capacity(10);
for i in 0..16 {
    let ptr = unsafe {ALLOCATOR.alloc(layout)};

    println!("{ptr:?}");
    to_be_freed.push(ptr);

    if i % 5 == 0 {
        for ptr in std::mem::take(&mut to_be_freed) {
            unsafe { ALLOCATOR.dealloc(ptr, layout) };
        }
    }
}

We can see that it re-uses the same 5 slots:

cargo +nightly miri run

0x24320 //start, clears list & deallocates because 0 % 5 == 0    0
0x24320 //starts again at beginning                              1
0x24324 //                                                       2
0x24328 //                                                       3
0x2432c //                                                       4
0x24330 //5 % 5 == 0, so deallocates                             5
0x24320 //starts again                                           6
0x24324 //                                                       7
0x24328 //                                                       8
0x2432c //                                                       9
0x24330 //deallocates as 10 % 5 == 0                             10
0x24320 //starts again                                           11
0x24324 //                                                       12
0x24328 //                                                       13
0x2432c //                                                       14
0x24330 //deallocates as 15 % 5 == 0                             15

Final code can be seen here.

What now?#

We could continue on with this basic structure of a bit-tracker and a bank of u8s held in an UnsafeCell. The problems start when you can’t rely on 1 bit being equal to 1 allocation, as you then have to start searching for long spaces which have enough space. If you want to go beyond 64 slots, you’d also need to start handling multiple usizes (possibly by using something like bitvec::BitArray). Then, you need to have all of your memory stored on the stack which would a) limit you to a relatively small size and b) mean that you have to have all of the memory at the ready for the whole program, regardless of how much you actually need.

You might also need to start worrying about heap fragmentation, where the heap has enough free slots for the memory required but they aren’t contiguous, so you then need to make sure that your allocations are in ideal places (eg. small allocations in gaps to leave room for bigger ones to come later).

All of that having been said, I’ll be damned if I’m not going to even try to make a really shit one that technically works for any sized allocation.

Time to handle more layouts#

We’ll be starting from scratch here, but based off a similar structure to the previous allocator and I’ll start with a few things that would get annoying quickly with a multi-sized allocator.

`tracker` replacement#

Instead of storing one tracker in an UnsafeCell, we’ll store an array of Option<Allocation>s. This will lead to a maximum number of allocations (which I’ll control using a constant generic parameter here), but that number will not change based off the size of things allocated. I’ll also control the size of our ‘heap memory’ using a constant generic parameter. I’ll also add a function to quickly get the whole range of an Allocation for convenience.

use std::alloc::Layout;
use std::sync::atomic::AtomicBool;
use std::cell::UnsafeCell;

#[derive(Copy, Clone)]
struct Allocation {
    start: *mut u8,
    layout: Layout,
}

impl Allocation {
    pub fn get_range (&self) -> (*mut u8, *mut u8) {
        //we need to minus one because this range is inclusive
        (self.start, unsafe { self.start.add(self.layout.size() - 1) })
    }
}

struct Allocator<const MAX_ALLOCS: usize, const MEMORY_SIZE: usize> {
    is_allocating: AtomicBool,
    allocations: UnsafeCell<[Option<Allocation>; MAX_ALLOCS]>,
    memory: UnsafeCell<[u8; MEMORY_SIZE]>
}

Finding a space to allocate into#

Then, we’ll start with a convenience function to find the next free space of a set size. One interesting thing to note is that I’ve decided to make it unsafe, with the invariant being that we hold the ’lock’. Below is just a skeleton which we’ll fill out as we go.

impl<const MAX_ALLOCS: usize, const MEMORY_SIZE: usize> Allocator<MAX_ALLOCS, MEMORY_SIZE> {
    ///SAFETY: must hold onto the `is_allocating` lock before calling
    ///
    ///Finds the next space free for an allocation of given length. NB: does not update
    ///`allocations`
    pub unsafe fn find_next_space (&self, length: usize) -> *mut u8 {
        let current_allocations = *self.allocations.get();
        let end = self.memory.get().add(length);

        let mut next_space_to_chk = self.memory.get();
        loop {
            if next_space_to_chk.add(length) > end {
                break;
            } 

        }

        std::ptr::null_mut()
    }
}

We quickly copy our list of allocations and store a convenience variable for the end of our list. We also make a temporary variable for the next space we need to check. Then, at the start of each loop we check if we can allocate that pointer without going past the end of the list, and if we can’t we break. If we break out of the loop, the only thing we can do is to just return a None. Now, we’ll have a look over a naive algorithm to find the next usable block of memory that takes place during the loop, after the out-of-bounds check.

let (start_check, end_check) = (next_space_to_chk, next_space_to_chk.add(length));

let mut space_found = true;
for allocation in current_allocations.iter().flatten() {
    let (start_block, end_block) = allocation.get_range();

    if (start_block <= start_check && start_check <= end_block) || (start_block <= end_check && end_check <= end_block) {
        space_found = false;
        next_space_to_chk = end_block.add(1);
        break;
    }
}

if space_found {
    return Some(next_space_to_chk);
}

Here, we just loop through all of our allocations to check if any part of our desired block for allocation is within another allocation. If it is, then we set the next spot to check to be just after the end of the block we intersected with and break the loop. If it isn’t, then we return that slot. We can then use this to write an actual function to allocate!

Allocating Function#

Firstly, we’ll quickly implement Sync for Allocator. Like before, this is an unsafe impl but we know that this is sound because we control access to be single-threaded for the UnsafeCell. We can then also quickly tell Rust to use this as the global allocator like we did right at the start. I’ll also add a skeleton GlobalAlloc implementation.

use std::alloc::GlobalAlloc;

unsafe impl<const MAX_ALLOCS: usize, const MEMORY_SIZE: usize> GlobalAlloc for Allocator<MAX_ALLOCS, MEMORY_SIZE> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        todo!()
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        todo!()
    }
}

unsafe impl<const MAX_ALLOCS: usize, const MEMORY_SIZE: usize> Sync for Allocator<MAX_ALLOCS, MEMORY_SIZE> {}

#[global_allocator]
static ALLOCATOR: Allocator<1000, 1_000_000> = Allocator {
    is_allocating: AtomicBool::new(false),
    allocations: UnsafeCell::new([None; 1000]),
    memory: UnsafeCell::new([0; 1_000_000])
};

Then, we have to build a function to actually allocate. These are the steps we need:

Obtain the lock in order to make it sound to mutate allocations.
Find a space to allocate (and if not, early return std::ptr::null_mut).
Find a space to add it to our allocations list. If we find a spot, add it to the list (and if not, then std::ptr::null_mut).
Unlock the lock.
Return our pointer from Step 2.

This is what I’ve managed for that:

unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    //get the lock
    while self.is_allocating.compare_exchange(false, true, SeqCst, SeqCst).is_err() {std::thread::yield_now()}

    //find the position
    let Some(position_to_allocate) = self.find_next_space(layout.size()) else {
        //out of memory, return null_mut
        return std::ptr::null_mut();
    };

    //add it to our array of allocations
    let allocations_ptr = self.allocations.get() as *mut Option<Allocation>;
    let mut found_spot = false;
    for i in 0..MAX_ALLOCS {
        let this_spot = allocations_ptr.add(i);
        if (*this_spot).is_none() {
            *this_spot = Some(Allocation { start: position_to_allocate, layout });
            found_spot = true;
            break;
        }
    }
    if !found_spot {
        //out of allocations, return null_mut
        return std::ptr::null_mut();
    }
    
    self.is_allocating.store(false, SeqCst);
    
    position_to_allocate
}

We first use the same snippet as earlier to acquire the lock. Then we use let-else to extract our pointer or early return. We then iterate through our array of allocations, and when we find a free slot we add our allocation there. If we didn’t find a free slot, then we early return. We then unlock and return the pointer.

We can then add a simple main function that does a few allocations to test this:

fn main() {
    println!("Hello, world!");
    let a = 5_i32;
    println!("Variable a = {a}");

    let mut v = (1..10).into_iter().map(|x| (x, x * 5)).collect::<Vec<(u8, u8)>>();
    for (one, five) in v {
        println!("{one} * 5 = {five}");
    }
}

And if we check our console, we can see that we get all of our expected output:

$ cargo run --quiet
Hello, world!
Variable a = 5
1 * 5 = 5
2 * 5 = 10
3 * 5 = 15
4 * 5 = 20
5 * 5 = 25
6 * 5 = 30
7 * 5 = 35
8 * 5 = 40
9 * 5 = 45

But that’s not the end of allocation, unfortunately.

Did you really think we’d avoided UB?#

If we run it through miri instead, we get this:

$ cargo +nightly miri run
error: Undefined Behavior: constructing invalid value: encountered an unaligned reference (required 8 byte alignment but found 1)
    --> C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\sync.rs:1834:24
     |
1834 |             ptr::write(&mut (*inner).strong, atomic::AtomicUsize::new(1));
     |                        ^^^^^^^^^^^^^^^^^^^^ constructing invalid value: encountered an unaligned reference (required 8 byte alignment but found 1)
     |
# further output removed - it just says that this is from the language runtime startup

And we come to an old friend - alignment. We didn’t really have to deal with this earlier (in our 4-4-only allocator), because the length was identical to the alignment and we didn’t just randomly skip bytes to have to deal with alignment. Here, however it is an issue. Luckily, the fix is relatively simple.

The first change needed is to make our find_next_space take in a Layout rather than just a length (and then in the callsite, we just remove the .size() call). Then, we need to change our code that changes next_space_to_chk. Previously, we just set it to one after the end of the block that we overlapped with, but now we need to set it to the next aligned pointer.

Rust integer division rounds down, so we can find the number of align spaces we need to add to get away from this block using (end_block as usize - start_check as usize) / align + 1. Then, we just multiply that by align. That’s this line: next_space_to_chk = end_block.add(1); going to: next_space_to_chk = next_space_to_chk.add(((end_block as usize - start_check as usize) / align + 1) * align);.

If we quickly run this through miri, we see no complaints! Next on the list is deallocation.

Deallocation#

Just like last time, de-allocation is vastly simpler than allocation. We just need to find which allocation struct was dealing with that allocation, and then remove it. We don’t even need to worry about using the layout when de-allocating because we know the pointer and the start of the allocation will be identical (as specified in the GlobalAlloc docs).

unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
    while self.is_allocating.compare_exchange(false, true, SeqCst, SeqCst).is_err() {std::thread::yield_now()}       

    let allocations_ptr = self.allocations.get() as *mut Option<Allocation>;
    for i in 0..MAX_ALLOCS {
        let this_spot = allocations_ptr.add(i);
        match *this_spot {
            Some(Allocation { start, layout: _ }) if start == ptr => {
                *this_spot = None;
                break;
            },
            _ => {}
        }
    }

    self.is_allocating.store(false, SeqCst);
}

Let’s just run a multi-threading test for safety - I’ve constructed one that does computations and allocations on multiple threads to really check that we’ve got one that works:

fn main() {
    fn factorial (n: u128) -> u128 {
        if n <= 1 {
            return n;
        }
        
        n * factorial(n - 1)
    }

    let mut threads = vec![];
    for block in 0..6 {
        let t = std::thread::spawn(move || {
            let mut v = vec![];
            for n in (1..=5).into_iter().map(|x| block + x * 5) {
                v.push((n, factorial(n)));
            }

            println!("{v:?}");
        });

        threads.push(t);
    }

    for t in threads {
        t.join().unwrap();
    }
}

Theoretically, each thread should have a roughly similar amount of work to do with a naive factorial algorithm. This also works perfectly correctly using the standard allocator.

cargo +nightly miri run

[(5, 120), (10, 3628800), (15, 1307674368000), (20, 2432902008176640000), (25, 15511210043330985984000000)]
[(6, 720), (11, 39916800), (16, 20922789888000), (21, 51090942171709440000), (26, 403291461126605635584000000)]
[(7, 5040), (12, 479001600), (17, 355687428096000), (22, 1124000727777607680000), (27, 10888869450418352160768000000)]
error: Undefined Behavior: Data race detected between (1) non-atomic read on thread `<unnamed>` and (2) non-atomic write on thread `<unnamed>` at alloc65+0x4c8. (2) just happened here
    --> C:\Users\micro\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\alloc\src\vec\mod.rs:1923:13
     |
1923 |             ptr::write(end, value);
     |             ^^^^^^^^^^^^^^^^^^^^^^ Data race detected between (1) non-atomic read on thread `<unnamed>` and (2) non-atomic write on thread `<unnamed>` at alloc65+0x4c8. (2) just happened here
     |
# further output that doesn't help

And we’ve got a data race. 😔

UB 2: Electric Boogaloo#

To debug this, we should probably work out a way to provide some form of logging. The problem occurs when we realise that just using println would cause infinite recursion because that allocates a string³. I’ve got one idea on how to do this nicely and then a backup (which will definitely work).

Printing without allocating#

My first instinct is to go with the standard library, and see what I can do safely - and I can test both of these methods by just putting one at the top of my alloc function.

Usually, if I want to write to stdout without using println!, I have to get a handle to Stdout and then just use std::io::Write:

let s = "Allocating!\n";
std::io::stdout().lock().write_all(s.as_bytes()).unwrap();

Here, I’ve unwrapped the error - this technically breaks the invariant on not panicking but I’m going to ignore that for now because this function won’t be here any longer than about another sentence. If we run it we recurse and so we never even reach the unwrap which means that Stdout::write_all⁴ must allocate somewhere. We could dive deep and bypass it, or we could just go to where the standard library eventually links to - libc.

A while ago, I read Amos’s piece on Small Strings in Rust, and I remember that he somehow managed to print to stdout without allocating. That method is to use libc::write to stdout just passing a pointer and length. That does unfortunately⁵ mark our first dependency - I love doing things myself but I draw the line at syscalls (especially on Windows 😭). He then goes on to use an actually sensible method for getting the allocations to the console, but I’m going for something different⁶.

To test it to start with we’ll just try to print out a string:

let string = "Allocating!\n";
libc::write(???, string.as_ptr() as _, string.len() as _);

Most of that call is fine, apart from the ??? for which we need the file descriptor of the output stream we want to send stuff to. This is in libc for linux users under libc::STDOUT_FILENO, but not for windows 😔. I turned to DuckDuckGo and found this SO post, which mentions the file descriptor being 0. That did nothing for me, so I tried 1 and luckily it worked. Examining the post closer, it says 0! so I’m just going to assume that they meant factorial(0) rather than using ! for exclamation. (correction: u/skynetsatellite013 pointed out that I had actually misread the SO post and it was talking about STDIN_FILENO not STDOUT_FILENO 🤦) I’ll chuck that into a constant, and we then get a function for printing string references:

const STDOUT_FILENO: i32 = 1;
fn print_string (s: &str) {
    unsafe {
        libc::write(STDOUT_FILENO, s.as_ptr() as _, s.len() as u32);
    }
}

Quick note: we need to cast the pointer to a *const c_void, but since the Rust compiler knows that we can just use an _ to tell it to infer. This just saves an import.

Unfortunately, this is where miri forsakes us because we’re making function calls which it cannot decipher. Once we’re finished with our whole print_allocation function, I’ll show a quick method to still be able to use miri with the rest of the program.

Then, we can extend that logic to make a function to print numbers. After some quick searching to find how to get the digits of a number (I managed to adapt something from here) we can then print them out. That code spat out numbers in reverse order for me so I just print them out in reverse.

fn print_number (mut n: usize) {
    const NUMBERS: &[u8; 10] = &[b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9'];

    let mut numbers = [0; 64];
    let mut digit = 0;
    while n != 0 {
        let this_digit = (n % 10) as usize;
        numbers[digit] = this_digit;

        n /= 10;
        digit += 1;
    }

    while digit != 0 {
        digit -= 1;
        unsafe { libc::write(STDOUT_FILENO, (NUMBERS.as_ptr() as *const u8).add(numbers[digit]) as _, 1) };
    }
}

We can then use both of those functions to build out a print_allocation function:

fn print_allocation (layout: Layout) {
    print_string("Thread #");
    print_number(unsafe { winapi::um::processthreadsapi::GetCurrentThreadId() as usize});

    print_string(" Allocating ");
    print_number(layout.size());
    print_string("-");
    print_number(layout.align());

    print_string(" \n");
}

Because I’m on a windows machine, in order to get the thread ID⁷, I have to add another dependency - winapi with the processthreadsapi feature. I can then use winapi::um::processthreadsapi::GetCurrentThreadId().

Then, we can add this to the top of fn alloc(layout: Layout), just below where we acquire the lock.

#[cfg(not(miri))]
print_allocation(layout);

The conditional compilation statement just makes sure that this function never gets called if we’re running with miri.

As we worked out earlier, even if you do nothing, the main function does some allocations, so we can just run fn main () {} and see:

cargo r --quiet

Thread #2024 Allocating 10-2
Thread #2024 Allocating 5-1
Thread #2024 Allocating 48-8
Thread #2024 Allocating 64-8

If I run it multiple times, then I get the same allocations displayed, only with different thread IDs.

We can then add another function:

fn print_end_of_allocation () {
    print_string("\tThread #");
    print_number(unsafe { winapi::um::processthreadsapi::GetCurrentThreadId() as usize});
    print_string(" finished allocation\n");
}

And use it just before we lose the lock and with the same conditional compilation gate. We then see this as output:

cargo run --quiet

Thread #2072 Allocating 10-2
        Thread #2072 finished allocation
Thread #2072 Allocating 5-1
        Thread #2072 finished allocation
Thread #2072 Allocating 48-8
        Thread #2072 finished allocation
Thread #2072 Allocating 64-8
        Thread #2072 finished allocation
# more output cut

What were we doing again?#

If we were to then run our latest multi-threading test again we’d get a lot of output. To avoid this, I’ve just replaced the vec! calls with calls to Vec::with_capacity with capacity to hold the relevant struct. I’ve also added the thread ID to each call to println! inside a thread. That way, we might be able to see if it’s an issue with allocation or deallocation.

Unfortunately, the output suggests that only one thread is ever holding the lock as each allocation start is immediately followed by that thread ID saying that it finished that allocation. This suggests that our problem lies elsewhere, which is unfortunate because that might be harder to fix.

What else could it be?#

Thanks to the printing, we know at least know that only one thread is concurrently using the allocation function. However, since we’re still getting data races⁸ there must be concurrent access to pointers. The next step⁹ is then to look at where we choose our pointer to hand over in find_next_space:

pub unsafe fn find_next_space (&self, layout: Layout) -> Option<*mut u8> {
    let current_allocations = self.allocations.get().read();
    let end = (self.memory.get() as *mut u8).add(MEMORY_SIZE);

    let length = layout.size();
    let align = layout.align();

    let mut next_space_to_chk = self.memory.get() as *mut u8;
    loop {
        if next_space_to_chk.add(length) > end {
            break;
        }

        let (start_check, end_check) = (next_space_to_chk, next_space_to_chk.add(length));
        let mut space_found = true;
        for allocation in current_allocations.iter().flatten() {
                let (start_block, end_block) = allocation.get_range();
                if (start_block <= start_check && start_check <= end_block) || (start_block <= end_check && end_check <= end_block) {
                    space_found = false;
                    next_space_to_chk = next_space_to_chk.add(((end_block as usize - start_check as usize) / align) * align);
                    break;
                }
        }

        if space_found {
            return Some(next_space_to_chk);
        }
    }

    None
}

We make a copy of our existing allocations for convenience - we’re reading from the pointer, but we know that we won’t get any data races because we’ve locked the lock.
We make a convenience variable to store the pointer after what we’re allowed to allocate - we only ever compare to this so we’re fine. This accesses the memory pointer, but never dereferences it and we’re in the lock anyway.
We set a variable equal to the start of our memory. This will eventually become our pointer to the memory, but we modify and check it several times so we’ll deal with that in a minute.

Then, we enter a loop:

We check if we’ve gone out of our memory bounds. We can check for > end because next_space_to_chk.add(length) represents the pointer after the final one we use (because of zero-indexing).
We then create a convenience variable with the that next_space_to_chk.add(length) as the value.
We also declare a temporary variable for a flag, which assumes that we will find a space. We can then invalidate that if we overlap in the loop.

We then enter another loop for every allocation:

We get the inclusive range of pointers that that allocation uses.
We then check if either our start or end checking pointers are in the range. I’m not 100% sure about this maths, so we’ll check back on it later but it looks right to me right now.
If that check works, then we invalidate the assumption that we found a space.
We then do some pointer maths to add a multiple of align that should be just after the end_block. If we’re testing this part, we can make the assumption that we have far more memory than we need for this program. From our earlier tests, I didn’t see any allocations bigger than about 100 bytes, so if we just add 1000 * align, we’re pretty much guaranteed to be fine¹⁰. If we run this with multi-threading under miri it works! 🥳🥳🥳¹¹

This suggests that there’s an error with my logic that chooses when we’ve overlapped with another allocation. I’m also not the happiest with how this function looks so I’m going to make another pass over it:

pub unsafe fn find_next_space (&self, layout: Layout) -> Option<*mut u8> {
    let current_allocations = self.allocations.get().read();
    let end = (self.memory.get() as *mut u8).add(MEMORY_SIZE);

    let length = layout.size();
    let align = layout.align();

    let mut next_space_to_chk = self.memory.get() as *mut u8;
    loop {
        let end_check = next_space_to_chk.add(length - 1);
        if end_check >= end {
            return None;
        }

        let space_found = current_allocations.iter().flatten().all(|allocation| {
            let (start_block, end_block) = allocation.get_range();
            (next_space_to_chk < start_block && end_check < start_block) || (next_space_to_chk > end_block && end_check > end_block)
        });

        if space_found {
            return Some(next_space_to_chk);
        }
        next_space_to_chk = next_space_to_chk.add(align);
    }
}

This new function reduces some of the duplication I had earlier like both start_check and next_space_to_chk existing which was confusing. This function also works with no UB which would suggest that it is sound. I’ve also changed the logic for working out whether we overlap - either the whole block needs to be before the start, or after the end which makes more sense to me.

The low-hanging fruit for optimisation is to change the all call to a find call so we know which block we overlapped, to reduce the number of iterations.

let overlapping_block = current_allocations.iter().flatten().find(|allocation| {
    let (start_block, end_block) = allocation.get_range();
    !((next_space_to_chk < start_block && end_check < start_block) || (next_space_to_chk > end_block && end_check > end_block))
});

match overlapping_block {
    Some(block) => {
        let end = block.get_range().1;

        let delta = end as usize - next_space_to_chk as usize;
        let aligns = delta / align + 1;
        next_space_to_chk = next_space_to_chk.add(align * aligns);
    },
    None => return Some(next_space_to_chk)
}

And that’s us done! Final code can be seen here.

Closing Thoughts#

I can now technically say that I’ve built an allocator - more than likely one of the worst ones you’ve ever seen, but one that does work just about. There are however a few problems with it right now:

Because I don’t really know how to get more memory on the fly¹², all of the memory in this allocator exists for the entire program which means that you have to know how much memory you are going to need at compile-time which is pretty annoying. That’s why in the gist I just give it a gigabyte¹³ of RAM to make that a problem that I don’t have to deal with.
This allocator just tries to find the first space it can which would result in pretty bad heap fragmentation in a longer running program. This is where the heap has more than enough space to store what you’re trying to allocate but not contiguously because of bad planning. For example, let’s say we have a 10 byte heap. We allocate 2 u32s in the first 8 bytes. We then deallocate the first u32, and allocate a u8. That then goes in the first free space at the very start. We now have 5 bytes allocated out of 10, but nowhere to put another u32 because our heap is fragmented.
Because we’re not getting a heap in a way that miri understands, miri and other similar tools cannot give us any help with pointers that aren’t freed - if we put a quick let ptr = unsafe { alloc(Layout::new::<u8>()) };, and then never free that we won’t have any way to know except for careful program analysis, which as we all know is 100% effective. It also won’t warn us for use-after-frees. It is still a useful tool for things like concurrent accesses, but not much more than that.

That pretty much sums up my thoughts on the topic. I’ve gotta also say that I’ve loved doing this in Rust - the only runtime error that took me forever to figure out was the UB right at the end. This was also a really interesting project for me, because I got to take a deep look at a thing that I haven’t really even thought about before and that I usually just take for granted.

As with my other memory-related post, I’m happy to acknowledge that I’m not an all-knowing Rust God and if I’ll happily add corrections with credit. All I can say is that I’ve tried to ensure that the code is sound.

Yes, I know that because we’re using our own allocator piecemeal that we are allowed to allocate and deal with allocating stuff. However, firstly this is good practice for later when we actually want to allocate everything later. Also - Arc::new isn’t const, so we can’t create it in a static context. Yes, we could just create the allocator at the top of main, but then we go back to the first point. ↩︎
No guarantees are made about the contents of memory from alloc, but if you need pre-zeroed memory one of the provided functions for GlobalAlloc is GlobalAlloc::alloc_zeroed, which just gets the memory and writes zeroes to it if the pointer wasn’t null. ↩︎
How do I know? I was pretty sure it did, and then I added a quick println! to the top of Allocator::alloc to confirm. ↩︎
Stdout::write also hits a recursion error, so not it’s not that. I just didn’t want to deal with partial writes here. ↩︎
Unfortunately for my NIH, not for any actual reason. For actual projects though, I’m more than happy to add dependencies. ↩︎
He creates an array by [0; 1024], then creates a Cursor holding the array which implements std::io::Write, so he can then use serde_json::to_writer. Then, he can get the end of the cursor to create a slice for all of the characters of the array that actually matter. ↩︎
I can’t use std::thread::current().id().as_u64() as that allocates. It’s also unstable right now, and this article is about stable Rust. I also can’t just use libc::pthread_self() because windows. ↩︎
Just to clarify - this is when multiple threads try to concurrently store data in the same memory address. The problem is that which thread writes first is non-deterministic outside of Atomic operations, which we aren’t using here (because Rust only has a few built-in Atomic types, which wouldn’t cover a 1 million byte array). ↩︎
This should really have been the first step, but I was already looking for an excuse to print without allocating. ↩︎
It’s also relatively easy to test this - just quickly remove the multi-threading parts, and it all works fine even under miri. ↩︎
You know what they say - premature optimisation is the root of all evil - relevant xkcd. ↩︎
Apart from just using std::alloc::System, but at that point you’re basically just wrapping it and probably doing a worse job. ↩︎
Fun fact - a gigabyte is actually 1 million bytes because we’re using the SI prefix of giga. If you want to talk about 1024 * 1024 bytes, you need to use a gibibyte (aka GiB). Similar binary prefixes also exist for smaller units like mebibyte and kibibyte. The confusion comes where people mess them up (sometimes accidentally, and sometimes accidentally) and you get hard drives seeming smaller than they actually are and other similar problems. ↩︎

Building 2 ‘Simple’ Allocators

Plan of Attack

Where do I even start?#

Something simple to start?#

Maybe not so simple?#

10 bytes?#

Local Allocator#

Actually Storing Allocations?#

But does it actually work?#

What about de-allocating?#

What now?#

Time to handle more layouts#

`tracker` replacement#

Finding a space to allocate into#

Allocating Function#

Did you really think we’d avoided UB?#

Deallocation#

UB 2: Electric Boogaloo#

Printing without allocating#

What were we doing again?#

What else could it be?#

Closing Thoughts#

Building 2 ‘Simple’ Allocators

Plan of Attack

Where do I even start?#

Something simple to start?#

Maybe not so simple?#

10 bytes?#

Local Allocator#

Actually Storing Allocations?#

But does it actually work?#

What about de-allocating?#

What now?#

Time to handle more layouts#

tracker replacement#

Finding a space to allocate into#

Allocating Function#

Did you really think we’d avoided UB?#

Deallocation#

UB 2: Electric Boogaloo#

Printing without allocating#

What were we doing again?#

What else could it be?#

Closing Thoughts#

`tracker` replacement#