Integer Compression
Plan of Attack
Recently, I’ve been working on a compression engine (Souris), and I’m really pretty proud of some of my compression tricks, so I thought I’d go through a few of them here!
We’ll start with one of the fundamental ones - integers. The main point of this compression is to get integers to take as few bytes as possible - if I’m encoding the length of a list, the chances are that it might only be 10 or 100, which would fit in one byte, but normally takes 8[^1].
For reference, whenever I mention *serialising*, I’m talking about turning an in-memory structure into a bunch of bytes that can be written, and I’m using *deserialising* to refer to turning the bytes back into an in-memory structure.
Getting started⌗
The way I think is best to do this is to go through two main stages - first we take in a number and turn it into an intermediary form that’s easy to serialise and deserialise, as well as being easy to get from/into a normal number type like `u32` or `usize`, and then we serialise that intermediary form into bytes.
For now, we’ll only be dealing with unsigned integers, as signed integers make this quite messy.
`usize` -> `Integer`⌗
We’ll first create our intermediary struct `Integer`:
const MAX_BYTES: usize = u128::BITS as usize / 8; // 16 - room for the largest integer type

struct Integer {
    content: [u8; MAX_BYTES],
    bytes_used: usize,
}
We can see that we’ve got space to store all the bytes for anything up to a `u128`, the current largest integer, as well as a field to store how many bytes we’ve actually used.
For now, let’s start by making the conversion from `usize`, and then we can work out how to generalise it. In Rust, when converting infallibly[^2], we typically implement `From`, so let’s add a skeleton for that.
impl From<usize> for Integer {
    fn from(n: usize) -> Self {
        todo!("AGRRGGHHH!!")
    }
}
We need to fill out those bytes, and luckily `usize` has a method called `usize::to_le_bytes`, which converts our `usize` into an array of little-endian bytes. Little-endian is the standard byte order on modern machines, so it’s easy for the CPU to produce, which makes it the natural choice here.
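As a quick sanity check (assuming an 8-byte `usize`, i.e. a 64-bit machine), this is what those bytes look like for the value we’ll test with later:

// 1000 = 3 * 256 + 232, so the low byte comes first in little-endian order.
assert_eq!(1000_usize.to_le_bytes(), [232, 3, 0, 0, 0, 0, 0, 0]);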
let mut content = [0_u8; MAX_BYTES];
for (i, b) in n.to_le_bytes().into_iter().enumerate() {
    content[i] = b;
}
That all works! The next part of our scaffold is working out how many bytes we actually used here. There are two ways we could do this:

- We could use `usize::ilog2` and some maths to work out how many bytes we used.
- We could find the last full byte and work out how many bytes are empty, and use that to work out the number we used.
Having done both, they’re equally good solutions. However, I’ve had some issues with the first[^3], as the maths isn’t 100% obvious on reading. It’s the eternal dilemma of code - I love the sick one-liners and cool tricks, but I spend far more time reading code than writing it, so I’m going to go with the simpler solution here.
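For the curious, here’s roughly what the first option could look like - just a sketch for comparison, not the code we’ll use (note that `ilog2` panics on zero, so that case needs special handling):

// Sketch of the ilog2 approach: how many bytes does n need?
fn bytes_needed(n: usize) -> usize {
    if n == 0 {
        1 // ilog2 panics on 0, and zero still takes one byte to store
    } else {
        (n.ilog2() as usize) / 8 + 1
    }
}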
To go with the second, we just need to find the index of the last byte that isn’t empty, and then add 1 to get the number of bytes we used.
let mut content = [0_u8; MAX_BYTES];
let mut last_non_zero_byte = 0;

for (i, b) in n.to_le_bytes().into_iter().enumerate() {
    content[i] = b;
    if b != 0 {
        last_non_zero_byte = i;
    }
}

Integer {
    content,
    bytes_used: last_non_zero_byte + 1,
}
If we add a quick `#[derive(Debug)]` to the `Integer` declaration, and a quick main function, we get:
fn main() {
    let int: Integer = 1000.into();
    println!("{int:?}");
}
Which then outputs:
Integer { content: [232, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], bytes_used: 2 }
We used two bytes, and those numbers look alright to me so let’s keep on going!
`Integer` -> `Vec<u8>`⌗
Theoretically here we just need to serialise the number of bytes used (which we can do in just one byte, as `MAX_BYTES` is equal to 16, which is less than `u8::MAX`), and then add all of the relevant bytes!
impl Integer {
    pub fn ser(&self) -> Vec<u8> {
        let mut output = vec![];
        output.push(self.bytes_used as u8);
        output.extend(&self.content[0..self.bytes_used]);
        output
    }
}
Theoretically we could even skip serialising the length! However, that trick only works if an `Integer` is the only thing in the byte stream - otherwise it isn’t clear how many bytes to read. If you do serialise the length, you could convert our next `deser` method to use something like `sourisdb::Cursor` to take only as many bytes as it needs from a longer series of bytes.
If we then modify our `main` method to print out the serialised version[^4], we get this:
[2, 232, 3]
That makes sense! We get the two from the two bytes used, then the same bytes as earlier. Congrats! We’ve just managed to turn an 8-byte `usize` into a 3-byte `Vec<u8>`!
`Vec<u8>` -> `Integer`⌗
But what about the other way around?
This is even easier! We just need to read in the length, then read in those bytes and add them to an `Integer`.
impl Integer {
    pub fn deser(bytes: &[u8]) -> Self {
        let len = bytes[0] as usize;
        let mut content = [0_u8; MAX_BYTES];

        for (i, b) in bytes[1..(len + 1)].iter().copied().enumerate() {
            content[i] = b;
        }

        Self {
            content,
            bytes_used: len,
        }
    }
}
If we go back to our main method and deserialise the serialised bytes, we get this out:
Integer { content: [232, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], bytes_used: 2 }
Which is the same as earlier!
`Integer` -> `usize`⌗
Our last step is to get it back into a `usize`. This is the first step where we could fail - we need to make sure that our destination type has enough space. For example, we wouldn’t be able to store our current test value inside a `u8`.
Let’s start with a skeleton.
impl TryFrom<Integer> for usize {
    type Error = ();

    fn try_from(n: Integer) -> Result<usize, Self::Error> {
        todo!("AARRGGHH");
    }
}
Because we’re only dealing with one error type right now, and this is just a proof-of-concept, I just used `()` as the error type.
So, earlier we used `usize::to_le_bytes` to get our `usize` into some bytes, so we should probably use `usize::from_le_bytes` to turn some bytes back into a `usize`.
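Just to convince ourselves it really is the mirror image (again assuming an 8-byte `usize`):

// from_le_bytes is the exact inverse of to_le_bytes.
assert_eq!(usize::from_le_bytes([232, 3, 0, 0, 0, 0, 0, 0]), 1000);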
As always, we can just create an array of the right size for the method signature, then fill it in!
const T_BYTES: usize = usize::BITS as usize / 8;

if n.bytes_used > T_BYTES {
    return Err(());
}

let mut output = [0_u8; T_BYTES];
for (i, b) in n.content.into_iter().take(n.bytes_used).enumerate() {
    output[i] = b;
}

Ok(usize::from_le_bytes(output))
If we try that in this main method:
let int: Integer = 1000.into();
let sered = int.ser();
let deser = Integer::deser(&sered);
let back_out: usize = deser.try_into().unwrap();
println!("{back_out}");
We get $1000$ out, and since $1000 \equiv 1000$, we should be all good?
Let’s check⌗
This is a great opportunity to introduce `proptest`, which we’ll be using to work out whether our algorithm actually works, or only works on certain inputs.
We’re pretty sure it works on 1000, but we haven’t checked any other number. We could write a test to check every number, but that would take far too long[^5] to be reasonable to run as frequently as tests should run.
`proptest` is designed to check a random set of inputs from the set of all possible inputs, and when it finds an error, it’ll try to ‘reduce’ it down to the simplest input that results in an error. It isn’t guaranteed to find all the edge cases, but it can be incredibly useful for checking lots of cases without having to basically re-write the test harness on every run.
We can add `proptest` to our `Cargo.toml` following their instructions, and then get started with some tests!
Our four main use-cases are the four stages we went through earlier, so we can have these two tests to cover all of that:

- A test to ensure that it can go from `usize` -> `Integer` -> `usize` correctly.
- A test to go from `usize` -> `Integer` -> `Vec<u8>` -> `Integer` -> `usize`.
If these tests fail, we can also add more tests to diagnose exactly where the error comes from.
This is what I’d write for that:
#[cfg(test)]
mod tests {
    use super::*;
    use proptest::prelude::*;

    proptest! {
        #[test]
        fn check_in_memory(n: usize) {
            let int = Integer::from(n);
            let back = usize::try_from(int).unwrap();
            prop_assert_eq!(back, n);
        }

        #[test]
        fn check_serialised(n: usize) {
            let int = Integer::from(n);
            let sered = int.ser();
            let back_int = Integer::deser(&sered);
            let back_n = usize::try_from(back_int).unwrap();
            prop_assert_eq!(back_n, n);
        }
    }
}
If we then `cargo test` or even `cargo nextest run`, we see that all of our tests pass! Since we’re checking all of the functionality here[^6], I’m confident that we can move on!
Small Numbers⌗
There’s one thing that’s annoying me about this right now. If we were to try to expand this to `u8`, most numbers would have to take two bytes - one for the length (which is always one) and one for the content.
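For instance, with the code as it stands, even a tiny value costs two bytes:

// One byte for the length, one for the content - double what the value needs.
let five: Integer = 5.into();
assert_eq!(five.ser(), vec![1, 5]);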
However, there’s actually a trick we can adapt from the `bincode` specification.
We always know that our Integers can be no more than `MAX_BYTES` (16) bytes long, so a length would always be in the range `0..=16`. That leaves 239 values of the first byte that we aren’t using, so we can say that any first byte in the range `0..=239` represents that number itself, but any value greater than that represents $239 + x$ where $x$ is the length.

For convenience, we can also add a constant to the top of our file that holds 239 - this also makes it clearer to future programmers why the constant is the way it is.
const ONE_BYTE_MAX_SIZE: usize = u8::MAX as usize - MAX_BYTES; // 255 - 16 = 239
To implement this, we can just add this if statement to the top of our `ser` method, and change the length serialisation below it:
if self.bytes_used <= 1 {
    let first_byte = self.content[0];
    if first_byte as usize <= ONE_BYTE_MAX_SIZE {
        return vec![first_byte];
    }
}

let mut output = vec![];
output.push((self.bytes_used + ONE_BYTE_MAX_SIZE) as u8);
output.extend(&self.content[0..self.bytes_used]);
output
And change our `deser` method to check whether the first byte is a small value in its own right or is representing a length:
let first_byte = bytes[0];
let mut content = [0_u8; MAX_BYTES];

if first_byte as usize <= ONE_BYTE_MAX_SIZE {
    content[0] = first_byte;
    let bytes_used = usize::from(content[0] > 0);
    return Self {
        content,
        bytes_used,
    };
}

let len = first_byte as usize - ONE_BYTE_MAX_SIZE;
for (i, b) in bytes[1..(len + 1)].iter().copied().enumerate() {
    content[i] = b;
}

Self {
    content,
    bytes_used: len,
}
We can then re-run our tests, and it all works!
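As a final sanity check of the new encoding (a quick sketch, assuming `ONE_BYTE_MAX_SIZE` works out to 239 as above):

fn main() {
    // Small values now take just one byte...
    let small: Integer = 100.into();
    assert_eq!(small.ser(), vec![100]);

    // ...while larger values pay one byte for the length marker:
    // 1000 needs 2 bytes, so the marker is 239 + 2 = 241.
    let big: Integer = 1000.into();
    assert_eq!(big.ser(), vec![241, 232, 3]);
}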
Left to the Reader⌗
Whilst this code all works, there are several things that would need to be done before I’d use it in a production environment:
Error Handling⌗
Right now almost every method is written as if it were infallible, but plenty of them have failure modes. For example, what if the `bytes` slice passed to `deser` doesn’t contain enough bytes?
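A minimal sketch of how that particular case could be handled inside the `impl Integer` block - the `try_deser` name and the `Option` return are just for illustration, and it’s based on the simple length-prefixed format from before the small-number trick:

// Hypothetical fallible deserialiser: returns None instead of panicking
// when the slice doesn't contain enough bytes.
pub fn try_deser(bytes: &[u8]) -> Option<Self> {
    let len = *bytes.first()? as usize;
    if len > MAX_BYTES || bytes.len() < len + 1 {
        return None;
    }

    let mut content = [0_u8; MAX_BYTES];
    content[..len].copy_from_slice(&bytes[1..=len]);

    Some(Self {
        content,
        bytes_used: len,
    })
}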
Utility Traits⌗
This code is pretty hard to use in a larger codebase right now - ideally you would implement a number of traits like `FromStr`, `Display`, `Debug`, and maybe even some of the `std::ops` traits.
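As an example, here’s a minimal sketch of what a `Display` impl could look like, assuming `MAX_BYTES` is 16 so that `content` can be read straight back as a `u128`:

use std::fmt;

impl fmt::Display for Integer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // content is already a little-endian [u8; 16], so we can read the
        // value back out as a u128 and print that.
        write!(f, "{}", u128::from_le_bytes(self.content))
    }
}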
More types⌗
Currently, this code only supports dealing with `usize`s[^7] - what about `u16`s or `u8`s or even `u128`s? Ideally, you’d implement some kind of macro to generate the code for each type - it shouldn’t be too hard, as the code isn’t really that dependent on it being a `usize`; you mostly just need to change some of the types.
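Here’s a minimal sketch of that idea - the `impl_from_uint` name is made up, and it just stamps out the same `From` logic we wrote earlier for each unsigned type:

// Hypothetical macro that generates a From impl for each listed unsigned type.
macro_rules! impl_from_uint {
    ($($t:ty),*) => {
        $(
            impl From<$t> for Integer {
                fn from(n: $t) -> Self {
                    let mut content = [0_u8; MAX_BYTES];
                    let mut last_non_zero_byte = 0;

                    for (i, b) in n.to_le_bytes().into_iter().enumerate() {
                        content[i] = b;
                        if b != 0 {
                            last_non_zero_byte = i;
                        }
                    }

                    Integer {
                        content,
                        bytes_used: last_non_zero_byte + 1,
                    }
                }
            }
        )*
    };
}

impl_from_uint!(u8, u16, u32, u64, u128, usize);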
Negative Numbers⌗
Negative numbers are difficult, but possible - if this post gets much interaction I’ll share how I managed to get them working. If you decide to have a go at it, don’t underestimate it - that algorithm took me 4 iterations to actually get something that worked and that I understood. If you want to see a hint of my pain, check the blame for that file and weep.
My Production code⌗
As of writing, this is my latest iteration - it includes all of the above, and has functioned incredibly effectively as a backbone for all of `sourisdb`’s integer serialisation needs. As you can see, once you implement the extra traits it balloons to almost a thousand lines of code (including docs/tests).
[^1]: Assuming a 64-bit computer, and 8-bit bytes.
[^2]: In such a way that we cannot fail.
[^3]: Skill issue? 100% yes.
[^4]: Left as an exercise to the reader.
[^5]: I ran a quick test on my laptop, and it took 1962s to reach $3378000000000$. Assuming it stays at a constant rate (which it likely won’t, because as the numbers get bigger, more bytes will have to be copied), it’ll take around 340 years to reach `usize::MAX` on my laptop.
[^6]: And I’ve already written this algorithm and tested it far more thoroughly in my own time.
[^7]: And yes, `u64`s on 64-bit platforms and `u32`s on 32-bit platforms.