10 Things You Should Know About Memory Alignment
Memory alignment is a topic that many developers rarely think about.
In high-level languages like Python and JavaScript, memory alignment is hidden from the developer. However, if you’re programming in a low-level language like C or C++, you have more control over alignment. This can be helpful in domains like embedded software and game programming, where efficient use of memory and CPU is a top priority.
Here are 10 things you should know about memory alignment.
- 0: What even is unaligned memory access?
- 1: Unaligned memory access is bad
- 2: The compiler will pad your structs for alignment reasons
- 3: You can rearrange struct fields to reduce padding. The compiler won’t do this for you.
- 4: Your compiler can detect unaligned access
- 5: You can tell the compiler to remove padding from structs
- 6: You might need to manually align data in certain scenarios
- 7: You can customize the alignment of data at compile time
- 8: You can’t use malloc() for custom alignments
- 9: Fast power-of-2 alignment
0: What even is unaligned memory access?
Unaligned memory access occurs when you try to access N bytes of data starting from an address that is not evenly divisible by N. For example, attempting to read a uint32_t from address 0xc021 is an unaligned read, because 0xc021 is not a multiple of 4 (the bottom two bits would have to be 0 for it to be a multiple of 4).
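As a rough sketch of that definition, a power-of-two alignment check comes down to testing the low bits of the address. The helper name is_aligned is made up for illustration and is not part of any standard library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard function): returns true if addr is a
 * multiple of n. Assumes n is a nonzero power of two, as alignments are. */
static bool is_aligned(uintptr_t addr, size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    /* For a power of two, the low log2(n) bits must all be zero. */
    return (addr & (uintptr_t)(n - 1)) == 0;
}
```

With this, is_aligned(0xc021, 4) is false, while is_aligned(0xc020, 4) is true.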
The alignment requirement of a primitive type will typically be the size of the type. C11 introduced _Alignof, which you can use to query the alignment of any type in code:
#include <stdint.h>
#include <stdio.h>
#define EVAL_PRINT(expr) printf("%-20s = %u\n", #expr, (uint8_t)(expr));
struct foo {
int a;
char b;
float c;
};
int main(void) {
EVAL_PRINT(_Alignof(char));
EVAL_PRINT(_Alignof(uint8_t));
EVAL_PRINT(_Alignof(uint16_t));
EVAL_PRINT(_Alignof(uint32_t));
EVAL_PRINT(_Alignof(int));
EVAL_PRINT(_Alignof(struct foo));
EVAL_PRINT(_Alignof(uint64_t));
EVAL_PRINT(_Alignof(void*));
EVAL_PRINT(_Alignof(size_t));
return 0;
}
When compiling for my x86_64 machine, I get:
$ gcc alignof.c && ./a.out
_Alignof(char) = 1
_Alignof(uint8_t) = 1
_Alignof(uint16_t) = 2
_Alignof(uint32_t) = 4
_Alignof(int) = 4
_Alignof(struct foo) = 4
_Alignof(uint64_t) = 8
_Alignof(void*) = 8
_Alignof(size_t) = 8
Notice that the alignment of struct foo is set to the alignment of its most strictly aligned field (here int and float, both 4).
If I compile in 32-bit mode, I get:
$ gcc -m32 alignof.c && ./a.out
_Alignof(char) = 1
_Alignof(uint8_t) = 1
_Alignof(uint16_t) = 2
_Alignof(uint32_t) = 4
_Alignof(int) = 4
_Alignof(struct foo) = 4
_Alignof(uint64_t) = 4
_Alignof(void*) = 4
_Alignof(size_t) = 4
The difference is that uint64_t is now 4-byte aligned instead of 8, since the word size is 32 bits, and that pointers and size_t are themselves only 4 bytes wide.
Note that having an array does not change the alignment. For instance: uint8_t x[16]; has an alignment of 1, not 16.
Unaligned memory access can occur when casting variables to types of different lengths, or when doing pointer arithmetic followed by access to at least 2 bytes of data. This kind of thing is not uncommon to encounter in embedded software or other kinds of low-level software that interface directly or indirectly with hardware.
1: Unaligned memory access is bad
The way an unaligned memory access is handled will depend on your processor architecture (e.g. x86, ARM). Some processors will be able to perform the access transparently in hardware (perhaps with a performance cost), and others will raise an exception which must be handled in software. Sometimes the unaligned access will require two memory reads instead of one.
Depending on your processor, an unaligned access might be a minor performance hit, it might crash everything, or even worse, it might silently perform a different memory access than the one you requested.
2: The compiler will pad your structs for alignment reasons
You may already know this, but the compiler inserts padding into your structs in order to properly align each field.
There are a few interesting points to highlight:
- You could insert another char after a without disturbing the layout of the rest of the struct. You could think of the pad byte as basically “free memory” that is not being used.
- The overall size of the struct is 12 bytes, even though an optimally packed struct would only take up 8 bytes. That’s 50% bigger, and it might make a difference if there are lots of these structs in your application.
- There are 3 pad bytes after d, even though there’s nothing after d in the struct. This is because if you have an array of structs, like struct foo foos[4], then foos[1] will be a properly aligned access (address aligns to a multiple of the alignment of the struct, which is 4).
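The struct being discussed is not reproduced in this text, so here is a reconstruction that matches the description: one pad byte after a, three pad bytes after d, 12 bytes total. The middle fields b and c and their types are assumptions, and the offsets hold for typical ABIs such as x86_64 and 32-bit ARM rather than being guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout; fields b and c are guesses that fit the description. */
struct foo {
    char a;      /* offset 0                                   */
                 /* 1 pad byte so that b starts at offset 2    */
    uint16_t b;  /* offset 2                                   */
    uint32_t c;  /* offset 4                                   */
    char d;      /* offset 8                                   */
                 /* 3 pad bytes so sizeof is a multiple of 4   */
};

/* These checks encode the typical-ABI layout described above. */
static_assert(offsetof(struct foo, b) == 2, "1 pad byte after a");
static_assert(offsetof(struct foo, d) == 8, "d follows c directly");
static_assert(sizeof(struct foo) == 12, "3 trailing pad bytes");
```

offsetof from <stddef.h> is a handy way to confirm where the compiler actually put each field.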
3: You can rearrange struct fields to reduce padding. The compiler won’t do this for you.
By carefully rearranging struct fields, you can achieve a more optimal packing while maintaining proper alignment.
Taking the prior example, you could do it this way:
In fact, if you order the fields from largest alignment to smallest, there is never any padding between fields; only trailing padding can remain.
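As a sketch, assume the struct from the previous section holds a char a, a uint16_t b, a uint32_t c and a char d (the names and types of b and c are assumptions, since the original figure is not reproduced here). Ordering them from largest to smallest brings the size from 12 bytes down to 8 on typical ABIs. The name foo_reordered is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same assumed fields as before, sorted from largest to smallest. */
struct foo_reordered {
    uint32_t c;  /* offset 0 */
    uint16_t b;  /* offset 4 */
    char a;      /* offset 6 */
    char d;      /* offset 7 */
};               /* no padding needed: 8 is already a multiple of 4 */

static_assert(sizeof(struct foo_reordered) == 8, "no padding on typical ABIs");
```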
Truthfully, when I’m creating a struct, I don’t give much thought to reordering fields to save a few bytes. But if you’re planning to instantiate a lot of them, it might make sense. This is probably in the realm of “premature optimization” unless you’ve benchmarked and measured the potential gains by reordering.
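It’s tempting to think the compiler will do this for you, but it’s not allowed to: the C standard requires struct members to be laid out in the order they are declared, and existing code and ABIs depend on that layout.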
4: Your compiler can detect unaligned access
Suppose you have this program:
#include <stdint.h>
uint8_t read_as_u8(const uint8_t* data) {
return *(const uint8_t*)data;
}
uint16_t read_as_u16(const uint8_t* data) {
return *(const uint16_t*)data;
}
int main(void) {
uint8_t data[4] = { 0x00, 0x11, 0x22, 0x33 };
read_as_u8(&data[1]);
read_as_u16(&data[1]);
return 0;
}
The call to read_as_u16(&data[1]) will result in an unaligned read because we are reading from an address that is not aligned to a 2-byte boundary (i.e. the alignment of a uint16_t). The call to read_as_u8 will not result in an unaligned read, because uint8_t has an alignment of 1.
Will the compiler be able to detect this unaligned access for us?
If you enable -Wcast-align, you might get a warning. Or you might not.
On my machine (x86_64), and an older version of gcc (7.5.0), I get:
$ gcc -Wcast-align align.c
$
Hmm, it didn’t warn me. There are two reasons for that:
- According to Intel’s Basic Architecture Manual:
Words, doublewords, and quadwords do not need to be aligned in memory on natural boundaries. … However, to improve the performance of programs, data structures (especially stacks) should be aligned on natural boundaries whenever possible.
So, my processor doesn’t have any alignment requirements for ordinary loads and stores.
- GCC’s documentation for -Wcast-align states:
Warn whenever a pointer is cast such that the required alignment of the target is increased.
Since my target (x86_64) doesn’t require alignment, GCC didn’t warn me.
But even if the target doesn’t require alignment, there are still potential performance problems with unaligned access (as indicated by the Intel manual), so it would be nice to get the warning.
As of GCC 8, you can use -Wcast-align=strict to emit a warning even if the target doesn’t require alignment, which will produce the warning:
<source>: In function 'read_as_u16':
<source>:9:13: warning: cast increases required alignment of target type [-Wcast-align]
9 | return *(const uint16_t*)data;
| ^
Compiler returned: 0
It’s generally recommended to always enable -Wcast-align=strict (or just -Wcast-align if you can’t use =strict), since there are many targets which have alignment requirements, and it’s not included in -Wall or -Wextra.
You can also detect unaligned memory access at runtime by compiling with -fsanitize=alignment:
$ gcc -Wcast-align -fsanitize=alignment align.c
$ UBSAN_OPTIONS=print_stacktrace=1 ./a.out
align.c:8:12: runtime error: load of misaligned address 0x7ffdebd31b25 for type 'const uint16_t', which requires 2 byte alignment
0x7ffdebd31b25: note: pointer points here
1c d3 eb 00 11 22 33 00 3b 69 96 f6 e9 16 7a 70 08 00 57 50 56 00 00 87 fc 5c 3a f4 7f 00 00 00
^
#0 0x565057000804 in read_as_u16 (/home/nick/src/scratch/align/a.out+0x804)
#1 0x565057000854 in main (/home/nick/src/scratch/align/a.out+0x854)
#2 0x7ff43a5cfc86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#3 0x5650570006e9 in _start (/home/nick/src/scratch/align/a.out+0x6e9)
It even gives you a helpful memory dump showing exactly where the unaligned access was attempted.
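Once the tools have pointed out an unaligned read like the one in read_as_u16, one common way to make it safe is to copy the bytes into a properly aligned local with memcpy instead of casting the pointer. The following is a minimal sketch; the name read_u16_unaligned is made up for this example, and the result uses the host’s byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical replacement for read_as_u16: copies two bytes into an aligned
 * local instead of dereferencing a possibly misaligned uint16_t pointer. */
uint16_t read_u16_unaligned(const uint8_t *data) {
    uint16_t value;
    assert(data != NULL);
    /* memcpy has no alignment requirement on either pointer. */
    memcpy(&value, data, sizeof value);
    return value;  /* interpreted in the host's byte order */
}
```

Compilers typically turn a small fixed-size memcpy like this into a single load on targets that allow unaligned access, so the safe version usually costs nothing there.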
5: You can tell the compiler to remove padding from structs
Suppose you have data that you want to send “over the wire” to another computer. The destination computer may not have the same processor architecture or word size as the source computer, and the endianness might be different too.
You should usually prefer well-known serialization methods like protobuf or CBOR for data exchange, as they handle the packing/unpacking of data for you, but sometimes sending raw binary can make sense.
If you want to send the data as raw binary, then you will generally want to remove all padding, in case the destination processor has different alignment requirements. Otherwise, the receiving end may not be able to read the data.
In these cases, you can tell the compiler to remove all padding with #pragma pack(1):
#include <stdint.h>
#include <stdio.h>
#define EVAL_PRINT(expr) printf("%-20s = %u\n", #expr, (uint8_t)(expr));
struct foo {
char a;
int32_t b;
};
#pragma pack(1)
struct foo_packed {
char a;
int32_t b;
};
#pragma pack()
int main(void) {
EVAL_PRINT(_Alignof(struct foo));
EVAL_PRINT(sizeof(struct foo));
EVAL_PRINT(_Alignof(struct foo_packed));
EVAL_PRINT(sizeof(struct foo_packed));
struct foo_packed f = { 'a', 1 };
EVAL_PRINT(f.b);
return 0;
}
which produces:
$ gcc -Wcast-align=strict -fsanitize=alignment alignof.c && ./a.out
_Alignof(struct foo) = 4
sizeof(struct foo) = 8
_Alignof(struct foo_packed) = 1
sizeof(struct foo_packed) = 5
f.b = 1
Notice that f.b, which is at offset 1 in the packed struct, is an unaligned read of an int32_t, but the compiler and runtime sanitizer did not warn about it.
If you look at it in compiler explorer with an ARM32 compiler, you’ll notice that it gets tagged as “unaligned”:
ldr r3, [r7, #1] @ unaligned
Here, r7 contains the address of f, and we start reading the int32_t from offset 1.
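It seemed odd to me that the runtime sanitizer did not flag this, but it is by design: the compiler knows that members of a packed struct may be unaligned, so it emits an access that is safe for the target. The danger comes when you take the address of f.b and dereference it through an ordinary int32_t pointer.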
As you might be able to tell, sending raw binary like this between computers is fraught with peril. And that’s not even considering potential endianness mismatches. It can work if you know what you’re doing.
Proceed with caution when using #pragma pack(1), and only use it if you have a good reason.
6: You might need to manually align data in certain scenarios
Although standard alignment works most of the time, there are many scenarios where you will need to align data even more strictly.
A few specific examples:
- Hardware peripheral requires buffers to be 16-, 32-, or 64-byte aligned
- Placement of ARM interrupt vector table requires 128-byte alignment
- Some SIMD instructions (e.g. the aligned SSE loads and stores) require data to be 16-byte aligned
- Performance tuning by aligning data to the cache line size (e.g. 64 bytes)
- Eliminate “false sharing” contention in multi-core applications
There are many other examples, but these are some of the most common.
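To make the false sharing case concrete, here is a small sketch that gives each per-thread counter its own cache line using C11 _Alignas (covered in the next section). The 64-byte line size, the CACHE_LINE_SIZE macro and the padded_counter type are all assumptions for illustration; check the actual cache line size of your target.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: common on x86_64, not universal */

/* Aligning the member raises the alignment of the whole struct, so its size
 * is rounded up to a full cache line as well. */
struct padded_counter {
    _Alignas(CACHE_LINE_SIZE) uint64_t value;
};

/* One counter per thread; neighbours never share a cache line. */
static struct padded_counter counters[4];

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each counter occupies exactly one cache line");
```

The trade-off is memory: each 8-byte counter now takes up 64 bytes.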
7: You can customize the alignment of data at compile time
In C11, _Alignas was introduced, which provides a convenient way to specify the alignment of a struct field or a variable.
For example:
#include <stdint.h>
#include <stdio.h>
#define EVAL_PRINT(expr) printf("%-20s = %u\n", #expr, (uint32_t)(expr));
struct foo {
char a;
int32_t b;
_Alignas(16) float sse_data[4];
};
int main(void) {
EVAL_PRINT(_Alignof(struct foo));
EVAL_PRINT(sizeof(struct foo));
_Alignas(2048) struct foo f; // this instance has even stricter alignment
EVAL_PRINT(_Alignof(f));
EVAL_PRINT(sizeof(f));
return 0;
}
and the result:
_Alignof(struct foo) = 16
sizeof(struct foo) = 32
_Alignof(f) = 2048
sizeof(f) = 32
If you’re pre-C11, then you can use compiler extensions (e.g. for gcc):
// struct alignment
struct sse_data {
float data[4];
} __attribute__ ((aligned (16)));
struct foo {
char a;
int32_t b;
struct sse_data sse_data;
};
// variable alignment
int x __attribute__ ((aligned (64)));
8: You can’t use malloc() for custom alignments
The only guarantee malloc() provides is that the memory will be suitably aligned for any built-in type (that is, to _Alignof(max_align_t)), which is typically 8 or 16 bytes on mainstream platforms.
If you need alignment beyond this, you will have to use a different allocator.
If you’re using C11 or later, then you can use the built-in aligned_alloc function:
#include <stdlib.h>
void *aligned_alloc(size_t alignment, size_t size);
If you are using this, be aware that calling realloc on a pointer returned from aligned_alloc is not guaranteed to preserve the alignment, so it’s better to call aligned_alloc again, copy the data over, and free the old block in that case.
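A minimal usage sketch for aligned_alloc follows. One detail worth knowing is that C11 requires the size to be a multiple of the alignment, so this sketch rounds it up first. The wrapper name alloc_aligned_buffer is hypothetical, and it assumes align is a power of two and that the rounding does not overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper around C11 aligned_alloc.
 * Assumes align is a nonzero power of two. */
void *alloc_aligned_buffer(size_t align, size_t size) {
    /* C11 wants size to be an integral multiple of align, so round it up. */
    size_t rounded = (size + align - 1) & ~(align - 1);
    void *p = aligned_alloc(align, rounded);
    /* On success, the low bits of the address must be zero. */
    assert(p == NULL || ((uintptr_t)p & (align - 1)) == 0);
    return p;  /* release with plain free() */
}
```

For example, alloc_aligned_buffer(64, 100) requests 128 bytes on a 64-byte boundary, and the result is released with free() as usual.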
If you’re not using C11, then you are on your own. You will have to write a custom allocator (or use a third-party one).
Fortunately, it’s not too difficult to write your own version of aligned_alloc. In fact, implementing aligned_alloc (and the corresponding aligned_free) is a very popular interview question for embedded software engineers.
void *aligned_alloc(size_t align, size_t size);
void aligned_free(void* ptr);
A typical implementation will involve calling normal malloc() with some extra space, skipping to the next multiple of the requested alignment, then returning that pointer instead. The tricky part is that when aligned_free() is called, you need to get the original pointer that malloc() returned in order to call free(). There’s a technique where you can store the original pointer just before the aligned pointer in memory. This is a generally useful technique, but it comes up particularly often when designing custom allocators.
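The approach just described can be sketched roughly as follows. This is one possible implementation rather than a reference one; the functions are named my_aligned_alloc and my_aligned_free (hypothetical names) so they don’t clash with the C11 library function, and align is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: over-allocate, round up, and stash malloc's pointer in the
 * slot just before the address handed back to the caller. */
void *my_aligned_alloc(size_t align, size_t size) {
    if (align == 0 || (align & (align - 1)) != 0) {
        return NULL;  /* only nonzero powers of two are supported */
    }
    /* Worst case: align - 1 bytes of slack plus room for the saved pointer. */
    size_t overhead = (align - 1) + sizeof(void *);
    if (size > SIZE_MAX - overhead) {
        return NULL;  /* the request would overflow size_t */
    }
    void *raw = malloc(size + overhead);
    if (raw == NULL) {
        return NULL;
    }
    /* Leave room for the saved pointer, then round up to the alignment. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + align - 1) & ~((uintptr_t)align - 1);
    /* The slot before `aligned` lies inside the block and is itself aligned
     * for a pointer, because malloc's result is. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

void my_aligned_free(void *ptr) {
    if (ptr != NULL) {
        free(((void **)ptr)[-1]);  /* recover what malloc returned */
    }
}
```

The cost is up to align - 1 + sizeof(void *) wasted bytes per allocation, which is usually acceptable for a handful of large, strictly aligned buffers.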
9: Fast power-of-2 alignment
If you’re writing your own custom allocator for alignment reasons, you’re going to want to have a fast way to jump to the next alignment boundary (which should always be a power of 2).
A simple, straightforward approach would look something like this (copied from this great post):
uintptr_t align_forward(uintptr_t ptr, size_t align) {
uintptr_t p, a, modulo;
assert(is_power_of_two(align));
p = ptr;
a = (uintptr_t)align;
// Same as (p % a) but faster as 'a' is a power of two
modulo = p & (a-1);
if (modulo != 0) {
// If 'p' address is not aligned, push the address to the
// next value which is aligned
p += a - modulo;
}
return p;
}
The only real issues with this are the branches that happen in the assert and the if statement.
A more efficient way to implement this, which removes all branching, is:
uintptr_t align_forward(uintptr_t ptr, size_t align) {
return (ptr + align - 1) & ~(align - 1);
}
The second half of the statement (& ~(align - 1)) is rounding down to the nearest multiple of align by setting the bottom bits to zeros.
The first half of the statement is a bit harder to understand.
For example purposes, assume align is 32 and consider the edge cases:
- ptr is already aligned: in this case, adding 31 to ptr and setting the bottom 5 bits to 0 returns the original value of ptr.
- ptr is offset by 1 byte: in this case, when 31 is added, it bumps ptr up exactly to the next multiple of align. Setting the bottom bits to zero has no effect.
- ptr is offset by 31 bytes: in this case, when 31 is added, it bumps ptr past the next multiple of align (next multiple + 30). Setting the bottom bits to 0 rounds it back down (remove the offset of 30).
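The three cases can be checked with concrete numbers. The sketch below repeats the branchless align_forward and adds a made-up self-check function, check_align_forward, using 64 as the aligned starting address and align = 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uintptr_t align_forward(uintptr_t ptr, size_t align) {
    return (ptr + align - 1) & ~((uintptr_t)align - 1);
}

/* Hypothetical self-check for the three edge cases, with align = 32. */
static void check_align_forward(void) {
    /* Already aligned: 64 + 31 = 95, masked back down to 64. */
    assert(align_forward(64, 32) == 64);
    /* Offset by 1: 65 + 31 = 96, the mask changes nothing. */
    assert(align_forward(65, 32) == 96);
    /* Offset by 31: 95 + 31 = 126, masked down to 96. */
    assert(align_forward(95, 32) == 96);
}
```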
References:
- https://en.wikipedia.org/wiki/Data_structure_alignment
- https://www.kernel.org/doc/Documentation/unaligned-memory-access.txt
- https://www.gingerbill.org/series/memory-allocation-strategies/
- https://www.gamedeveloper.com/programming/data-alignment-part-1
- https://www.gamedeveloper.com/programming/data-alignment-part-2-objects-on-the-heap-and-the-stack
- https://embeddedartistry.com/blog/2017/02/22/generating-aligned-memory/