A Brief Look at the Mathematics of Structure Packing
It's common knowledge (depending on what you do) that the memory layout of a structure in C can
change depending on the order its members were declared in. For example, on my x86-64 processor,
sizeof(Foo) is not equal to sizeof(Bar), even though they effectively have the same members. You can try it out yourself on Godbolt.
struct Foo {
char firstChar;
double firstDouble;
char secondChar;
};
struct Bar {
double firstDouble;
char firstChar;
char secondChar;
};
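If you'd like a quick check outside of Godbolt, a small main like the one below (pasted after the two definitions above) prints both sizes. On a typical x86-64 ABI I would expect it to print 24 and 16, but the exact values depend on your compiler and target.

#include <stdio.h>

int main(void) {
    // Prints the two sizes; the exact values are ABI-dependent.
    printf("sizeof(struct Foo) = %zu\n", sizeof(struct Foo));
    printf("sizeof(struct Bar) = %zu\n", sizeof(struct Bar));
}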
It's also common knowledge that ordering the members of your structure from largest alignment to smallest will usually (?) give you a size minimizing layout. If this discussion is new to you, there are lots of good articles and videos on the topic, though "The Lost Art of Structure Packing" seems to be the most popular.
For better or worse, my undergraduate math background led me to be curious about when the strategy above is or isn't optimal. More specifically, I had two questions:
- Does ordering structure members from largest to smallest alignment always give a size minimal layout? As most people know, the answer is no, and it is not hard to construct a counterexample. But we can describe a class of "simple" structures where the answer is always yes!
- Clang's optin.performance.Padding analyzer uses a slightly different algorithm than what is commonly recommended to find an order that minimizes a structure's size. Does this algorithm always find a size minimal layout? It turns out the answer is still no! Once again, we can construct an admittedly contrived counterexample that we can optimize better by hand.
I tried to find pre-existing mathematical answers to the problems above, but I didn't have much luck outside of people concluding the correctness of these algorithms from trying them out on a couple of very simple examples, giving handwavy proofs that missed edge cases, or just calling the problem trivial.
It's almost certain that mathematical answers are available in literature I'm unfamiliar with. But I wasn't able to find them, so my curiosity led me to try and give answers to the problems above on my own. It was definitely a good homework problem for me, at the very least!
The rest of this blog post fills in the details needed for providing an answer to the first question above. We don't need any powerful mathematical tools here - because we add so many restrictions on the problem, a familiarity with modular arithmetic will be enough. Of course, I haven't done any math in a while, so my skills may be rusty and the proofs may contain errors. Please let me know if you find any. 🙂
Disclaimers
This can become a complicated topic if the scope is too wide, so let's narrow the scope a bit.
- I will not analyze structures with bitfield members. From a cursory reading, it seems like the layout of bitfield members in a structure is a complicated implementation-defined topic. I don't really have the knowledge to reason about this in any real way across every potential target out there. So let's ignore this case for now.
- I will not make any claims about the 'performance' of a size minimal layout. Performance is obviously a complicated topic, and what is 'performant' can change depending on the metric you are defining performance by. Even worse, designing good experiments is famously hard! The hope is, however, that a size minimal layout will increase the density of data in cache, which might make your memory-bound workload faster.
  - Figuring out an "optimally performing" layout of a structure seems to be an active area of research. You can search for the keywords structure splitting and field reordering if you're curious. The literature seems to suggest that it's often smarter to find structure layouts accounting for access patterns rather than purely minimizing size.
- I will assume that we care about alignment. If not, we can trivially solve the problem by adding __attribute__((__packed__)) to the structure. There has been some discussion questioning how important alignment is on modern processors, but to my understanding there are still platforms that trap upon unaligned reads or writes. In any case, let's assume we want to find a size minimal layout while respecting alignment requirements.
  - In case you're curious, it seems you can make x86 trap on unaligned reads or writes if you want! The mentioned flag was also present in the manual for my Ryzen 5950x processor, anyway.
Vocabulary
We'll use:
- \(S\) to denote some arbitrary structure with \(n \geq 0\) members.
  - Again, for the sake of simplicity, let's ignore structures with bitfield members for now.
- \(m_i\) to denote the \(i\)th member of the structure, with \(0 \lt i \leq n\).
- \(s_i\) to denote the sizeof member \(m_i\).
- \(a_i\) to denote the alignment of member \(m_i\), where \(a_i=2^{k_i}\) for some integer \(k_i \geq 0\).
  - My hope is that restricting \(a_i\) to be a power of 2 is a reasonable assumption. I don't know if there are any exotic architectures where that doesn't apply, but if there are, this blog post does not apply to those architectures.
- \(a_\text{max}\) to denote the maximum alignment out of all \(a_i\) in \(S\). In other words, \(a_\text{max}\) is the smallest integer such that \(a_\text{max} \geq a_i\) for all \(i\).
- \(p_i\) to denote the padding between members \(m_i\) and \(m_{i+1}\) (or the trailing padding if \(m_i\) is the last member of the structure).
Some might be curious why we make a distinction between the size \(s_i\) and alignment \(a_i\) of structure members, as I see people conflate the two sometimes. This is because they aren't always equal. For example, executing this program on Godbolt shows that when compiling for 32-bit systems with the -m32 flag, gcc will report that sizeof(double) is 8 bytes but alignof(double) is 4 bytes.
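If you'd rather not click through, here is a minimal sketch of such a program. Compiled with gcc -m32 I would expect it to print a size of 8 and an alignment of 4, while a default 64-bit build prints 8 for both; as always, the exact numbers are target-dependent.

#include <stdalign.h>
#include <stdio.h>

int main(void) {
    // With `gcc -m32` this typically reports size 8 and alignment 4;
    // a default x86-64 build reports 8 and 8.
    printf("sizeof(double)  = %zu\n", sizeof(double));
    printf("alignof(double) = %zu\n", alignof(double));
}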
Formulating a mathematical definition of sizeof
We can't do any mathematics here unless we know how to write sizeof
down as an equation.
Let's attempt to formulate one, and we'll then prove an important property of sizeof
that
is normally taken for granted.
A potential ambiguity in the definition of sizeof
First off, the definition of sizeof
from 6.5.3.4, paragraph 4 of the C23 standard (PDF) is:
When [sizeof is] applied to an operand that has structure or union type, the result is the total number of bytes in such an object, including internal and trailing padding.
However, unless I'm grossly misunderstanding something, I think this definition is ambiguous.
For example, consider the following structure Foo
in an x86-64 environment:
struct Foo {
short a; // Takes up bytes 0 and 1
int b; // Pad to a 4 byte boundary, then takes up bytes 4 through 7
short c; // No padding needed. Takes up bytes 8 and 9.
// Trailing padding needed to make sure consecutive copies of Foo in
// an array are aligned. As the alignment of Foo is 4,
// add 2 final bytes of padding.
};
The sizeof
this structure should be 12 bytes. Indeed, this is how sizeof
is normally computed (try it out on Godbolt),
and this computation is correct as long as the structure Foo
is aligned on the largest alignment
of its members, which happens to be \(a_\text{max}=4\).
Here is another way to state this. Let \(M\) be the memory address that Foo
starts at. Then this computation is correct as long
as:
$$M \equiv 0\pmod{4}$$
and we could visualize the sizeof
computation above as follows (hope you can forgive the sloppy Inkscape):
If you're confused why we need to add 2 bytes of trailing padding, suppose that we have an array of Foo. Without the trailing padding in the example above, the int in the second instance of Foo would be misaligned.
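One way to convince yourself of this layout without a diagram is to assert the offsets directly. This is just a sketch - the asserted values assume a typical x86-64 ABI where alignof(int) is 4, not something the language itself guarantees.

#include <stddef.h>

struct Foo {
    short a;
    int b;
    short c;
};

// These offsets assume a typical x86-64 ABI (alignof(int) == 4).
_Static_assert(offsetof(struct Foo, a) == 0, "a occupies bytes 0 and 1");
_Static_assert(offsetof(struct Foo, b) == 4, "b is pushed to the next 4-byte boundary");
_Static_assert(offsetof(struct Foo, c) == 8, "c occupies bytes 8 and 9");
_Static_assert(sizeof(struct Foo) == 12, "2 bytes of trailing padding round the size up to 12");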
However, suppose that we hypothetically had a memory allocator that could allocate a memory address \(M\) such that
$$M \equiv 2\pmod{4}$$
That is, it could allocate memory such that the address it returns would have a remainder of 2 when
divided by 4. Then, if we start Foo
at that memory address, it would have a size of 8 bytes:
- The first member short a would already be aligned, and would take up bytes 2 and 3.
- The second member int b would already be aligned because it would start at a memory address divisible by 4, and would take up bytes 4 through 7.
- The third member short c would already be aligned, and would take up bytes 8 and 9.
- There is no trailing padding needed, since in an array the first instance of Foo would end on a memory address that has remainder 2 when divided by 4, so we can immediately start the next instance of Foo.
We could visualize this computation as follows:
As far as I can see, there is nothing wrong with this scenario besides requiring a hypothetical memory allocator that is flexible enough to support allocations like this.
Indeed, I was curious why a structure is required to have an alignment equal to the largest alignment of its members. I failed to find anything in the standard about this, and the only online resources I could find claimed that it was to allow successive elements in an array of structures to keep their members aligned.
- (StackOverflow) Is the alignof of a struct always the maximum alignof of its constituents?
- (StackOverflow) Why does size of the struct need to be a multiple of the largest alignment of any struct member?
  - Yes, I know C and C++ are very different. I think this discussion is still relevant to C though, and ultimately I'm hoping that our discussion here is abstract enough to be language agnostic.
- (Quora) In C, why must the size of a struct be divisible by the size of its largest member?
However, note that in this constructed example we have a structure that is not aligned by the alignment of its largest member, and it still allows sensible array layouts that guarantee the alignment of all structure members.
Of course, it (maybe?) isn't useful to create a memory allocator that is flexible enough to support the allocations above. The struct Foo has an ordering of its members that ensures that sizeof(struct Foo) = 8 even when \(M \equiv 0 \pmod{4}\):
// Ordering the members of Foo from largest alignment
// to smallest will give it the smallest size possible.
struct Foo {
int b;
short a;
short c;
};
However, I hope the strange example I gave shows that it isn't immediately obvious that sizeof will give the same value no matter what memory address we place the structure at, even if we restrict ourselves to memory addresses that result in usable array layouts.
From this, I claim that sizeof
is, in reality, a function of two arguments when applied to structures:
$$ \text{sizeof}(S,M) $$
where the first argument \(S\) is the structure in question, and the second argument \(M\) is the memory address the structure starts at. We'll resolve this strange inconsistency later, and we'll prove that:
$$\text{sizeof}(S,0) = \text{sizeof}(S,M)$$
if the memory address \(M\) is divisible by the largest alignment \(a_\text{max}\) in the structure \(S\).
In other words, we will mathematically prove what we usually take for granted: that sizeof effectively takes only one argument as long as the structure is aligned in a particular way.
The equation for dsizeof
Before we tackle the problem of sizeof
, I would like to inspect a simpler concept that will
hopefully make the proof of the consistency of sizeof
easier to digest.
We're going to steal a concept from LLVM
known as __datasizeof
, which is defined to be the sizeof
a structure but without the tail padding.
We'll examine the properties of __datasizeof
first to reduce edge cases needed in our analysis of
sizeof
, and we'll denote this function mathematically as \(\text{dsizeof}\).
Similar to sizeof
, it turns out that \(\text{dsizeof}\) is a function of two arguments:
$$ \text{dsizeof}(S,M) $$
where, once again, the first argument \(S\) is the structure in question, and the second argument \(M\) is the memory address the structure starts at.
To compute an example, let's examine our structure Foo
again, and determine what
\(\text{dsizeof}(\text{Foo}, 0)\) ends up being:
struct Foo {
short a; // Takes up bytes 0 and 1
int b; // Pad to a 4 byte boundary, then takes up bytes 4 through 7
short c; // No padding needed. Takes up bytes 8 and 9.
};
This is basically the same computation as before, except we don't add the 2 bytes of trailing padding, so we end up getting:
$$\text{dsizeof}(\text{Foo}, 0) = 10$$
Similarly, if we want to compute \(\text{dsizeof}(\text{Foo}, 2)\), we can recall from the previous section that:
struct Foo {
short a; // Takes up bytes 2 and 3
int b; // No padding needed. Takes up bytes 4 through 7
short c; // No padding needed. Takes up bytes 8 and 9.
};
and so we compute:
$$ \text{dsizeof}(\text{Foo}, 2) = 8 $$
Armed with some examples, let's come up with a general equation for \(\text{dsizeof}\). Let \(M\) be the memory address that a structure \(S\) starts at. Then we can compute \(\text{dsizeof}(S,M)\) as follows:
$$ \text{dsizeof}(S, M) = s_1 + p_1 + s_2 + p_2 + \ldots + s_{n-1} + p_{n-1} + s_n $$
All this is really saying is that the dsizeof
some structure is the sum of the
structure member sizes and the needed padding between those members.
However, for \(0 \lt i \lt n\), we know that \(p_i\) cannot be some arbitrary integer - it must satisfy two constraints. First, the choice of \(p_i\) must make it so the memory address of \(m_{i+1}\) is divisible by \(a_{i+1}\). In other words, the padding must be chosen to make sure the memory address of the structure's next member respects that member's alignment.
If we notice that the expression \(M + s_1 + p_1 + \ldots + s_i + p_i\) represents the starting memory address of member \(m_{i+1}\), we can encode this requirement recursively as:
$$ M + s_1 + p_1 + \ldots + s_i + p_i \equiv 0 \pmod{a_{i+1}} $$
Second, we must ensure that \(p_i\) is the smallest non-negative solution to the above equation. This is because if we have some integer solution \(p_i=k\) to the above equation, then \(p_i=k+ca_{i+1}\) is also a solution for any arbitrary integer \(c\). So we add the final restriction on the value of each \(p_i\):
$$ 0 \leq p_i \lt a_{i+1} $$
which guarantees the uniqueness of \(p_i\).
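If it helps to see the recurrence in code, here is a small sketch (the function name and signature are mine, not from any real tool) that walks the members in declaration order and accumulates the sizes plus the minimal paddings defined above. Plugging in the sizes and alignments of the Foo example gives 10 for \(M=0\) and 8 for \(M=2\), matching the hand computations.

#include <stddef.h>

// Sketch of the dsizeof recurrence. s[i] and a[i] are the size and
// (power-of-two) alignment of member m_{i+1} in declaration order, and M
// is the address the structure starts at. For each member after the
// first, we add the smallest non-negative padding p that realigns it.
static size_t dsizeof_sketch(const size_t *s, const size_t *a, size_t n, size_t M) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0) {
            // Choose p so that (M + total + p) % a[i] == 0 and 0 <= p < a[i].
            size_t misalign = (M + total) % a[i];
            if (misalign != 0)
                total += a[i] - misalign;
        }
        total += s[i];
    }
    return total;
}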
Proving the consistency of dsizeof
As an intermediary step to proving the consistency of sizeof
, we would like to prove the
following lemma:
Lemma 1: Let \(M\) be any memory address evenly divisible by \(a_\text{max}\), the largest alignment in \(S\). Then we have that: $$ \text{dsizeof}(S,0) = \text{dsizeof}(S,M) $$
Proof: We'll need to come up with some notation to differentiate the paddings on both sides of the equation. For \(0 \lt i \lt n\), we'll write \(p_i\) to denote the paddings inside of \(\text{dsizeof}(S, 0)\), and we'll write \(b_i\) to denote the paddings inside of \(\text{dsizeof}(S, M)\). Thus, our equations become:
$$ \text{dsizeof}(S, 0) = s_1 + p_1 + s_2 + p_2 + \ldots + s_{n-1} + p_{n-1} + s_n $$ $$ \text{dsizeof}(S, M) = s_1 + b_1 + s_2 + b_2 + \ldots + s_{n-1} + b_{n-1} + s_n $$
All we need to show is that for any \(0 \lt i \leq n-1 \), we have \(b_i = p_i\), at which point we know that \(\text{dsizeof}(S, 0) = \text{dsizeof}(S, M)\).
First, recall our restrictions that each padding must satisfy. We know that for each \(p_i\) and \(b_i\):
$$ \begin{align} 0 + s_1 + p_1 + \ldots + s_i + p_i \equiv 0 \pmod{a_{i+1}} \\ M + s_1 + b_1 + \ldots + s_i + b_i \equiv 0 \pmod{a_{i+1}} \end{align} $$ $$ \begin{align*} 0 \leq p_i \lt a_{i+1} \\ 0 \leq b_i \lt a_{i+1} \end{align*} $$
However, in the case of equation (2), we know that by definition \(M\) is divisible by the greatest alignment \(a_{\text{max}}=2^{k_{\text{max}}}\) in \(S\). Since we know that each \(a_i=2^{k_i} \leq 2^{k_{\text{max}}}\), that means that \(M\) is divisible by any \({a_i}\), and so \(M \equiv 0 \pmod{a_i}\) for all \(i\).
So we can instead transform equation (2) into:
$$ \begin{align*} M + s_1 + b_1 + \ldots + s_i + b_i \equiv 0 \pmod{a_{i+1}} \\ 0 + s_1 + b_1 + \ldots + s_i + b_i \equiv 0 \pmod{a_{i+1}} \\ s_1 + b_1 + \ldots + s_i + b_i \equiv 0 \pmod{a_{i+1}} \end{align*} $$
We now have the pair of equations below that we can do mathematical induction over to show that each \(b_i = p_i\):
$$ \begin{align} s_1 + p_1 + \ldots + s_i + p_i \equiv 0 \pmod{a_{i+1}} \\ s_1 + b_1 + \ldots + s_i + b_i \equiv 0 \pmod{a_{i+1}} \end{align} $$
Let's start with the case of \(i=1\). Then we have the equations:
$$ \begin{align} s_1 + p_1 \equiv 0 \pmod{a_2} \\ s_1 + b_1 \equiv 0 \pmod{a_2} \end{align} $$
However, since the lefthand sides of (5) and (6) are both equivalent to \(0 \pmod{a_2}\), we can then write: $$ \begin{align} s_1 + p_1 \equiv s_1 + b_1 \pmod{a_2} \\ p_1 \equiv b_1 \pmod{a_2} \end{align} $$
So we know that \(p_1\) and \(b_1\) are in the same equivalence class. However, by our constraints on the padding between structure members, we know that \(0 \leq p_1 \lt a_2\), and \(0 \leq b_1 \lt a_2\). From there, we know that \(p_1=b_1\), and we are done with the base case.
Now, we need to handle the induction step. We need to show that, for any \(1 \lt i \leq n - 1\), if \(p_{j}=b_{j}\) for all \(1 \leq j \lt i\), then \(p_{i}=b_{i}\). However, we can solve this by similar techniques used to prove the base case. Recall from equations (3) and (4) that we have:
$$ \begin{align*} s_1 + p_1 + \ldots + s_i + p_i \equiv 0 \pmod{a_{i+1}} \\ s_1 + b_1 + \ldots + s_i + b_i \equiv 0 \pmod{a_{i+1}} \end{align*} $$
Once again, since the lefthand sides of (3) and (4) are both equivalent to \(0 \pmod{a_{i+1}}\), we can set them equal to each other:
$$ \begin{equation} s_1 + b_1 + \ldots + s_i + b_i \equiv s_1 + p_1 + \ldots + s_i + p_i \pmod{a_{i+1}} \end{equation} $$
And by the induction hypothesis we know that \(b_j=p_j\) for each \(j \lt i\), so we can rewrite (9) by substituting each of the \(b_j\) on the lefthand side with \(p_j\):
$$ \begin{equation} s_1 + p_1 + \ldots + s_i + b_i \equiv s_1 + p_1 + \ldots + s_i + p_i \pmod{a_{i+1}} \end{equation} $$
so we can subtract the terms \(s_1 + p_1 + \ldots + s_i\) from both sides of (10) to get:
$$ \begin{equation} b_i \equiv p_i \pmod{a_{i+1}} \end{equation} $$
So \(b_i\) and \(p_i\) are in the same equivalence class. However, once again, because \(0 \leq b_i \lt a_{i+1}\), and \(0 \leq p_i \lt a_{i+1}\), we know that \(p_i=b_i\) for any \(i\). That completes the induction, and since all of the paddings are pairwise equal, we are done with the proof. \(\blacksquare\)
The equation for sizeof
Okay, we now have a mathematical formula for \(\text{dsizeof}(S,M)\) and we know it's consistent when we have
certain restrictions on the memory address \(M\). We would like to use this to come up with an
equation for sizeof
.
This should be simple enough. Since \(\text{dsizeof}\) is simply sizeof
without the
trailing padding, we can write:
$$ \text{sizeof}(S,M) = \text{dsizeof}(S,M) + p $$
where \(p\) represents the trailing padding of the structure. Once again, \(p\) cannot be arbitrary. If we denote by \(a_{\text{max}}\) the maximum alignment of all structure members in \(S\), we must choose \(p\) such that \(0 \leq p \lt a_\text{max}\) and:
$$ M + \text{dsizeof}(S,M) + p \equiv 0 \pmod{a_\text{max}} $$
so that in an array, the next instance of \(S\) will be aligned.
However, recall that the above statement isn't completely correct, as it assumes that the only valid
alignment for \(S\) is for it to be aligned on \(a_\text{max}\). As the discussion in
"A potential ambiguity in the definition of sizeof
" shows,
we can have "valid" alignments of \(S\) that are not just the largest alignment in \(S\).
In order to simplify things, I will again restrict the scope of this blog post - we will only study the mathematics of structures aligned on the largest alignment of their members. Thus, the equation above becomes valid again.
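Under that restriction, the trailing padding is just one more application of the same rounding step, so a sketch of sizeof on top of the dsizeof helper above could look like this (again, the names are purely illustrative):

#include <stddef.h>

// Sketch: given dsizeof(S, M) and a_max, add the smallest trailing padding p
// with 0 <= p < a_max such that M + dsizeof(S, M) + p is a multiple of a_max.
// Assumes M itself is a multiple of a_max, per the restriction adopted above.
static size_t sizeof_sketch(size_t dsize, size_t a_max, size_t M) {
    size_t end = M + dsize;
    size_t misalign = end % a_max;
    size_t trailing = (misalign == 0) ? 0 : a_max - misalign;
    return dsize + trailing;
}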
Proving the consistency of sizeof
We finally get to the important lemma that we need before we can even begin to think about finding
the minima of sizeof
.
Lemma 2: Let \(M\) be any memory address evenly divisible by \(a_\text{max}\), the largest alignment in \(S\). Then we have that: $$ \text{sizeof}(S,0) = \text{sizeof}(S,M) $$ Proof: Expanding the definition of \(\text{sizeof}\), we have:
$$ \begin{align} \text{sizeof}(S,0) = \text{dsizeof}(S,0) + p \\ \text{sizeof}(S,M) = \text{dsizeof}(S,M) + b \end{align} $$
where \(p\) and \(b\) respectively must satisfy the constraints:
$$ \begin{align} \text{dsizeof}(S,0) + p \equiv 0 \pmod{a_\text{max}} \\ M + \text{dsizeof}(S,M) + b \equiv 0 \pmod{a_\text{max}} \end{align} $$
Recall that by Lemma 1 we know that \(\text{dsizeof}(S,0)=\text{dsizeof}(S,M)\), so it suffices to show that the trailing paddings \(p\) and \(b\) are equal.
By our choice of \(M\), we know that \(M \equiv 0 \pmod{a_\text{max}}\), and so with (14) and (15) we have:
$$ \begin{align} \text{dsizeof}(S,0) + p \equiv \text{dsizeof}(S,M) + b \pmod{a_\text{max}} \\ p \equiv b \pmod{a_\text{max}} \end{align} $$
And since we have both \(0 \leq p \lt a_\text{max}\) and \(0 \leq b \lt a_\text{max}\), we know that \(p=b\) and we are done. \(\blacksquare\)
When is ordering by alignment optimal?
We finally have a good mathematical definition of sizeof
, and we have shown that the value of
sizeof
is consistent as long as we align the structure \(S\) on the largest alignment of its
members \(a_\text{max}\). Let's try to find out when the value of sizeof
is minimized.
Primitive structures
To keep our mathematics simple, we're going to restrict the class of structures that we study once more. To start with, we're going to define a primitive structure as any structure where each member \(m_i\) satisfies \(s_i = ca_i\) for some integer \(c \geq 0\). In other words, the size of each structure member is a multiple of that member's alignment.
Loosely speaking, this is a structure whose members are all primitives, and so is one of the
simplest structures we can reason about. Furthermore, no structure members have "unusual alignments"
that were manually given to them by the programmer through specifiers such as alignas
.
For example, Foo
and Bar
would be primitive structures, but Baz
is not:
// Nothing out of the ordinary here, this is definitely primitive.
struct Foo {
int first;
double second;
char third;
};
// First off, recall when targeting 32-bit x86, GCC asserts that double has an alignment
// of 4, so this specifier is not a no-op. Secondly, even though the programmer manually
// overrides the alignment, this is still a primitive structure as the size is a multiple
// of its alignment.
struct Bar {
int first;
alignas(8) double second;
};
// This is definitely not primitive: assuming we are on a 32-bit or 64-bit system,
// the double's size of 8 is not a multiple of its specified alignment of 32.
struct Baz {
int first;
alignas(32) double second;
};
You might have already noticed that it is not just structures with primitive members that can be considered primitive structures - structures with fixed array members and nested structures qualify as well, as long as they weren't given unusual alignments. However, we'll get to that later.
Ordering members of a primitive structure by alignment minimizes dsizeof
We're going to prove an intermediary lemma. I've seen people online use this style of argument to justify that ordering the members of a structure from largest to smallest alignment optimizes the size, albeit in a handwavy way. We'll fill in the skipped details here.
Lemma 3: Let \(S\) be a primitive structure aligned on \(a_\text{max}\), the largest alignment in \(S\). Ordering the members of \(S\) from largest to smallest alignment will minimize the value of \(\text{dsizeof(S, M)}\).
Proof: Recall the definition of \(\text{dsizeof}(S, M)\):
$$ \text{dsizeof}(S, M) = s_1 + p_1 + s_2 + p_2 + \ldots + s_{n-1} + p_{n-1} + s_n $$
It suffices to show that ordering the members of \(S\) from the largest to smallest alignment will make each of the intermediary paddings \(p_i=0\), for \(0 \lt i \leq n-1\). We do this by mathematical induction.
First, we prove the base case. We want to show that if we choose an ordering of structure members such that \(a_1 \geq a_2\), then \(p_1=0\), where \(p_1\) must satisfy:
$$ \begin{align} M + s_1 + p_1 \equiv 0 \pmod{a_2} \\ 0 + s_1 + p_1 \equiv 0 \pmod{a_2} \\ s_1 + p_1 \equiv 0 \pmod{a_2} \end{align} $$
However, since \(S\) is a primitive structure, we know that \(s_1=c_1a_1\) for some positive integer \(c_1\). Since \(2^{k_1} = a_1 \geq a_2 = 2^{k_2}\), we know that \(a_1\) is evenly divisible by \(a_2\), and thus \(s_1\) is evenly divisible by \(a_2\).
However, by equation (20) we know that \(p_1 \equiv 0 \pmod{a_2}\) and since \(0 \leq p_1 \lt a_2\) we immediately know that \(p_1=0\).
Next, we prove the induction step. We need to show that, for any \(1 \lt i \leq n-1\), if \(p_{j}=0\) for all \(1 \leq j \lt i\), then \(p_{i}=0\). Recall that \(p_i\) must satisfy the following:
$$ \begin{align} M + s_1 + p_1 + \ldots + s_{i-1} + p_{i-1} + s_i + p_i \equiv 0 \pmod{a_{i+1}} \\ 0 + s_1 + p_1 + \ldots + s_{i-1} + p_{i-1} + s_i + p_i \equiv 0 \pmod{a_{i+1}} \\ s_1 + p_1 + \ldots + s_{i-1} + p_{i-1} + s_i + p_i \equiv 0 \pmod{a_{i+1}} \end{align} $$
But since all \(p_j=0\), we know from equation (23) that:
$$ s_1 + s_2 + \ldots + s_{i-1} + s_i + p_i \equiv 0 \pmod{a_{i+1}} $$
and since each \(s_i=c_ia_i\), we can rewrite the above as:
$$ \begin{equation} c_1a_1 + c_2a_2 + \ldots + c_{i-1}a_{i-1} + c_ia_i + p_i \equiv 0 \pmod{a_{i+1}} \end{equation} $$
Recall that since we have chosen an order of structure members that goes from largest alignment to smallest alignment, we also have:
$$ a_1 \geq a_2 \geq a_3 \geq \ldots \geq a_{i-1} \geq a_i \geq a_{i+1} $$
And since each alignment is a power of two, we know that \(a_{i+1}\) must evenly divide \(c_1a_1 + c_2a_2 + \ldots + c_{i-1}a_{i-1} + c_ia_i\). Putting this together with equation (24) we immediately know that:
$$ p_i \equiv 0 \pmod{a_{i+1}} $$
and we are done, as \(p_i=0\) by our constraints on \(p_i\). \(\blacksquare\)
On minimizing structure size on ternary computers
Notice that, in the above proof, the fact that each \(a_i\) was a power of 2 didn't play any particularly important role, beyond ensuring that \(a_j\) is divisible by \(a_i\) whenever \(i \geq j\).
Assuming a hypothetical ternary computer follows similar rules for alignment as today's binary computers, that seems to imply that the above lemma would be true even if each \(a_i=3^{k_i}\) for some positive \(k_i\). In other words, on such a ternary computer ordering the members of a structure from largest to smallest alignment would minimize \(\text{dsizeof}(S,M)\).
In fact, we could extrapolate that the above lemma would be true as long as each \(a_i=b^{k_i}\) for some base \(b\), and so ordering the members of a primitive structure aligned on \(a_\text{max}\) from largest to smallest alignment would minimize the size even on computers based on powers of \(b\).
Ordering members of a primitive structure by alignment minimizes sizeof
With Lemma 3 under our belt, proving that ordering the members of a primitive structure by alignment will minimize sizeof becomes significantly easier.
Lemma 4: Let \(S\) be a primitive structure aligned on \(a_\text{max}\), the largest alignment in \(S\). Ordering the members of \(S\) from largest to smallest alignment will minimize the value of \(\text{sizeof}(S, M)\).
Proof: Let \(S_\alpha\) be the structure formed from taking the members of \(S\) and ordering them from the largest to smallest alignment, and let \(S_\beta\) be the structure formed from any permutation of the members of \(S\). We wish to prove that:
$$\text{sizeof}(S_\alpha, M) \leq \text{sizeof}(S_\beta, M)$$
Expanding the definition of \(\text{sizeof}\), we know that:
$$ \begin{align} \text{sizeof}(S_\alpha,M) = \text{dsizeof}(S_\alpha,M) + p_\alpha \\ \text{sizeof}(S_\beta,M) = \text{dsizeof}(S_\beta,M) + p_\beta \end{align} $$
Furthermore, by our constraints on the tail padding \(p_\alpha\), we know that:
$$ \begin{align} M + \text{dsizeof}(S_\alpha,M) + p_\alpha \equiv 0 \pmod{a_\text{max}} \\ \text{dsizeof}(S_\alpha,M) + p_\alpha \equiv 0 \pmod{a_\text{max}} \end{align} $$
However, if we notice the lefthand side of equation (28) is just the definition of \(\text{sizeof}(S_\alpha, M)\), we know that:
$$ \text{sizeof}(S_\alpha, M) \equiv 0 \pmod{a_\text{max}} $$
And in particular, it means that \(\text{sizeof}(S_\alpha,M) = c_{\alpha}a_\text{max}\) for some positive integer \(c_\alpha\). By a similar argument, we can conclude that \(\text{sizeof}(S_\beta,M) = c_{\beta}a_\text{max}\) for some positive integer \(c_\beta\).
However, recall that \(0 \leq p_\alpha \lt a_\text{max}\), and \(0 \leq p_\beta \lt a_\text{max}\). We can put this information together with equation (28) to notice that:
$$ \begin{equation} (c_{\alpha} - 1)a_\text{max} \lt \text{dsizeof}(S_\alpha, M) \leq c_{\alpha}a_\text{max} \end{equation} $$
and we can copy-paste that argument to come to a similar conclusion about \(\text{dsizeof}(S_\beta, M)\):
$$ \begin{equation} (c_{\beta} - 1)a_\text{max} \lt \text{dsizeof}(S_\beta, M) \leq c_{\beta}a_\text{max} \end{equation} $$
By Lemma 3, we know that \(\text{dsizeof}(S_\alpha, M) \leq \text{dsizeof}(S_\beta, M)\), and if we put that together with equations (29) and (30) we know that:
$$ \begin{gather} (c_{\alpha} - 1)a_\text{max} \lt \text{dsizeof}(S_\alpha, M) \leq \text{dsizeof}(S_\beta, M) \leq c_{\beta}a_\text{max} \\ (c_{\alpha} - 1)a_\text{max} \lt c_{\beta}a_\text{max} \\ (c_{\alpha} - 1) \lt c_{\beta} \end{gather} $$
And since \(c_{\alpha}\) is the smallest integer greater than \(c_{\alpha} - 1\), we in particular have that \(c_{\alpha} \leq c_{\beta}\). Thus:
$$ \text{sizeof}(S_\alpha,M) = c_{\alpha}a_\text{max} \leq c_{\beta}a_\text{max} = \text{sizeof}(S_\beta,M) $$
and we are done with the proof. \(\blacksquare\)
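Lemma 4 is also easy to spot-check empirically. The sketch below (all of the names are mine) brute-forces every permutation of a small set of primitive members, computes the layout size with the formulas above, and prints both results so you can check that the descending-alignment order matches the best permutation. It's a sanity check, not a proof.

#include <stddef.h>
#include <stdio.h>

#define N 4

// Sizes and alignments of a small primitive structure's members, e.g.
// double, int, short, char on a typical 64-bit ABI, listed in
// descending alignment order.
static const size_t SIZES[N]  = {8, 4, 2, 1};
static const size_t ALIGNS[N] = {8, 4, 2, 1};

// sizeof of the structure laid out in the order given by perm[],
// starting at an address divisible by a_max (so we can just use 0).
static size_t layout_size(const size_t *perm) {
    size_t total = 0, a_max = 1;
    for (size_t i = 0; i < N; i++) {
        size_t a = ALIGNS[perm[i]];
        if (a > a_max) a_max = a;
        if (total % a != 0) total += a - total % a;          // intermediate padding
        total += SIZES[perm[i]];
    }
    if (total % a_max != 0) total += a_max - total % a_max;  // trailing padding
    return total;
}

static size_t best;

// Enumerate every permutation of perm[k..N-1] and track the smallest layout.
static void permute(size_t *perm, size_t k) {
    if (k == N) {
        size_t sz = layout_size(perm);
        if (sz < best) best = sz;
        return;
    }
    for (size_t i = k; i < N; i++) {
        size_t tmp = perm[k]; perm[k] = perm[i]; perm[i] = tmp;
        permute(perm, k + 1);
        tmp = perm[k]; perm[k] = perm[i]; perm[i] = tmp;
    }
}

int main(void) {
    size_t perm[N] = {0, 1, 2, 3};  // already sorted by descending alignment
    size_t sorted = layout_size(perm);
    best = (size_t)-1;
    permute(perm, 0);
    printf("sorted-by-alignment size: %zu, best over all permutations: %zu\n",
           sorted, best);
}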
What other structures can be classified as primitive structures?
Remember that our only requirement for a structure to be a primitive structure is that all members \(m_i\) satisfy \(s_i = ca_i\) for some \(c \geq 0\).
If a member is a primitive data type, these conditions are surely true. But these conditions also hold if a structure member is a structure itself. This is because the starting address of a structure \(S\) is always some multiple of \(a_\text{max}\), and since we choose trailing padding for the structure so that the structure ends on a multiple of \(a_\text{max}\), the difference between the starting and ending memory addresses of \(S\) reveals that the size of \(S\) is also a multiple of \(a_\text{max}\).
The above conditions also hold true if a structure member is a fixed array of primitive members. This is because the alignment of the array is just the alignment of the primitive member itself, and the size of the array is just a multiple of the size of the primitive member.
In other words, Lemma 4 is applicable to a wider variety of structures than just those whose members are primitive data types.
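As a quick sanity check of that claim, the following asserts should hold on a typical implementation. The structures here are just made up for this example.

#include <stdalign.h>

// Hypothetical structures for this check.
struct Inner {
    double d;
    int i;
};

struct Outer {
    struct Inner nested;   // a nested structure member
    int counts[3];         // a fixed array member
};

// For both kinds of members, sizeof is a multiple of alignof, which is
// exactly the "primitive structure" condition s_i = c * a_i.
_Static_assert(sizeof(struct Inner) % alignof(struct Inner) == 0,
               "a nested struct member satisfies s = c * a");
_Static_assert(sizeof(int[3]) % alignof(int[3]) == 0,
               "a fixed array member satisfies s = c * a");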
Counterexample to the 'ordering by alignment' algorithm
It's easy to generate a counterexample showing that the 'ordering by alignment' algorithm doesn't always minimize the size of a structure. For example, consider the following structure on a 32-bit system:
struct Foo {
__attribute__((aligned(8))) int someInt;
__attribute__((aligned(8))) double someDouble;
short someShort;
};
It is clear that the members are sorted by alignment, but we can find a layout of Foo that is smaller than the one above:
struct Foo {
__attribute__((aligned(8))) int someInt;
short someShort;
__attribute__((aligned(8))) double someDouble;
};
You can try both of these examples on Godbolt.
This might imply that some kind of "gap-filling" algorithm would always give the optimal output. However, you have to be careful here - it is possible that "filling the gap" earlier might result in an unfillable gap later that's larger than the gap we were trying to fill. Clang implements an algorithm along these lines, and while it's mostly correct, we can still find structures where it fails to find the optimal size.
Counterexample to Clang's optin.performance.Padding Analyzer
As mentioned before, Clang comes with an optional analyzer that attempts to check for excessively padded structures. The comments in the optin.performance.Padding implementation give us hints about what it tries to do.
First, here is the promised counterexample on x86-64. The construction is admittedly contrived, and I'm not sure if there are any real life structures that would have a shape similar to it. In any case, we know that Clang's algorithm is not always correct (but is probably good enough for the vast majority of usecases).
I tried to get the static checkers working in Godbolt, but failed for some reason. So I hope you'll be willing to try this on your own machine.
#include <stdint.h>
#include <stdio.h>
struct Foo {
__attribute__((aligned(8))) uint32_t id;
__attribute__((aligned(8))) uint8_t flags[9];
struct {
uint16_t foo;
uint16_t bar;
uint16_t blah;
};
};
struct FooInefficient {
__attribute__((aligned(8))) uint32_t id;
struct {
uint16_t foo;
uint16_t bar;
uint16_t blah;
};
__attribute__((aligned(8))) uint8_t flags[9];
};
// Clearly "inefficient" structure to test that clang's
// static analyzer is running at all.
struct Baz {
char firstChar;
double firstDouble;
char secondChar;
};
int main(void) {
printf("Size of struct Foo: %zu\n", sizeof(struct Foo));
printf("Size of struct FooInefficient: %zu\n", sizeof(struct FooInefficient));
}
To run the clang
padding checker, you can invoke the following command, assuming you copied the
above code into the file main.c
:
clang --analyze -Xclang -analyzer-checker=optin.performance.Padding -Xclang -analyzer-config -Xclang optin.performance.Padding:AllowedPad=0 main.c
On my x86-64 machine, notice how there are only warnings for the structure Baz
, and none for FooInefficient
:
$ clang --analyze -Xclang -analyzer-checker=optin.performance.Padding -Xclang -analyzer-config -Xclang optin.performance.Padding:AllowedPad=0 main.c
main.c:24:8: warning: Excessive padding in 'struct Baz' (14 padding bytes, where 6 is optimal). Optimal fields order: firstDouble, firstChar, secondChar, consider reordering the fields or adding explicit padding members [optin.performance.Padding]
24 | struct Baz {
| ~~~~~~~^~~~~
25 | char firstChar;
| ~~~~~~~~~~~~~~~
26 | double firstDouble;
| ~~~~~~~~~~~~~~~~~~~
27 | char secondChar;
| ~~~~~~~~~~~~~~~~
28 | };
| ~
1 warning generated.
However, if you run the compiled binary, we can see that FooInefficient
is clearly larger than Foo
, and
the static analyzer did not realize that there existed a smaller layout for FooInefficient
:
$ ./a.out
Size of struct Foo: 24
Size of struct FooInefficient: 32
Further Questions
There were a couple of interesting questions to explore, but I decided against including them in this blog post as it was already getting long.
Ability of other field reordering algorithms to minimize size
While we gave a counterexample to the correctness of Clang's algorithm, it would be interesting to mathematically characterize a class of structures whose size Clang's algorithm always minimizes. It would also be interesting to analyze the algorithm of pahole, although I vaguely remember it not accounting for alignas specifiers (you should double check, I could be remembering incorrectly).
Furthermore, the problem of reordering field members to minimize size is clearly not unique to C. It would be interesting to:
- Analyze the ability of Java's layout algorithm to minimize size
- There is a thorough writeup about this in Oracle's Java Bug Database if you're curious.
- As mentioned earlier, it seems like most research into Java layout organization seems to favor layouts that are friendly to the access patterns of the program. This is probably better for performance than purely trying to minimize size, and so perhaps this line of questioning isn't useful.
- Analyze the ability of Rust's layout algorithm to minimize size
- Again, there is a thorough writeup about this by Austin Hicks if you're curious (although probably a bit outdated).
- Try to come up with a layout algorithm for C++ classes, and characterize what kind of data
structures it will create an optimal layout for.
- This might be trivial to do once we know how to optimize a C structure layout, but I don't really know enough about the memory layout of C++ classes to say.
Usefulness of memory allocators flexible enough to allocate memory at exotic alignments
Earlier, we noticed that there are valid alignments for a structure \(S\) other than just \(a_\text{max}\).
In particular, in our discussion in "A potential ambiguity in the definition of sizeof", we saw
that a given layout of Foo
could be denser depending on what memory address we started it out on.
It turned out not to be useful in that case, because we found a layout of Foo that was just as dense at \(M \equiv 0 \pmod{4}\) as the initial layout of Foo was at \(M \equiv 2 \pmod{4}\). However, this raises another question.
Suppose that a structure \(S\) has "valid alignments" at both memory addresses \(M_\alpha\) and \(M_\beta\) such that:
$$ \begin{gather*} M_\alpha \equiv 0 \pmod{a_\text{max}} \\ M_\beta \equiv b \pmod{c} \end{gather*} $$
for some positive integers \(b\) and \(c\). Is it true that the minimal size that we could find when starting \(S\) at \(M_\alpha\) is equal to the minimal size that we could find when starting \(S\) at \(M_\beta\)? Can we ever find denser/smaller layouts of \(S\) at \(M_\beta\) than we could at \(M_\alpha\)?
The answer to the above question might have interesting implications on how flexible memory allocator implementations should be!