Optimizing Compilers are Cool (but not perfect)

04 Dec 2015

Note: This is a somewhat older post. The information in it may no longer be accruate.

Update: As of `clang-1500.0.40.1` (2022), LLVM still does not seem to implement any of these optimizations.

I’ve long thought that programmers are too eager to apply optimizations that compilers can trivially generate, with the result that their code is needlessly opaque. For example, I’d be shocked if any optimizing compiler didn’t replace a divide by a power of 2 with a shift. As a result, I’ve never liked the practice of writing:

int halfX = x >> 1;
int xMod128 = x & 127;

instead of

int halfX = x / 2;
int xMod128 = x % 128;

But lately I’ve started to rethink that position. For reference, I used IAR Embedded Workbench for ARM version 7.40.3 targeting a Cortex-M4 (with a VFPv4 single precision FPU), with the optimization set to high / speed (roughly equivalent to -O2 in GCC). The compiler seems to handle simple methods pretty well. Here are two methods. They convert fixed point values with, respectively, 15 digits and 14 digits after the radix point into single-precision floating point values.

float32_t q0115ToFloat(int16_t q0115Value) {
 return q0115Value / 32768.f;
}
float32_t q0214ToFloat(int16_t q0214Value) {
 return q0214Value / 16384.f;
}

And here’s the generated assembler:

00000000 :
0: ee00 0a10 vmov s0, r0
4: eeba 0a60 vcvt.f32.s16 s0, s0, #15
8: 4770 bx lr
0000000a :
a: ee00 0a10 vmov s0, r0
e: eeba 0a41 vcvt.f32.s16 s0, s0, #14
12: 4770 bx lr

This is a pretty useful optimization, The compiler realized that the division could be accomplished with a single VCVT instruction, instead of performing a VCVT followed by a multiply (or even a divide, if it was really not helping at all). And this conversion happens all the time. For example, here’s the source for the ARM CMSIS DSP library method, arm_q15_to_float, which is just an unrolled loop executing the above method (some edits to remove irrelevant code):

void _arm_q15_to_float(
 q15_t * pSrc,
 float32_t * pDst,
 uint32_t blockSize)
{
 q15_t *pIn = pSrc; /* Src pointer */
 uint32_t blkCnt; /* loop counter */
 /* Run the below code for Cortex-M4 and Cortex-M3 */
 /*loop Unrolling */
 blkCnt = blockSize >> 2u;
 /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
 ** a second loop below computes the remaining 1 to 3 samples. */
 while(blkCnt > 0u)
 {
 /* C = (float32_t) A / 32768 */
 /* convert from q15 to float and then store the results in the destination buffer */
 *pDst++ = ((float32_t) * pIn++ / 32768.0f);
 *pDst++ = ((float32_t) * pIn++ / 32768.0f);
 *pDst++ = ((float32_t) * pIn++ / 32768.0f);
 *pDst++ = ((float32_t) * pIn++ / 32768.0f);
 /* Decrement the loop counter */
 blkCnt--;
 }
 /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
 ** No loop unrolling is used. */
 blkCnt = blockSize % 0x4u;
 while(blkCnt > 0u)
 {
 /* C = (float32_t) A / 32768 */
 /* convert from q15 to float and then store the results in the destination buffer */
 *pDst++ = ((float32_t) * pIn++ / 32768.0f);
 /* Decrement the loop counter */
 blkCnt--;
 }
}

How useful is the optimization? I threw together a quick benchmark, which involved filling a 2048-length array of int16_t with random values, and then converting it to float32_t using the above library method. I ran the tests on an STM32F411CC processor, running at 100Mhz. For the first, I naively executed a division for every sample (which turns into a VCVT followed by a VDIV). For the second, I replaced the division by a constant by a multiply by the inverse (which turns into a VCVT followed by a VMUL). For the third, I replaced the entire thing with a single VCVT.

Generated Code	Throughput (ticks / sample)
VCVT followed by VDIV	20.25
VCVT followed by VMUL by 1.f / 32768.f	9.44
VCVT only	7.37

So division is really expensive, which isn’t all that surprising. But even a single extraneous multiply results in a 28% decrease in throughput.

So why the ambivalence in the title? Well, one day I decided that I’d like more than one bit of precision before the radix. No problem. I just copied the existing method and replaced the constant by 16384.f (which is 2^14):

void _arm_q14_to_float(
 q15_t * pSrc,
 float32_t * pDst,
 uint32_t blockSize)
{
 q15_t *pIn = pSrc; /* Src pointer */
 uint32_t blkCnt; /* loop counter */
 /* Run the below code for Cortex-M4 and Cortex-M3 */
 /*loop Unrolling */
 blkCnt = blockSize >> 2u;
 /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
 ** a second loop below computes the remaining 1 to 3 samples. */
 while(blkCnt > 0u)
 {
 /* C = (float32_t) A / 16384 */
 /* convert from q15 to float and then store the results in the destination buffer */
 *pDst++ = ((float32_t) * pIn++ / 16384.0f);
 *pDst++ = ((float32_t) * pIn++ / 16384.0f);
 *pDst++ = ((float32_t) * pIn++ / 16384.0f);
 *pDst++ = ((float32_t) * pIn++ / 16384.0f);
 /* Decrement the loop counter */
 blkCnt--;
 }
 /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
 ** No loop unrolling is used. */
 blkCnt = blockSize % 0x4u;
 while(blkCnt > 0u)
 {
 /* C = (float32_t) A / 16384 */
 /* convert from q15 to float and then store the results in the destination buffer */
 *pDst++ = ((float32_t) * pIn++ / 16384.0f);
 /* Decrement the loop counter */
 blkCnt--;
 }
}

And as we determined earlier, a division by 16384.f turns into a single VCVT. So I spun it up, and… throughput decreased by about 30%! . What happened? Here’s the disassembled output (Address 0x21c corresponds to the start of the first loop body, and s0 contains the constant 16384.f):


00000208 :
 208: b430 push {r4, r5}
 20a: 0893 lsrs r3, r2, #2
 20c: f000 80db beq.w 3c6
 210: f013 0403 ands.w r4, r3, #3
 214: ed9f 0aff vldr s0, [pc, #1020] ; 7f4
 218: f000 802b beq.w 272
 21c: f930 5b02 ldrsh.w r5, [r0], #2
 220: ee00 5a90 vmov s1, r5
 224: eef8 0ae0 vcvt.f32.s32 s1, s1 
 228: ee60 0a80 vmul.f32 s1, s1, s0 
 22c: edc1 0a00 vstr s1, [r1]
 230: f930 5b02 ldrsh.w r5, [r0], #2
 234: ee00 5a90 vmov s1, r5
 238: eef8 0ae0 vcvt.f32.s32 s1, s1 
 23c: ee60 0a80 vmul.f32 s1, s1, s0 
 240: edc1 0a01 vstr s1, [r1, #4]
 244: f930 5b02 ldrsh.w r5, [r0], #2
 248: ee00 5a90 vmov s1, r5
 24c: eef8 0ae0 vcvt.f32.s32 s1, s1 
 250: ee60 0a80 vmul.f32 s1, s1, s0 
 254: edc1 0a02 vstr s1, [r1, #8]
 258: f930 5b02 ldrsh.w r5, [r0], #2
 25c: ee00 5a90 vmov s1, r5
 260: eef8 0ae0 vcvt.f32.s32 s1, s1 
 264: ee60 0a80 vmul.f32 s1, s1, s0 
 268: edc1 0a03 vstr s1, [r1, #12]
 26c: 3110 adds r1, #16
 26e: 1e64 subs r4, r4, #1
 270: d1d4 bne.n 21c
 ;...
 ; lots more, but you get the idea

What the hell? For some reason, the compiler decided to emit a VCVT followed by a VMUL for the divisions by 16384.f, instead of a single VCVT (see the highlighted text). So here’s what the compiler does, as best as I can understand:

Replace a single divide by 32768.f with a single VCVT, check
Replace a loop of divides by 32768.f with a loop of VCVTs, check
Replace a single divide by 16384.f with a single VCVT, check
Replace a loop of divides by 16384.f with a loop of VCVTs, no!

I’m at a loss to explain why it did that, and I am certainly going to start paying closer attention to the instructions that the compiler emits, especially in hot loops.

Postscript

For those wondering how other toolchains do, here’s the full table. A check indicates that the compiler emitted only a single VCVT per conversion.

Toolchain	Single Q1.15	Single Q2.14	Batch Q1.15	Batch Q2.14
IAR	✔	✔	✔	✘
GCC	✔	✔	✘	✘
Clang	✘	✘	✘	✘

Versions

IAR: EWARM 7.40.3
GCC: 7.3.1 20180622
Clang: 900.0.39.2

An Unexpected Cause of an Alignment Fault

09 Nov 2015

I’d like to start off this blog by describing a bug that we ran into a few weeks ago. The details are specific to our processor type, but I think the lessons are somewhat more general. The bug also showed up in an interesting and unexpected way.

The Setup

If you write software, at some point you’ve probably had to serialize and deserialize objects. Libraries like protobuf are great for handling this. But these kinds of libraries are not always available, and sometimes you have to roll your own. In our case, we hadn’t even gone that far, and just had some deserialization code that looked like this:

void deserializeFoo(uint8_t *packet, Foo *foo) {
  uint32_t arg1 = *((uint32_t*) (packet);
  uint32_t arg2 = *((uint32_t*) (packet + 4);
  float32_t arg3 = *((float32_t*) (packet + 8);
  //...
  foo->field1 = arg1;
  //...
}

OK so this isn’t great. But If you’ve spent time writing native code, you’ve probably seen this pattern, and perhaps used it yourself. I consider this an anti-pattern¹ for a couple of reasons: it depends on the endianness of the host system (which makes it hard to port), and it can lead to alignment faults on certain architectures. I won’t dwell on this too much. If you are interested, here is a good writeup. But I do want to write up an error that we saw, because it showed up in a pretty unexpected way.

First, some background (update: this company is sadly now defunct). We’re building a set of wireless earbuds (shameless plug: https://www.hereplus.me/) that let you control how you hear the world. Users can control the earbuds through an iOS app, which sends command messages to the buds. The details aren’t important to this bug, but you can think of the interface as a set of endpoints, each with an associated handler that processes the incoming byte array. These handlers call into the deserialization code, almost all of which used the above pattern.

For a while, everything worked fine. Our processor is a little-endian ARM Cortex-M4, so no issues with endianness. Alignment faults might present a problem, but the incoming packets are all 4-byte aligned, so no problems there. Then, one day, when sending commands to one of the endpoints, the processor would go into a hard fault. Luckily, this issue was very easy to reproduce, and we were quickly able to isolate the breaking change. Here it is (with some minor edits):

void deserializeFoo(uint8_t *packet, Foo *foo) {
  // Parse the id
  uint32_t id = *((uint32_t*) packet);
  // Parse the value
  float32_t value = *((float32_t*) (packet + 4));
  // Hydrate the object
  foo->id = id;
- foo->value = value;
+ // Incoming values are all scaled by 0.5
+ foo->value = 2.f * value;
}

Umm… That sure doesn’t look like it would cause an alignment fault. But sure enough, the fault shows up with that change, and goes away without it. And the fault register clearly shows an alignment fault.

What Happened

The error was really caused by a combination of two changes. In the first, one of our developers combined two of the endpoints into a single one.²

void handleFooBarPacket(uint8_t *packet) {
  switch (packet[0]) {
    case FOO: {
      Foo foo;
      deserializeFoo(packet + 1, &foo);
      //...
    } break;
    case BAR: {
      Bar bar;
      deserializeBar(packet + 1, &bar);
      //...
    } break;
  //...
  }
}

In order to combine the two endpoints, the developer changed the packet protocol to contain a type value as the first byte, and shifted the rest of the packet down a byte. What’s surprising to me is that this change didn’t break anything. I vaguely remembered reading that the Cortex-M4 supports unaligned accesses for non-floating point values. But all of the parameter parsing (including floating point accesses) results in unaligned accesses! And why did it work until one of the parameters was multiplied by 2?

At this point, we were pretty much stuck with looking at the compiler output. Here’s the old, working output for deserializeFoo (general purpose registers are named r*, while floating point registers are named s*):

deserializeFoo:
...
; r3 holds a pointer to a Foo
ldr r1, [r0]
ldr r2, [r0, #4]
str r1, [r3]
str r2, [r3, #4]
...
bx lr

And the broken output:

deserializeFoo:
...
; r3 holds a pointer to a Foo
ldr r1, [r0]
vldr s0, [r0, #4]
vmul s0, s0, s1 ; s1 is a register with the constant 2 in in
str r1, [r3]
vstr s0, [r3, #4]
...
bx lr

The change from LDR (used to load a value into a general purpose register) to VLDR (used to load a value into a floating point register) is pretty suggestive. We stepped through the code in a debugger, and the fault definitely occurred on the VLDR. The issue is actually pretty subtle. The Cortex-M4 chips (can) support unaligned access for the LDR instruction, but not for the VLDR instruction.³ When no intermediate computations were applied to the variable, the compiler figured that it could just generate code to read the (floating point) value into a general purpose register, and then write it out to memory. It only generated a load into a floating point register once it needed to double the value before writing it out. In retrospect, this shouldn’t have surprised me. There’s nothing there that would require a load into a floating point register. I guess I was wired to think that any operation involving a float occurred in a floating point register.

Lessons

I drew a few lessons here. First and most prosaically, grab a library for serialization. If you can’t due to code size constraints, roll your own quick one whose correctness you can easily reason about (for example, that reads one byte at a time and assembles larger primitives using bitshifts). Unless you have a good reason to, don’t rely on fragile constructs like:

*((T*) byte_array)

Another is that the ultimate and proximate causes of errors can be very far apart. In the end, the culprit was the alignment fault that we expected, but the compiler had masked it for a while by avoiding unnecessary floating point loads. This bug could have remained latent in our codebase forever. And last, types are a language-level construct. Just because a value is declared float doesn’t mean that it will be stored in a floating point register.

Footnotes

Thinking a bit more, I can imagine performance sensitive applications (for example, fast compressors / decompressors) where we would want to avoid byte-by-byte access, but let’s ignore those for now.
Endpoints (characteristics, in Bluetooth-speak) are pretty expensive to add, so consolidating them freed up some Bluetooth resources that we needed.
If you’re interested, this is covered in section A3.2.1 of the ARMv7 architecture reference manual.

Older Newer

Pardon the Interrupt Matthew Sills