Created attachment 49163
preprocessed file triggering the bug

Hi,

When compiling the following code:

/********************************************************************/
typedef struct {
  double m_a;
  double m_b;
  double m_c;
  double m_d;
} AtLeast32BytesObject;

AtLeast32BytesObject __attribute__((noinline)) CalledFunction() {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}

void __attribute__((noinline)) _start() {
  volatile AtLeast32BytesObject result = CalledFunction();
  while(1) {}
}
/********************************************************************/

with "arm-none-eabi-gcc -Os -flto -mthumb -mfloat-abi=hard -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc", the assembly instructions emitted for the symbol "CalledFunction" use the callee-saved registers r4-r7 to hold the result of CalledFunction (see addresses 0x0000805e-0x0000806e in the disassembly below).
Because r4-r7 are callee-saved, they are restored when leaving the subroutine, which overwrites them and corrupts the result of "CalledFunction" (see address 0x00008072 in the disassembly below).

Dump of assembler code for function CalledFunction:
   0x00008000 <+0>:	push	{r4, r5, r6, r7, lr}
   0x00008002 <+2>:	ldr	r5, [pc, #112]	; (0x8074 <CalledFunction+116>)
   0x00008004 <+4>:	ldmia	r5!, {r0, r1, r2, r3}
   0x00008006 <+6>:	sub	sp, #132	; 0x84
   0x00008008 <+8>:	add	r4, sp, #64	; 0x40
   0x0000800a <+10>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000800c <+12>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008010 <+16>:	add	r5, sp, #64	; 0x40
   0x00008012 <+18>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008016 <+22>:	ldmia	r5!, {r0, r1, r2, r3}
   0x00008018 <+24>:	add	r4, sp, #96	; 0x60
   0x0000801a <+26>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000801c <+28>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008020 <+32>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008024 <+36>:	ldr	r3, [sp, #96]	; 0x60
   0x00008026 <+38>:	str	r3, [sp, #0]
   0x00008028 <+40>:	ldr	r3, [sp, #100]	; 0x64
   0x0000802a <+42>:	str	r3, [sp, #4]
   0x0000802c <+44>:	ldr	r3, [sp, #104]	; 0x68
   0x0000802e <+46>:	str	r3, [sp, #8]
   0x00008030 <+48>:	ldr	r3, [sp, #108]	; 0x6c
   0x00008032 <+50>:	str	r3, [sp, #12]
   0x00008034 <+52>:	ldr	r3, [sp, #112]	; 0x70
   0x00008036 <+54>:	str	r3, [sp, #16]
   0x00008038 <+56>:	ldr	r3, [sp, #116]	; 0x74
   0x0000803a <+58>:	ldr	r7, [sp, #124]	; 0x7c
   0x0000803c <+60>:	str	r3, [sp, #20]
   0x0000803e <+62>:	ldr	r3, [sp, #120]	; 0x78
   0x00008040 <+64>:	strd	r3, r7, [sp, #24]
   0x00008044 <+68>:	ldr	r3, [sp, #0]
   0x00008046 <+70>:	str	r3, [sp, #32]
   0x00008048 <+72>:	ldr	r3, [sp, #4]
   0x0000804a <+74>:	str	r3, [sp, #36]	; 0x24
   0x0000804c <+76>:	ldr	r3, [sp, #8]
   0x0000804e <+78>:	str	r3, [sp, #40]	; 0x28
   0x00008050 <+80>:	ldr	r3, [sp, #12]
   0x00008052 <+82>:	str	r3, [sp, #44]	; 0x2c
   0x00008054 <+84>:	ldr	r3, [sp, #16]
   0x00008056 <+86>:	str	r3, [sp, #48]	; 0x30
   0x00008058 <+88>:	ldr	r3, [sp, #20]
   0x0000805a <+90>:	str	r3, [sp, #52]	; 0x34
   0x0000805c <+92>:	ldr	r3, [sp, #24]
   0x0000805e <+94>:	strd	r3, r7, [sp, #56]	; 0x38   // HERE, we store
   0x00008062 <+98>:	ldrd	r0, r1, [sp, #32]	//  the result
   0x00008066 <+102>:	ldrd	r2, r3, [sp, #40]	; 0x28   //  in r0-r7
   0x0000806a <+106>:	ldrd	r4, r5, [sp, #48]	; 0x30   //
   0x0000806e <+110>:	ldr	r6, [sp, #56]	; 0x38   //
   0x00008070 <+112>:	add	sp, #132	; 0x84
   0x00008072 <+114>:	pop	{r4, r5, r6, r7, pc}	// HERE, we overwrite r4-r7
   0x00008074 <+116>:	strh	r0, [r5, #4]
   0x00008076 <+118>:	movs	r0, r0
End of assembler dump.

I attach to this report the "main.i" containing the previous preprocessed code.

The toolchain version is arm-none-eabi-gcc (GNU Arm Embedded Toolchain 9-2020-q2-update) 9.3.1 20200408 (release). It was from the binary package gcc-arm-none-eabi-9-2020-q2-update-mac.pkg downloaded from https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads.

The host machine is a MacBook Pro with Catalina version 10.15.4 (19E287).

The command lines I used are:
arm-none-eabi-gcc main.c -Os -flto -mthumb -mfloat-abi=hard -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc -save-temps -o a.elf
arm-none-eabi-gdb -batch -ex 'file a.elf' -ex 'disassemble CalledFunction'

Thanks for your help,
Émilie
We need to see the configuration information. What is the output of "gcc -v" for your compiler?
Here they are:

arm-none-eabi-gcc -v
Using built-in specs.
COLLECT_GCC=/Applications/ARM/bin/arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/Applications/ARM/bin/../lib/gcc/arm-none-eabi/9.3.1/lto-wrapper
Target: arm-none-eabi
Configured with: /tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/src/gcc/configure --target=arm-none-eabi --prefix=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native --libexecdir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/lib --infodir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/info --mandir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/man --htmldir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/html --pdfdir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/pdf --enable-languages=c,c++ --enable-plugins --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp --disable-libstdcxx-pch --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-newlib --with-headers=yes --with-python-dir=share/gcc-arm-none-eabi --with-sysroot=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/arm-none-eabi --build=x86_64-apple-darwin10 --host=x86_64-apple-darwin10 --with-gmp=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-mpfr=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-mpc=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-isl=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-libelf=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-host-libstdcxx='-static-libgcc -Wl,-lstdc++ -lm' --with-pkgversion='GNU Arm Embedded Toolchain 9-2020-q2-update' --with-multilib-list=rmprofile,aprofile
Thread model: single
gcc version 9.3.1 20200408 (release) (GNU Arm Embedded Toolchain 9-2020-q2-update)
LTO seems to be getting confused as to the ABI. Investigating...

In the meantime, the only work-around I can think of is to remove -flto from your build.
typedef struct {
  double m_a;
  double m_b;
  double m_c;
  double m_d;
} AtLeast32BytesObject;

static AtLeast32BytesObject __attribute__((noinline,noclone)) CalledFunction() {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}

void __attribute__((noinline)) _start() {
  volatile AtLeast32BytesObject result = CalledFunction();
  while(1) {}
}

Will miscompile without needing LTO.
When compiling without LTO using the command:

arm-none-eabi-gcc main.c -Os -mfloat-abi=hard -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc -save-temps -o a.elf

I get the following instructions for CalledFunction:

Dump of assembler code for function CalledFunction:
   0x00008000 <+0>:	push	{r4, r5, lr}
   0x00008002 <+2>:	ldr	r5, [pc, #52]	; (0x8038 <CalledFunction+56>)
   0x00008004 <+4>:	ldmia	r5!, {r0, r1, r2, r3}
   0x00008006 <+6>:	sub	sp, #100	; 0x64
   0x00008008 <+8>:	add	r4, sp, #32
   0x0000800a <+10>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000800c <+12>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008010 <+16>:	add	r5, sp, #32
   0x00008012 <+18>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008016 <+22>:	ldmia	r5!, {r0, r1, r2, r3}
   0x00008018 <+24>:	add	r4, sp, #64	; 0x40
   0x0000801a <+26>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000801c <+28>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008020 <+32>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008024 <+36>:	vldr	d0, [sp, #64]	; 0x40
   0x00008028 <+40>:	vldr	d1, [sp, #72]	; 0x48
   0x0000802c <+44>:	vldr	d2, [sp, #80]	; 0x50
   0x00008030 <+48>:	vldr	d3, [sp, #88]	; 0x58
   0x00008034 <+52>:	add	sp, #100	; 0x64
   0x00008036 <+54>:	pop	{r4, r5, pc}
   0x00008038 <+56>:	strh	r0, [r3, #2]
   0x0000803a <+58>:	movs	r0, r0
End of assembler dump.

This seems correct to me: the result is returned through registers d0-d3.

Interestingly, if I keep LTO but remove the -mfloat-abi=hard option:

arm-none-eabi-gcc main.c -Os -flto -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc -save-temps -o a.elf

The compilation also seems correct: the result is written at the address given by r0 and the address is returned through r0.

Dump of assembler code for function CalledFunction:
   0x00008000 <+0>:	push	{r4, r5, r6, lr}
   0x00008002 <+2>:	ldr	r5, [pc, #20]	; (0x8018 <CalledFunction+24>)
   0x00008004 <+4>:	mov	r6, r0
   0x00008006 <+6>:	mov	r4, r0
   0x00008008 <+8>:	ldmia	r5!, {r0, r1, r2, r3}
   0x0000800a <+10>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000800c <+12>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008010 <+16>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008014 <+20>:	mov	r0, r6
   0x00008016 <+22>:	pop	{r4, r5, r6, pc}
   0x00008018 <+24>:	strh	r0, [r5, #0]
   0x0000801a <+26>:	movs	r0, r0
End of assembler dump.
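For readers less used to reading the dump, the memory-return convention it shows can be pictured in plain C. This is only a sketch of the idea: `CalledFunction_lowered` and `caller_lowered` are hypothetical names made up for illustration, not symbols the compiler produces, and the explicit pointer parameter stands in for the result address that the caller passes in r0.

```c
typedef struct {
    double m_a;
    double m_b;
    double m_c;
    double m_d;
} AtLeast32BytesObject;

/* Hypothetical hand-written equivalent of "return in memory":
 * the caller supplies the address of the result (r0 on ARM),
 * the callee fills it in and hands the same address back. */
AtLeast32BytesObject *CalledFunction_lowered(AtLeast32BytesObject *out)
{
    out->m_a = 1.1;
    out->m_b = 2.2;
    out->m_c = 3.3;
    out->m_d = 4.4;
    return out;
}

/* Hypothetical caller: it reserves space for the result itself and
 * passes its address, rather than expecting the value in registers. */
double caller_lowered(void)
{
    AtLeast32BytesObject result;
    AtLeast32BytesObject *returned = CalledFunction_lowered(&result);
    return returned->m_a + returned->m_b + returned->m_c + returned->m_d;
}
```

Caller and callee only work together if both agree on this shape; if one side expects the value in registers while the other writes it to memory, the caller reads garbage.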
Yes, the problem is related to returning values in memory and the ABI variants we have. If we have hardware floating-point we generally use registers to return values; if we don't, then we have to return in memory.

However, when we have a function that is not inlinable, but is private to the compilation unit, we can optimize the ABI in some circumstances. That's what is happening here. Unfortunately, it appears that the function that decides whether the result should be returned in memory or in registers lacks important information as to whether or not the function is private, and this in turn leads to two parts of the compiler making different choices - with the disastrous consequences you've discovered.

I'm not sure if this is restricted to M-profile parts or if it's more widespread - I'm still investigating.
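To make the register-versus-memory distinction concrete, here is a small sketch of how different struct shapes are classified, as I read the AAPCS rules; it is not taken from the GCC sources. The type and function names are hypothetical, and the comments describe the expected convention on ARM, which this host-portable code cannot itself observe.

```c
/* Four members of one floating-point type: a homogeneous aggregate.
 * VFP variant (-mfloat-abi=hard): candidate for return in d0-d3.
 * Base variant: larger than 4 bytes, so returned through memory. */
typedef struct { double a, b, c, d; } FourDoubles;

/* Five members exceed the four-element limit for homogeneous
 * aggregates, so this goes through memory under both variants. */
typedef struct { double a, b, c, d, e; } FiveDoubles;

/* Mixed member types are not homogeneous: memory under both variants. */
typedef struct { double a; int b; } Mixed;

FourDoubles make_four(void)
{
    FourDoubles v = {1.1, 2.2, 3.3, 4.4};
    return v;
}

FiveDoubles make_five(void)
{
    FiveDoubles v = {1.1, 2.2, 3.3, 4.4, 5.5};
    return v;
}

Mixed make_mixed(void)
{
    Mixed v = {1.1, 2};
    return v;
}

/* Source-level behaviour must be the same whichever convention is
 * chosen; returns 1 when all three results arrive intact. */
int results_intact(void)
{
    FourDoubles f4 = make_four();
    FiveDoubles f5 = make_five();
    Mixed m = make_mixed();
    return f4.a == 1.1 && f4.d == 4.4
        && f5.a == 1.1 && f5.e == 5.5
        && m.a == 1.1 && m.b == 2;
}
```

The struct in this report has the `FourDoubles` shape, which is the one case above where the two ABI variants disagree on where the result lives.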
Hello, Any news on the subject? Would you advise in the meantime to discard the LTO (with the -fno-lto option) on the compilation unit containing the failing code? The bug occurred for us when returning a structure of four doubles. Do you have any indication of when the bug might appear to help us track other occurrences? Thanks for helping!
(In reply to emilie.feral from comment #7)
> Hello,
> Any news on the subject?
> Would you advise in the meantime to discard the LTO (with the -fno-lto
> option) on the compilation unit containing the failing code?
> The bug occurred for us when returning a structure of four doubles. Do you
> have any indication of when the bug might appear to help us track other
> occurrences?
> Thanks for helping!

Sorry, I haven't had time to work on this yet.

The safest work-around for now is to add an additional attribute to force the PCS to the default for the selected ABI - I think adding pcs("aapcs-vfp") to the attributes will solve the problem.

i.e.

AtLeast32BytesObject __attribute__((noinline, pcs("aapcs-vfp"))) CalledFunction() {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}
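If the same source also has to build for soft-float or non-ARM targets, the attribute can be hidden behind a macro. The following is a minimal sketch under that assumption: `FORCE_DEFAULT_PCS` and `count_corrupted_fields` are hypothetical names introduced here, and the guard relies on `__ARM_PCS_VFP`, which GCC predefines when the hard-float calling convention is in effect. It has not been verified against the affected toolchain.

```c
/* Apply pcs("aapcs-vfp") only where it matches the selected ABI;
 * on every other target the macro expands to nothing. */
#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_PCS_VFP)
#define FORCE_DEFAULT_PCS __attribute__((pcs("aapcs-vfp")))
#else
#define FORCE_DEFAULT_PCS
#endif

typedef struct {
    double m_a;
    double m_b;
    double m_c;
    double m_d;
} AtLeast32BytesObject;

static AtLeast32BytesObject __attribute__((noinline)) FORCE_DEFAULT_PCS
CalledFunction(void)
{
    AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
    return result;
}

/* Hypothetical self-check: counts the fields that came back with an
 * unexpected value, so 0 means the result survived the call. */
int count_corrupted_fields(void)
{
    volatile AtLeast32BytesObject result = CalledFunction();
    int bad = 0;
    if (result.m_a != 1.1) bad++;
    if (result.m_b != 2.2) bad++;
    if (result.m_c != 3.3) bad++;
    if (result.m_d != 4.4) bad++;
    return bad;
}
```

Running a check like `count_corrupted_fields()` on the target, with and without the attribute, is one way to see whether a given function is affected.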
Is there any update on this? I need to turn on LTO to keep the code size of a large application within the flash memory space of the target ARM Cortex M4F processor; but by the sound of it, doing so will be unsafe.
The master branch has been updated by Richard Earnshaw <rearnsha@gcc.gnu.org>:

https://gcc.gnu.org/g:1dca4ca1bf2f1b05537a1052e373d8b0ff11e53c

commit r12-7894-g1dca4ca1bf2f1b05537a1052e373d8b0ff11e53c
Author: Richard Earnshaw <rearnsha@arm.com>
Date:   Tue Mar 29 16:59:37 2022 +0100

    arm: temporarily disable 'local' pcs selection (PR96882)

    The arm port has an optimization used during selection of the
    function's ABI to permit deviation from the strict ABI when the
    function does not escape the current translation unit.

    Unfortunately, the ABI selection it makes can be unsafe if it changes
    how a result is returned because not enough information is available
    via the RETURN_IN_MEMORY hook to determine where the function gets
    used.  This can result in some parts of the compiler thinking a value
    is returned in memory while others think it is returned in registers.

    To mitigate this, this patch temporarily disables the optimization and
    falls back to using the default ABI for the translation.

    gcc/ChangeLog:

            PR target/96882
            * config/arm/arm.cc (arm_get_pcs_model): Disable selection of
            ARM_PCS_AAPCS_LOCAL.
As the master branch was updated a year ago according to comment 10, does this mean that there is now a stable release of GCC that includes the patch?