link garbage collection
Link-time garbage collection can be enabled with:
CFLAGS+=-fdata-sections -fno-common
CFLAGS+=-ffunction-sections
LDOPTS+=-Wl,--gc-sections -Wl,--print-gc-sections -Wl,--entry=entry
The compiler (CFLAGS) options put each function and each data object in its own section. The linker (LDOPTS) options make the linker keep all code/data reachable from the entry function and garbage-collect everything else; --print-gc-sections lists the sections that were removed.
These options can be a win on a large project, but they add some overhead to the generated code.
In normal mode, gcc puts all the code of a file in one section, and likewise for its data. The linker can't move things around inside a section, so gcc can rely on their relative placement.
With one section per symbol, gcc no longer knows how the linker will lay the sections out, and that can cost extra instructions.
fdata-sections overhead
For example, it increases code size when accessing global data:
int bar;
int titi;
int tata=1;
int foo=2;

bar and titi are in bss, tata and foo are in data.

int toto(void)
{
    return foo+tata+titi+bar;
}
arm-none-eabi-gcc -Os -c
00000000 <toto>:
   0: e59f3020 ldr r3, [pc, #32] ; 28 <toto+0x28>
   4: e8930005 ldm r3, {r0, r2}
   8: e0800002 add r0, r0, r2
   c: e59f3018 ldr r3, [pc, #24] ; 2c <toto+0x2c>
  10: e5933000 ldr r3, [r3]
  14: e0800003 add r0, r0, r3
  18: e59f3010 ldr r3, [pc, #16] ; 30 <toto+0x30>
  1c: e5933000 ldr r3, [r3]
  20: e0800003 add r0, r0, r3
  24: e12fff1e bx lr
  28:
  2c:
  30:
arm-none-eabi-gcc -Os -fno-common -c
00000000 <toto>:
   0: e59f3018 ldr r3, [pc, #24] ; 20 <toto+0x20>
   4: e8930005 ldm r3, {r0, r2}
   8: e0800002 add r0, r0, r2
   c: e59f3010 ldr r3, [pc, #16] ; 24 <toto+0x24>
  10: e893000c ldm r3, {r2, r3}
  14: e0800002 add r0, r0, r2
  18: e0800003 add r0, r0, r3
  1c: e12fff1e bx lr
  20:
  24:
arm-none-eabi-gcc -Os -fno-common -fdata-sections -c
00000000 <toto>:
   0: e59f3028 ldr r3, [pc, #40] ; 30 <toto+0x30>
   4: e5930000 ldr r0, [r3]
   8: e59f3024 ldr r3, [pc, #36] ; 34 <toto+0x34>
   c: e5933000 ldr r3, [r3]
  10: e0800003 add r0, r0, r3
  14: e59f301c ldr r3, [pc, #28] ; 38 <toto+0x38>
  18: e5933000 ldr r3, [r3]
  1c: e0800003 add r0, r0, r3
  20: e59f3014 ldr r3, [pc, #20] ; 3c <toto+0x3c>
  24: e5933000 ldr r3, [r3]
  28: e0800003 add r0, r0, r3
  2c: e12fff1e bx lr
  30:
  34:
  38:
  3c:
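Note that -fno-common helps gcc generate better code for bss data: in the second listing the function takes 40 bytes including its literal pool, against 52 bytes without the option and 64 bytes once -fdata-sections is added.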
optimisation
- Two-pass build: detect the unused stuff in a first pass, then build an optimised version.
- Have the linker patch the generated code?
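As a rough starting point for the first pass of such a two-pass build, the sections reported by -Wl,--print-gc-sections can be collected from the link log. The sketch below assumes GNU ld prints lines of the form "removing unused section '.text.foo' in file 'bar.o'" (capitalised as "Removing" in older binutils) on stderr. The names list_gc_sections and link.log are hypothetical.

```sh
# list_gc_sections: read a link log on stdin and print one "file section"
# pair per line for every section the linker reported as garbage collected.
# Assumption: GNU ld message format
#   removing unused section '<section>' in file '<file>'
list_gc_sections() {
    sed -n "s/.*[Rr]emoving unused section '\([^']*\)' in file '\([^']*\)'.*/\2 \1/p" | sort
}

# Hypothetical usage: capture the linker messages (stderr) during the link,
# then summarise them per object file.
#   make 2> link.log
#   list_gc_sections < link.log
```

The resulting list only says what the linker dropped; deciding how to rebuild without those symbols is left open, as in the item above.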
ffunction-sections overhead
Gcc sometimes needs to use a trampoline.
For example, on armv4t there is no blx instruction. CodeSourcery arm-2011.03 (ELF target) generates code like:
000c7848 <conf_load_defaults>:
   c7848: b538      push {r3, r4, r5, lr}
   […]
   c7870: f000 f812 bl c7898 <memcpy_from_thumb>
   […]
   c7888: bc01      pop {r0}
   c788a: 4700      bx r0

With -ffunction-sections, there are lots of memcpy_from_thumb copies in different sections, and the linker doesn't merge them.

000c7898 <memcpy_from_thumb>:
   c7898: 4778      bx pc
   c789a: 46c0      nop ; (mov r8, r8)
   c789c: eaff630f  b a04e0 <memcpy>
In fact, gcc generates

[…]
   6: f7ff fffe bl 0 <memcpy>
[…]

and the linker patches the call to go through memcpy_from_thumb!
Note: there were lots of memcpy_from_thumb copies when we didn't merge .text* in the linker script.
armv5t
Using armv5t, we get:
000c538c <conf_load_defaults>:
   c538c: b538      push {r3, r4, r5, lr}
   […]
   c53b4: f7da eea4 blx a0100 <memcpy>
   […]
   c53c8: bd38      pop {r3, r4, r5, pc}
other optimisation
build one big source file
Make everything static by default:
- -fwhole-program
Aggregate all source files into one:
- -combine
This eats lots of memory.
LTO
Extra notes
script to compare code
To compare the function sizes of two binaries, we can use:
readelf -W -s prog1.elf | grep FUNC | sort -k8 | sort -n -s -k 3,3 | awk '{ print $3" "$8 }' > dump1
readelf -W -s prog2.elf | grep FUNC | sort -k8 | sort -n -s -k 3,3 | awk '{ print $3" "$8 }' > dump2
diff -u dump1 dump2
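The same pipeline can be wrapped in shell functions, which also makes it possible to print per-function size deltas instead of a raw diff. This is only a sketch: func_sizes and size_delta are hypothetical names, and it assumes the usual readelf -W -s column layout (size in column 3, type in column 4, name in column 8). Unlike grep FUNC, it matches the type column exactly.

```sh
# func_sizes: read `readelf -W -s` output on stdin and print "size name"
# for each FUNC symbol, sorted by size, then by name for equal sizes.
func_sizes() {
    awk '$4 == "FUNC" { print $3 " " $8 }' | sort -k2,2 | sort -n -s -k1,1
}

# size_delta OLD NEW: compare two "size name" dumps and print
# "delta old new name" for every function whose size differs.
# A function missing from one dump is counted with size 0 there.
size_delta() {
    { sed 's/^/old /' "$1"; sed 's/^/new /' "$2"; } | awk '
        $1 == "old" { old[$3] = $2; seen[$3] = 1 }
        $1 == "new" { new[$3] = $2; seen[$3] = 1 }
        END {
            for (n in seen) {
                o = (n in old) ? old[n] : 0
                w = (n in new) ? new[n] : 0
                if (o != w) printf "%d %d %d %s\n", w - o, o, w, n
            }
        }
    ' | sort -n
}

# Usage with the binaries from the commands above:
#   readelf -W -s prog1.elf | func_sizes > dump1
#   readelf -W -s prog2.elf | func_sizes > dump2
#   size_delta dump1 dump2
```

The output is sorted by delta, so the functions that shrank the most come first and those that grew the most come last.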
Thumb interworking
http://wiki.debian.org/ArmEabiPort#Choice_of_minimum_CPU
Instructions safe for interworking:
- mov pc,lr : from armv7 onwards
- bx lr : from armv4t onwards
- ldm/ldr : from armv5t onwards
- blx : from armv5t onwards
It is a shame that ARM did not add Thumb support from the start for normal branch operations.