link garbage collection
link gc can be activated with :
CFLAGS+=-fdata-sections -fno-common CFLAGS+=-ffunction-sections LDOPTS+=-Wl,--gc-sections -Wl,--print-gc-sections -Wl,--entry=entry
The compiler (CFLAGS) options will make each function it is own section. The linker (LDOPTS) option will make the linker, include all code/date used by the entry function, and garbage collect all other code.
These option can be a win on large project, but this imply overhead in code.
In the normal mode, gcc put all code/data of a file in one section. And in this section stuff can't be moved by the linker.
But now gcc don't know how the linker will organise section and can cause overhead.
fdata-sections overhead
For example it will incread code size when accessing global data :
int bar; int titi; int tata=1; int foo=2;bar and titi are in bss tata and foo in dataint toto(void) { return foo+tata+titi+bar; }
arm-none-eabi-gcc -Os -c
00000000 <toto>: 0: e59f3020 ldr r3, [pc, #32] ; 28 <toto+0x28> 4: e8930005 ldm r3, {r0, r2} 8: e0800002 add r0, r0, r2 c: e59f3018 ldr r3, [pc, #24] ; 2c <toto+0x2c> 10: e5933000 ldr r3, [r3] 14: e0800003 add r0, r0, r3 18: e59f3010 ldr r3, [pc, #16] ; 30 <toto+0x30> 1c: e5933000 ldr r3, [r3] 20: e0800003 add r0, r0, r3 24: e12fff1e bx lr 28: 2c: 30:
arm-none-eabi-gcc -Os -fno-common -c
00000000 <toto>: 0: e59f3018 ldr r3, [pc, #24] ; 20 <toto+0x20> 4: e8930005 ldm r3, {r0, r2} 8: e0800002 add r0, r0, r2 c: e59f3010 ldr r3, [pc, #16] ; 24 <toto+0x24> 10: e893000c ldm r3, {r2, r3} 14: e0800002 add r0, r0, r2 18: e0800003 add r0, r0, r3 1c: e12fff1e bx lr 20: 24:
arm-none-eabi-gcc -Os -fno-common -fdata-sections -c
00000000 <toto>: 0: e59f3028 ldr r3, [pc, #40] ; 30 <toto+0x30> 4: e5930000 ldr r0, [r3] 8: e59f3024 ldr r3, [pc, #36] ; 34 <toto+0x34> c: e5933000 ldr r3, [r3] 10: e0800003 add r0, r0, r3 14: e59f301c ldr r3, [pc, #28] ; 38 <toto+0x38> 18: e5933000 ldr r3, [r3] 1c: e0800003 add r0, r0, r3 20: e59f3014 ldr r3, [pc, #20] ; 3c <toto+0x3c> 24: e5933000 ldr r3, [r3] 28: e0800003 add r0, r0, r3 2c: e12fff1e bx lr 30: 34: 38: 3c:
Note that -fno-common can help to generate better code with bss data.
optimisation
- 2 pass build : detect unused stuff and build and optimised version.
- linker to patch the generated code ?
ffunction-sections overhead
Gcc sometimes need to use trampoline.
For example on armv4t, there is not blx instruction. codesourcery arm-2011.03 (elf target) generate code like :
000c7848 <conf_load_defaults>: c7848: b538 push {r3, r4, r5, lr} […] c7870: f000 f812 bl c7898 <memcpy_from_thumb> […] c7888: bc01 pop {r0} c788a: 4700 bx r0and with ffunction-sections, there is lot's of memcpy_from_thumb in different section and the linker doesn't merge them.000c7898 <memcpy_from_thumb>: c7898: 4778 bx pc c789a: 46c0 nop ; (mov r8, r8) c789c: eaff630f b a04e0 <memcpy>
In fact gcc generate
[…] 6: f7ff fffe bl 0 <memcpy> […]and the linker patch the code !!!
Note : there was lot's of memcpy_from_thumb if we din't merge .text* in the linker script.
armv5t
using armv5t, we got
000c538c <conf_load_defaults>: c538c: b538 push {r3, r4, r5, lr} […] c53b4: f7da eea4 blx a0100 <memcpy> […] c53c8: bd38 pop {r3, r4, r5, pc}
other optimisation
build one big source file
make static the default stuff :
- -fwhole-program
agregate all source file in one :
- -combine
Eat lot's of memory
LTO
Extra notes
script to compare code
For comparing function size of 2 binaries, we can use
readelf -W -s prog1.elf | grep FUNC | sort -k8 | sort -n -s -k 3,3 | awk '{ print $3" "$8 }' > dump1 readelf -W -s prog2.elf | grep FUNC | sort -k8 | sort -n -s -k 3,3 | awk '{ print $3" "$8 }' > dump2 diff -u dump1 dump2
Thumb interworking
http://wiki.debian.org/ArmEabiPort#Choice_of_minimum_CPU
Instruction safe for interworking :
- mov pc,lr : starting armv7
- bx lr : starting armv4t
- ldm/ldr : starting armv5t
- blx : starting armv5t
This is a shame that arm did add thumb support from the start for normal branch operation
Installation of debian on hp elitebook
Partition
The pc got 4 primary partition : 300M Win7 boot 220G Win7 C: 40G HP_RESCUE 2G HP_TOOLS for efi stuff : http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01564727/c01564727.pdf?jumpid=reg_R1002_USEN
I want to keep all the partition and install linux, for that :
- we need to destroy one primary partition to create an extended one
- we choose HP_TOOLS, that could be in extended one
- we skrink Win7 C: (using windows tools), but because of an unmovable file we can get only 110G free
- we don't care getting 2G of delete HP_TOOLS
So we got :
- 300M Win7 boot
- 110G Win7 C:
- extended
- 106G linux
- 2G swap
- 2G HP_TOOLS
- 40G HP_RESCUE
- 2G free
Debian installer (testing) : encryption
I tried to encrypt my system, but this is a mess :
- the installer want the swap encrypted
- in order to support suspend to disk, we can't set random to the key and we need special setup
- in case of error, all the install process need to be restarted
In the end, I created a /, /home and swap partition without encryption. Encryption will be done later.
bootloader
I don't want that grub mess with mbr.
I told grub to install in the / partition (the install help is wrong : it tells that sda2 is (hd0,2) but it is (hd0,1)
As recommended I used easyBCD : http://doc.ubuntu-fr.org/tutoriel/comment_amorcer_ubuntu_avec_bootmgr
But it is buggy : it removed the Win7 entry (after some manipulation, I had to restore it by hand).
NOTE : this tools gpl software (grub4dos), without giving credit/source.
NOTE2 : does grub update work ? Does easyBCD copy some part of the bootloader on windows partition ?
Alternative
- make the grub partition a bootable one, and hope mbr will use it
- the syslinux mbr can be used if the default one doesn't work
- this will make all the chain mbr + grub opensource and don't depend of windows partition
But does windows can be on a partition that it is not bootable ?
- use a mbr with hardcoded partition
touchpad
After booting on Linux the touchpad doesn't work with xorg (work with gpm). There is detection failure.
encryption
Finally I used the hp "DriveLock" feature.
What is it ?
It use the ATA-3 password feature (in high mode) : ftp://ftp.compaq.com/pub/supportinformation/papers/na118a0598.pdf
With that specification, a password need to be entered to unlock the drive when it is power-on.
So without the password, the disk should be not usable. But some of them got backdoors or you can dismantle the disk and read the magnetic cylinder.
But this should be enough to prevent robber to access my data.
Note : HP cache the password to not enter it again on suspend to ram.
futher optimisation
We saw than if B = 1 << b, then
- A * B == A << b
- A / B == A >> b
- A % B == A & (B - 1) == A & ((1U << b) - 1)
But there are interesting property if we have 2 power of 2.
- B1 * B2 == (1 << b1) * (1 << b2) == 1 << (b1 + b2)
- B1 / B2
- if B1 >= B2, (1 << b1) / (1 << b2) == 1 << (b1 - b2)
- if B1 < B2, 0
- A / (B1 / B2) == A / (1 << (b1 - b2)) == A >> (b1 - b2) because (B1 / B2) can't be null in C
- A * (B1 / B2)
- if b1 - b2 >= 0, A * (1 << (b1 - b2)) == A << (b1 - b2)
- if b1 - b2 < 0, 0
This means macro is not enough, but compiler isn't often clever to detect this. To have efficient code, better feed compiler with precomputed stuff.
int divu3(uint a, uint b) { return a / ((1U<<b) / 4); }int divu300(uint a, uint b) { return a / (1<<(b-2)); }
divu3: stmfd sp!, {r3, lr} mov r3, #1 mov r1, r3, asl r1 mov r1, r1, lsr #2 bl {aeabi_uidiv ldmfd sp!, {r3, pc} divu300: sub r1, r1, #2 mov r0, r0, lsr r1 mov pc, lr
PS : arm compiler is not able to optimize A / B and A * B …
FATFS
A tiny fatfs implementation can be found at http://elm-chan.org/fsw/ff/00index_e.html and petit implementation.
After looking at the code, it seems there is a problem in the code : it assumes fat sector size should match hardware sector size.
This is not true, fat sector size only need to be >= hardware sector size
Also the power of 2 trick can be used.
Note the guy got interresting SPI mmc stuff http://elm-chan.org/docs/mmc/mmc_e.html
Theory
Suppose we have 2 unsigned integer A and B. B is a power of 2 ie there is a number b such as B = 1U << b
It can be checked with the following code
static inline ispowerof2(int b) { return b & !(b & (b - 1)); }
b can be computed by ilog2 function
int ilog2(unsigned int v) { int i = -1; while(v) { i++; v >>= 1; } return v; }
or from http://en.wikipedia.org/wiki/Binary logarithm#Integer
/ * Returns the floor form of binary logarithm for a 32 bit integer. * −1 is returned if ''n'' is 0. */ int floorLog2(unsigned int n) { if (n == 0) return -1;int pos = 0; if (n >= 1<<16) { n >>= 16; pos += 16; } if (n >= 1<< 8) { n >>= 8; pos += 8; } if (n >= 1<< 4) { n >>= 4; pos += 4; } if (n >= 1<< 2) { n >>= 2; pos += 2; } if (n >= 1<< 1) { pos += 1; } return pos; }
In this case we have :
- A * B == A << b
- A / B == A >> b
- A % B == A & (B - 1) == A & ((1U << b) - 1)
Usage in C
Either we can define your own function for * / % or we can use compiler optimisation. If we replace B by 1 << b (with a macro), then
int divu(uint a, uint b) { return a / (1U << b); }int modu(uint a, uint b) { return a % (1U << b); }
int mulu(uint a, uint b) { return a * (1U << b); }
will give you in arm assembly (arm-linux-gnueabi-gcc -Os -S )
divu: mov r0, r0, lsr r1 mov pc, lr modu: mvn r3, #0 bic r0, r0, r3, asl r1 mov pc, lr mulu: mov r0, r0, asl r1 mov pc, lr
That's pretty fast without ugly operator redefinition !!!
But add and sub are slower [1], because we need to compute the shift.
Note a gcc missoptimisation (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48812)
int divu3(uint a, uint b) { return a / ((1U<<b) / 4); }
divu3: stmfd sp!, {r3, lr} mov r3, #1 mov r1, r3, asl r1 mov r1, r1, lsr #2 bl {aeabi_uidiv ldmfd sp!, {r3, pc}
This is unfortunate, the compiler could have generated :
divu3: mov r3, #1 mov r1, r3, asl r1 mov r1, r1, lsr #2 mov r0, r0, lsr r1
[1]
int addu(uint a, uint b) { return a + (1U << b); }
addu: mov r3, #1 add r0, r0, r3, asl r1 mov pc, lr
This blog is powered by ikiwiki.