Matthieu Castet garbage collection

link garbage collection

link gc can be activated with :

CFLAGS+=-fdata-sections -fno-common
CFLAGS+=-ffunction-sections
LDOPTS+=-Wl,--gc-sections -Wl,--print-gc-sections -Wl,--entry=entry

The compiler (CFLAGS) options will make each function it is own section. The linker (LDOPTS) option will make the linker, include all code/date used by the entry function, and garbage collect all other code.

These option can be a win on large project, but this imply overhead in code.

In the normal mode, gcc put all code/data of a file in one section. And in this section stuff can't be moved by the linker.

But now gcc don't know how the linker will organise section and can cause overhead.

fdata-sections overhead

For example it will incread code size when accessing global data :

int bar;
int titi;
int tata=1;
int foo=2;

int toto(void) { return foo+tata+titi+bar; }

bar and titi are in bss tata and foo in data

arm-none-eabi-gcc -Os -c

00000000 <toto>:
   0:	e59f3020 	ldr	r3, [pc, #32]	; 28 <toto+0x28>
   4:	e8930005 	ldm	r3, {r0, r2}
   8:	e0800002 	add	r0, r0, r2
   c:	e59f3018 	ldr	r3, [pc, #24]	; 2c <toto+0x2c>
  10:	e5933000 	ldr	r3, [r3]
  14:	e0800003 	add	r0, r0, r3
  18:	e59f3010 	ldr	r3, [pc, #16]	; 30 <toto+0x30>
  1c:	e5933000 	ldr	r3, [r3]
  20:	e0800003 	add	r0, r0, r3
  24:	e12fff1e 	bx	lr
  28:
  2c:
  30:

arm-none-eabi-gcc -Os -fno-common -c

00000000 <toto>:
   0:	e59f3018 	ldr	r3, [pc, #24]	; 20 <toto+0x20>
   4:	e8930005 	ldm	r3, {r0, r2}
   8:	e0800002 	add	r0, r0, r2
   c:	e59f3010 	ldr	r3, [pc, #16]	; 24 <toto+0x24>
  10:	e893000c 	ldm	r3, {r2, r3}
  14:	e0800002 	add	r0, r0, r2
  18:	e0800003 	add	r0, r0, r3
  1c:	e12fff1e 	bx	lr
  20:
  24:

arm-none-eabi-gcc -Os -fno-common -fdata-sections -c

00000000 <toto>:
   0:	e59f3028 	ldr	r3, [pc, #40]	; 30 <toto+0x30>
   4:	e5930000 	ldr	r0, [r3]
   8:	e59f3024 	ldr	r3, [pc, #36]	; 34 <toto+0x34>
   c:	e5933000 	ldr	r3, [r3]
  10:	e0800003 	add	r0, r0, r3
  14:	e59f301c 	ldr	r3, [pc, #28]	; 38 <toto+0x38>
  18:	e5933000 	ldr	r3, [r3]
  1c:	e0800003 	add	r0, r0, r3
  20:	e59f3014 	ldr	r3, [pc, #20]	; 3c <toto+0x3c>
  24:	e5933000 	ldr	r3, [r3]
  28:	e0800003 	add	r0, r0, r3
  2c:	e12fff1e 	bx	lr
  30:
  34:
  38:
  3c:

Note that -fno-common can help to generate better code with bss data.

optimisation

  • 2 pass build : detect unused stuff and build and optimised version.
  • linker to patch the generated code ?

ffunction-sections overhead

Gcc sometimes need to use trampoline.

For example on armv4t, there is not blx instruction. codesourcery arm-2011.03 (elf target) generate code like :

000c7848 <conf_load_defaults>:
   c7848:       b538            push    {r3, r4, r5, lr}
[]
   c7870:       f000 f812       bl      c7898 <memcpy_from_thumb>
[]
   c7888:       bc01            pop     {r0}
   c788a:       4700            bx      r0

000c7898 <memcpy_from_thumb>: c7898: 4778 bx pc c789a: 46c0 nop ; (mov r8, r8) c789c: eaff630f b a04e0 <memcpy>

and with ffunction-sections, there is lot's of memcpy_from_thumb in different section and the linker doesn't merge them.

In fact gcc generate

[]
   6:   f7ff fffe       bl      0 <memcpy>
[]
and the linker patch the code !!!

Note : there was lot's of memcpy_from_thumb if we din't merge .text* in the linker script.

armv5t

using armv5t, we got

000c538c <conf_load_defaults>:
   c538c:       b538            push    {r3, r4, r5, lr}
[]
   c53b4:       f7da eea4       blx     a0100 <memcpy>
[]
   c53c8:       bd38            pop     {r3, r4, r5, pc}

other optimisation

build one big source file

make static the default stuff :

  • -fwhole-program

agregate all source file in one :

  • -combine

Eat lot's of memory

LTO

Extra notes

script to compare code

For comparing function size of 2 binaries, we can use

readelf -W -s prog1.elf | grep FUNC | sort -k8 | sort -n -s -k 3,3 | awk '{ print $3" "$8 }' > dump1
readelf -W -s prog2.elf | grep FUNC | sort -k8 | sort -n -s -k 3,3 | awk '{ print $3" "$8 }' > dump2
diff -u dump1 dump2

Thumb interworking

http://wiki.debian.org/ArmEabiPort#Choice_of_minimum_CPU

Instruction safe for interworking :

  1. mov pc,lr : starting armv7
  2. bx lr : starting armv4t
  3. ldm/ldr : starting armv5t
  4. blx : starting armv5t

This is a shame that arm did add thumb support from the start for normal branch operation

Posted Thu Jun 16 22:51:18 2011 Tags:

Installation of debian on hp elitebook

Partition

The pc got 4 primary partition : 300M Win7 boot 220G Win7 C: 40G HP_RESCUE 2G HP_TOOLS for efi stuff : http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01564727/c01564727.pdf?jumpid=reg_R1002_USEN

I want to keep all the partition and install linux, for that :

  • we need to destroy one primary partition to create an extended one
  • we choose HP_TOOLS, that could be in extended one
  • we skrink Win7 C: (using windows tools), but because of an unmovable file we can get only 110G free
  • we don't care getting 2G of delete HP_TOOLS

So we got :

  • 300M Win7 boot
  • 110G Win7 C:
  • extended
    • 106G linux
    • 2G swap
    • 2G HP_TOOLS
  • 40G HP_RESCUE
  • 2G free

Debian installer (testing) : encryption

I tried to encrypt my system, but this is a mess :

  • the installer want the swap encrypted
  • in order to support suspend to disk, we can't set random to the key and we need special setup
  • in case of error, all the install process need to be restarted

In the end, I created a /, /home and swap partition without encryption. Encryption will be done later.

bootloader

I don't want that grub mess with mbr.

I told grub to install in the / partition (the install help is wrong : it tells that sda2 is (hd0,2) but it is (hd0,1)

As recommended I used easyBCD : http://doc.ubuntu-fr.org/tutoriel/comment_amorcer_ubuntu_avec_bootmgr

But it is buggy : it removed the Win7 entry (after some manipulation, I had to restore it by hand).

NOTE : this tools gpl software (grub4dos), without giving credit/source.

NOTE2 : does grub update work ? Does easyBCD copy some part of the bootloader on windows partition ?

Alternative

  • make the grub partition a bootable one, and hope mbr will use it
    • the syslinux mbr can be used if the default one doesn't work
    • this will make all the chain mbr + grub opensource and don't depend of windows partition

But does windows can be on a partition that it is not bootable ?

  • use a mbr with hardcoded partition

touchpad

After booting on Linux the touchpad doesn't work with xorg (work with gpm). There is detection failure.

encryption

Finally I used the hp "DriveLock" feature.

What is it ?

It use the ATA-3 password feature (in high mode) : ftp://ftp.compaq.com/pub/supportinformation/papers/na118a0598.pdf

With that specification, a password need to be entered to unlock the drive when it is power-on.

So without the password, the disk should be not usable. But some of them got backdoors or you can dismantle the disk and read the magnetic cylinder.

But this should be enough to prevent robber to access my data.

Note : HP cache the password to not enter it again on suspend to ram.

Posted Sat Jun 11 00:00:00 2011
Matthieu Castet integer power of 2

futher optimisation

We saw than if B = 1 << b, then

  1. A * B == A << b
  2. A / B == A >> b
  3. A % B == A & (B - 1) == A & ((1U << b) - 1)

But there are interesting property if we have 2 power of 2.

  1. B1 * B2 == (1 << b1) * (1 << b2) == 1 << (b1 + b2)
  2. B1 / B2
    1. if B1 >= B2, (1 << b1) / (1 << b2) == 1 << (b1 - b2)
    2. if B1 < B2, 0
  3. A / (B1 / B2) == A / (1 << (b1 - b2)) == A >> (b1 - b2) because (B1 / B2) can't be null in C
  4. A * (B1 / B2)
    1. if b1 - b2 >= 0, A * (1 << (b1 - b2)) == A << (b1 - b2)
    2. if b1 - b2 < 0, 0

This means macro is not enough, but compiler isn't often clever to detect this. To have efficient code, better feed compiler with precomputed stuff.

int divu3(uint a, uint b)
{
        return a / ((1U<<b) / 4);
}

int divu300(uint a, uint b) { return a / (1<<(b-2)); }

divu3:
        stmfd   sp!, {r3, lr}
        mov     r3, #1
        mov     r1, r3, asl r1
        mov     r1, r1, lsr #2
        bl      {aeabi_uidiv
        ldmfd   sp!, {r3, pc}
divu300:
        sub     r1, r1, #2
        mov     r0, r0, lsr r1
        mov     pc, lr

PS : arm compiler is not able to optimize A / B and A * B …

Posted Fri Apr 29 23:05:24 2011 Tags:

FATFS

A tiny fatfs implementation can be found at http://elm-chan.org/fsw/ff/00index_e.html and petit implementation.

After looking at the code, it seems there is a problem in the code : it assumes fat sector size should match hardware sector size.

This is not true, fat sector size only need to be >= hardware sector size

Also the power of 2 trick can be used.

Note the guy got interresting SPI mmc stuff http://elm-chan.org/docs/mmc/mmc_e.html

Posted Thu Apr 28 23:19:15 2011 Tags:
Matthieu Castet power of 2 arithmetic

Theory

Suppose we have 2 unsigned integer A and B. B is a power of 2 ie there is a number b such as B = 1U << b

It can be checked with the following code

static inline ispowerof2(int b) {
	return b & !(b & (b - 1));
}

b can be computed by ilog2 function

int ilog2(unsigned int v) {
	int i = -1;
	while(v) {
		i++;
		v >>= 1;
	}
	return v;
}

or from http://en.wikipedia.org/wiki/Binary logarithm#Integer

/
 * Returns the floor form of binary logarithm for a 32 bit integer.
 * −1 is returned if ''n'' is 0.
 */
int floorLog2(unsigned int n) {
  if (n == 0)
    return -1;

int pos = 0; if (n >= 1<<16) { n >>= 16; pos += 16; } if (n >= 1<< 8) { n >>= 8; pos += 8; } if (n >= 1<< 4) { n >>= 4; pos += 4; } if (n >= 1<< 2) { n >>= 2; pos += 2; } if (n >= 1<< 1) { pos += 1; } return pos; }

In this case we have :

  1. A * B == A << b
  2. A / B == A >> b
  3. A % B == A & (B - 1) == A & ((1U << b) - 1)

Usage in C

Either we can define your own function for * / % or we can use compiler optimisation. If we replace B by 1 << b (with a macro), then

int divu(uint a, uint b)
{
        return a / (1U << b);
}

int modu(uint a, uint b) { return a % (1U << b); }

int mulu(uint a, uint b) { return a * (1U << b); }

will give you in arm assembly (arm-linux-gnueabi-gcc -Os -S )

divu:
        mov     r0, r0, lsr r1
        mov     pc, lr
modu:
        mvn     r3, #0
        bic     r0, r0, r3, asl r1
        mov     pc, lr
mulu:
        mov     r0, r0, asl r1
        mov     pc, lr

That's pretty fast without ugly operator redefinition !!!

But add and sub are slower [1], because we need to compute the shift.

Note a gcc missoptimisation (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48812)

int divu3(uint a, uint b)
{
        return a / ((1U<<b) / 4);
}

divu3:
        stmfd   sp!, {r3, lr}
        mov     r3, #1
        mov     r1, r3, asl r1
        mov     r1, r1, lsr #2
        bl      {aeabi_uidiv
        ldmfd   sp!, {r3, pc}

This is unfortunate, the compiler could have generated :

divu3:
        mov     r3, #1
        mov     r1, r3, asl r1
        mov     r1, r1, lsr #2
        mov     r0, r0, lsr r1

[1]

int addu(uint a, uint b)
{
        return a + (1U << b);
}
addu:
        mov     r3, #1
        add     r0, r0, r3, asl r1
        mov     pc, lr

Posted Tue Apr 26 22:38:20 2011 Tags:

This blog is powered by ikiwiki.