libmad on the BeagleBone

(or really any Cortex-A)

[[notes/bonemad]]

I’m mainly interested in performance, since power used ~= 1/performance.

Using libmad-0.15.1b-7ubuntu1 from Precise and Linaro GCC 4.6 2012.02 as a cross build setup.

libmad has its own complicated and out-of-date way of selecting the optimisations: it scans the CFLAGS to pull out the optimisation level. Ubuntu turns this into a -O2.

At 720 MHz over USB.

Default -O2 setup: 23.4 s, 21.4 s, 21.5 s, 21.5 s

Switch to userspace governor and lock at 720 MHz: 21.2 s, 21.2 s, 21.2 s

-O3 setup: 21.2 s, 21.2 s…

…as it’s picking up the system libmad!

-O3 setup: 21.5 s, 21.5 s

-O2 setup: 21.5 s, 21.5 s

There’s very little difference in size (~30 bytes), as the Ubuntu patch forces it to -O2 and ignores the earlier CFLAGS parsing.

-O3 setup: 21.7 s, 21.7 s - slower!

Disable the assembly routines and see how it changes.

-O3 noasm: 20.0 s, 20.0 s

-O2 noasm: 20.4 s, 20.4 s

-O3 noasm -mfpu=neon (turns on the vectoriser): 19.9 s, 19.9 s.

Not much change, which suggests a very hot function or some bad code. perf time!

perf report:

55.17%  minimad  libc-2.13.so       [.] _IO_putc
15.50%  minimad  libmad.so.0.2.1    [.] synth_full
 7.02%  minimad  libmad.so.0.2.1    [.] III_decode
 6.51%  minimad  libmad.so.0.2.1    [.] loop
 5.29%  minimad  libmad.so.0.2.1    [.] dct32
 2.64%  minimad  minimad            [.] output
 1.81%  minimad  libmad.so.0.2.1    [.] III_imdct_l
 1.29%  minimad  libmad.so.0.2.1    [.] mad_bit_read
 0.97%  minimad  libmad.so.0.2.1    [.] III_aliasreduce
 0.96%  minimad  libmad.so.0.2.1    [.] normal_block_x0_to_x17
 0.89%  minimad  libmad.so.0.2.1    [.] normal_block_x18_to_x35
 0.54%  minimad  minimad            [.] mad_stream_errorstr@plt
 0.41%  minimad  minimad            [.] mad_decoder_finish@plt

Or, in other words, dominated by the sample writer in minimad. Probably due to this:

sample = scale(*left_ch++);
putchar((sample >> 0) & 0xff);
putchar((sample >> 8) & 0xff);

if (nchannels == 2) {
  sample = scale(*right_ch++);
  putchar((sample >> 0) & 0xff);
  putchar((sample >> 8) & 0xff);
}

It writes 1152 samples per callback, one byte at a time through putchar(). Turn this into a scale-to-buffer to keep some semblance of an output layer…
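A minimal sketch of what a scale-to-buffer output could look like, not the exact change that was benchmarked. fixed_t, FRACBITS, pack_pcm16 and write_pcm16 are hypothetical names; fixed_t and FRACBITS stand in for mad.h's mad_fixed_t and MAD_F_FRACBITS so the block compiles on its own, and scale() follows the round/clip/quantise steps of minimad's helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef int32_t fixed_t;                 /* stand-in for mad_fixed_t (assumption) */
#define FRACBITS    28                   /* stand-in for MAD_F_FRACBITS (assumption) */
#define FIXED_ONE   ((fixed_t)1 << FRACBITS)
#define MAX_SAMPLES 1152                 /* samples per callback, from the notes */

/* Round, clip and quantise one fixed-point sample to 16 bits. */
static int scale(fixed_t sample)
{
    sample += (fixed_t)1 << (FRACBITS - 16);
    if (sample >= FIXED_ONE)
        sample = FIXED_ONE - 1;
    else if (sample < -FIXED_ONE)
        sample = -FIXED_ONE;
    /* relies on arithmetic right shift of negative values, as GCC provides */
    return sample >> (FRACBITS + 1 - 16);
}

/* Pack nsamples frames as little-endian 16-bit PCM into buf.
 * right may be NULL for mono. Returns the number of bytes written. */
size_t pack_pcm16(unsigned char *buf, const fixed_t *left,
                  const fixed_t *right, size_t nsamples)
{
    unsigned char *p = buf;

    for (size_t i = 0; i < nsamples; ++i) {
        int s = scale(left[i]);
        *p++ = (unsigned char)(s & 0xff);
        *p++ = (unsigned char)((s >> 8) & 0xff);
        if (right != NULL) {
            s = scale(right[i]);
            *p++ = (unsigned char)(s & 0xff);
            *p++ = (unsigned char)((s >> 8) & 0xff);
        }
    }
    return (size_t)(p - buf);
}

/* Emit one callback's worth of audio with a single fwrite().
 * Returns 0 on success, -1 on oversized input or short write. */
int write_pcm16(FILE *out, const fixed_t *left,
                const fixed_t *right, size_t nsamples)
{
    unsigned char buf[MAX_SAMPLES * 2 * 2];  /* 2 channels, 2 bytes each */

    if (nsamples > MAX_SAMPLES)
        return -1;
    size_t n = pack_pcm16(buf, left, right, nsamples);
    return fwrite(buf, 1, n, out) == n ? 0 : -1;
}
```

In an output callback this would replace the per-sample loop with something like write_pcm16(stdout, left_ch, nchannels == 2 ? right_ch : NULL, nsamples).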

8.3 s, 8.3 s. Much better. perf shows:

36.80%  minimad  libmad.so.0.2.1    [.] synth_full
17.44%  minimad  libmad.so.0.2.1    [.] III_decode
15.46%  minimad  libmad.so.0.2.1    [.] loop
13.12%  minimad  libmad.so.0.2.1    [.] dct32
 3.84%  minimad  libmad.so.0.2.1    [.] III_imdct_l
 3.32%  minimad  libmad.so.0.2.1    [.] mad_bit_read
 2.33%  minimad  libmad.so.0.2.1    [.] III_aliasreduce
 2.31%  minimad  libmad.so.0.2.1    [.] normal_block_x18_to_x35
 2.21%  minimad  libmad.so.0.2.1    [.] normal_block_x0_to_x17
 0.59%  minimad  minimad            [.] output

-O3 noasm novect: 8.6 s, 8.7 s

Note the noasm version is lower fidelity. I guess the assembly version keeps things in 64 bits for longer.
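To illustrate the guess about fidelity, here is a small sketch comparing a fixed-point multiply that keeps the full 64-bit product with one that throws away low bits of each operand so the product fits in 32 bits. mul_wide and mul_narrow are hypothetical names and the 28 fractional bits are an assumption; I have not checked which multiply the noasm build actually selects, so treat this as an illustration rather than a quote of libmad's fixed.h.

```c
#include <stdint.h>

#define FRACBITS 28   /* assumed number of fractional bits */

/* Full-width multiply: 32x32 -> 64-bit product, scaled once at the end.
 * Relies on arithmetic right shift for negative products (true for GCC). */
int32_t mul_wide(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> FRACBITS);
}

/* Narrow multiply: round away 12 bits of x and 16 bits of y first
 * (12 + 16 = 28), so the product needs no 64-bit intermediate.
 * Assumes inputs are small enough that neither the rounding add
 * nor the 32-bit product overflows. */
int32_t mul_narrow(int32_t x, int32_t y)
{
    return ((x + (1 << 11)) >> 12) * ((y + (1 << 15)) >> 16);
}
```

For round values the two agree: 0.5 * 0.5 gives 1 << 26 either way. For a small operand such as y = 0x00000FFF against x = 0x0FFFFFFF, the narrow version returns 0 while the wide version returns 4094, because all of y's significant bits sit below the 16 that were dropped.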

-O3 asm novect: 10.3 s, 10.3 s

-O3 noasm vect -marm: 8.2 s, 8.2 s. So Thumb-2 is similar to ARM mode.

-O3 noasm vect -mtune=cortex-a8: 8.8 s, 8.1 s, 8.1 s. Tuning for the A8 instead of the A9 is slightly better.

Hot functions

Best so far is -O3 noasm vect -mtune=cortex-a8 in Thumb-2. Hot functions are:

37.33%  minimad  libmad.so.0.2.1    [.] synth_full
17.46%  minimad  libmad.so.0.2.1    [.] loop
15.93%  minimad  libmad.so.0.2.1    [.] III_decode
12.10%  minimad  libmad.so.0.2.1    [.] dct32
 4.23%  minimad  libmad.so.0.2.1    [.] III_imdct_l
 2.75%  minimad  libmad.so.0.2.1    [.] mad_bit_read

synth_full is dominated by the ML0 and MLA multiply-accumulate macros. There’s a 1..16 loop in there which the vectoriser could hit. The compiler is already spotting the multiply-accumulates and emitting mlas.
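The shape of work in question is a short chain of multiply-accumulates per output sample. A rough sketch of that pattern as a plain loop, which is the form a vectoriser has the best chance with; dot_q28 is a hypothetical helper, not code from synth.c, and the 28 fractional bits are an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define FRACBITS 28   /* assumed number of fractional bits */

/* Multiply-accumulate n taps into a 64-bit accumulator and scale once.
 * Each iteration is one widening multiply-accumulate.
 * Relies on arithmetic right shift for negative sums (true for GCC). */
int32_t dot_q28(const int32_t *x, const int32_t *w, size_t n)
{
    int64_t acc = 0;

    for (size_t i = 0; i < n; ++i)
        acc += (int64_t)x[i] * w[i];

    return (int32_t)(acc >> FRACBITS);
}
```

With eight taps of 0.5 against weights of 0.125 this returns 0.5, i.e. 1 << 27. Whether GCC 4.6 can vectorise the real loop depends on how the window and filter-bank arrays are indexed there, which this sketch does not try to reproduce.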