Introduction to Program Optimization [2002-11-10]


0) Foreword
--------------------------------------------

Any decent programmer equipped with an HLL can develop just about anything
you want; where good and bad differ is program execution speed, size,
portability, the time it takes to write, and maintainability / readability.
This article will attempt to explore the field of optimization for speed and
size. Unfortunately, gains there typically come at the expense of the other
factors, so unless you're doing it for fun or education, better make sure
what you're tweaking is actually a bottleneck (profile [1] the program in
normal operation).

Why should we optimize at all? Grossly suboptimal code wastes the user's
time and electricity; consider the net economic loss of all your users
having to twiddle their thumbs. Also, a more efficient application can
handle larger amounts of data at the same cost.

Sound good? Let's jump right in!


1) High level
--------------------------------------------

The first thing to do is make sure your design, data structures, and
algorithms are sound. There is no sense in writing an extremely tight bogo
sort [2] when it will likely be beaten by a simple implementation of
heapsort. Can you prove the worst-case computational complexity (Big-O
notation) of your algorithm?

Also important is a basic understanding of what's fast and what kills
performance. Simple arithmetic / logic instructions such as +, -, ^, & [...]
are very fast; multiplication is a bit slower, division more so; complex
floating point operations (e.g. exp, sin) are much slower still.

In the spirit of removing unnecessary work, hoisting (pulling loop-invariant
calculations out of a loop) and replacing repeated index calculations with a
running pointer can bring big gains. Example:

  int array[10][10];
  for(int i = 0; i < 10; i++)
      for(int j = 0; j < 10; j++)
          array[i][j] = i*10 + j;

  =>

  int* p = &array[0][0];
  for(int i = 0; i < 100; i++)
      *p++ = i;


2) Mid-level
--------------------------------------------

The biggest issues here are:

a) Memory Accesses

In the last few years, gains in CPU speed have far outstripped progress in
reducing DRAM access latency: memory is clocked much slower than the
processor, and it's a long way from the CPU die to the DRAM slots. One way
of closing this performance gap is a cache: a hierarchy of progressively
faster but smaller SRAM, which can quickly service common memory accesses
(cache hit) while deferring to the next slower level / main memory for data
not in the cache (cache miss). Given the x86's serious lack of registers,
computations can quickly become memory-bound; organizing memory accesses to
take advantage of the cache can bring huge performance increases. Now that
we are properly motivated, it may be worth returning to old Computer
Engineering textbooks and examining the operation of a cache. For reference,
Athlon XP-class CPUs have 64 KiB of Level 1 data cache, 2-way set
associative.

As a takeaway, consider the following simple guidelines:

- try to access memory sequentially (due to fragmentation, virtually
  contiguous pages may not be physically contiguous, but 4 KiB of sequential
  accesses is already a good start); a small sketch follows below.
- do not access more than 2 separate buffers that may map to the same cache
  lines (the L1 is only 2-way set associative). Example: when flipping an
  image vertically, do not use a row buffer; instead, copy the entire image
  and then overwrite the rows individually.
- use an optimized memcpy (see separate article) for performance gains of
  300% and more.
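To make the first guideline concrete, here is a minimal sketch (a made-up
2D-array sum, not taken from any particular program) contrasting sequential
(row-major) and strided (column-major) traversal of the same data. Once the
array no longer fits in the cache, the strided version misses on nearly
every access, while the sequential version uses each fetched cache line in
full:

  /* illustrative only: same work, different access patterns */
  #include <stdio.h>

  #define ROWS 1024
  #define COLS 1024
  static float grid[ROWS][COLS];

  /* sequential (row-major): walks memory linearly, so every byte of
     each fetched cache line is used before moving on */
  float sum_rows(void)
  {
      float sum = 0.0f;
      for(int y = 0; y < ROWS; y++)
          for(int x = 0; x < COLS; x++)
              sum += grid[y][x];
      return sum;
  }

  /* strided (column-major): jumps COLS*sizeof(float) bytes between
     accesses, touching a new cache line almost every time */
  float sum_cols(void)
  {
      float sum = 0.0f;
      for(int x = 0; x < COLS; x++)
          for(int y = 0; y < ROWS; y++)
              sum += grid[y][x];
      return sum;
  }

  int main(void)
  {
      for(int y = 0; y < ROWS; y++)
          for(int x = 0; x < COLS; x++)
              grid[y][x] = 1.0f;
      printf("%f %f\n", sum_rows(), sum_cols());
      return 0;
  }

Timing the two functions with a profiler makes the cost of cache misses hard
to ignore.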
One more guideline: do not use large lookup tables as an "optimization".
Tables that do not fit in L1 can be considered to miss the cache on every
access, and in that time much computation could have been done.

b) Conditional Jumps

A little background: since the 486, execution has been pipelined, i.e.
instructions go through an assembly line; typical stages are 'decode',
'fetch operands from memory', etc. Jumps force a stall in the pipeline: the
next instruction has to be read from the new location in memory and decoded,
and all work already done on the instructions fetched after the jump must be
discarded. This problem is exacerbated by the longer pipelines of recent
CPUs (especially P4s with their 20-stage pipes). One means of reducing this
performance impact is branch prediction: conditional jumps are predicted
taken / not taken. There are 2 types of prediction: static (usually BTFN,
i.e. backward => taken, forward => not taken) and dynamic (taking branch
history into account, e.g. gshare and combined algorithms). Correctly
predicted jumps are almost free; getting it wrong is painful.

The moral of this story is to keep mispredicted jumps to a minimum. In
particular, conditional jumps on random data, which will be predicted
incorrectly about 50% of the time, should be eliminated if at all possible.

Examples:

Replacing a jump altogether with setcc / cmovcc:

  if(a <= b) c = 2; else c = 1;   =>   c = (a <= b)? 2 : 1;

  cmp   a, b
  setle c
  inc   c

  or

  mov    c, 1
  mov    d, 2
  cmp    a, b
  cmovle c, d

Reducing the number of taken jumps if the jump can't be eliminated:

  if(cond)      ; true more often than false
      foo()
  else
      bar()

  ; compiles to:
  ; evaluate cond
  jcc else
  ...
  jmp end
  else:
  ...
  end:

  =>

  bar()
  if(cond)
  {
      undo_bar()
      foo()
  }

Before: 1 taken jump on the common path.
After: expected number of taken jumps = 1 - P(cond), which is << 1.

In general, try to organize code so that the expected case runs linearly.
Note, however, that well-predicted branches are basically free!


3) Low level
--------------------------------------------

At this level, there are many factors to consider: selecting instructions to
maximize decode throughput, reducing data dependencies, and scheduling to
increase parallelism. This is all quite CPU-dependent and cannot be imparted
with a few mere guidelines. However, there is a definite incentive to dig in
and understand what makes CPUs tick: even today, compilers produce
distressingly suboptimal code! As an example, the critical part of a CLOD
triangle stripper was re-implemented in asm and optimized to a very high
degree over several days. This yielded an improvement of 2.5x in triangle
throughput. Such an increase over the best attempt in C speaks for itself
and puts the lie to trite generalizations along the lines of "humans cannot
compete with modern compilers".

To conclude this section, here are a few low-level tricks that can be
applied without too much trouble. They are often out of the compiler's reach
because they rely on assumptions about the input domain that the compiler
cannot make.

1. Consider isBetween(x, 0, a) := (0 <= x && x < a). This can be implemented
   much more efficiently as ((unsigned)x < a): reinterpreted as unsigned,
   2's complement negative numbers become larger than any positive bound.
2. The trick can be extended to 2 such range checks:
   isBetween(x, 0, a) && isBetween(y, 0, a) becomes ((unsigned)(x|y) < a).
   This is exact when a is a power of two; for other bounds it never lets an
   out-of-range value through, but may reject some valid pairs.
3. When writing conditional expressions like index = (condition)? 1 : 0,
   keep an eye on implementing them via the carry flag. If transformed to
   -1 : 0, this can be handled very efficiently with the sbb instruction.


4) SIMD
--------------------------------------------

Having borne all the above suggestions in mind, your code will likely not be
shabby.
How can it be made to really shine? One answer is SIMD (Single Instruction,
Multiple Data): one operation is carried out on several pieces of data
simultaneously (e.g. (a,b) (+) (x,y) = (a+x, b+y)). The two floating-point
SIMD instruction sets available on x86 CPUs are SSE (Streaming SIMD
Extensions, requires Pentium III+) and 3DNow! (supported by all AMD chips
since the K6-2). Athlon XP processors also implement SSE, although its
implementation there is slower than equivalent 3DNow! code. I will focus on
3DNow!, because a) I think it's cleaner, better thought out, and easier to
use, and b) I'd like to spread the word - regrettably, not many people seem
to use it. Most things I'll talk about also apply to SSE, though.

Why use SIMD? Two answers: potentially significantly faster code, and a flat
register file instead of the standard x87 floating point stack (all
operations involve the top of the stack, which means optimized code has to
constantly exchange items on the stack. yuck.)

Example: the vector dot product is screaming for SIMD.

Before (completely unoptimized):

  ; c = a1*b1 + a2*b2 + a3*b3 + a4*b4
  fld   [a1]
  fmul  [b1]
  fld   [a2]
  fmul  [b2]
  fld   [a3]
  fmul  [b3]
  fld   [a4]
  fmul  [b4]
  faddp
  faddp
  faddp

After:

  movq  mm0, [a2_a1]
  movq  mm1, [a4_a3]
  pfmul mm0, [b2_b1]
  pfmul mm1, [b4_b3]
  pfadd mm0, mm1
  pfacc mm0, mm0

Looks promising, no? NB: this operation is a bit clumsier with SSE, as it
lacks an "accumulate" instruction (the SSE way is to 'shuffle' the items
around in the register and add).

Tips for integrating SIMD into an existing program:

- compiler support: VC++ 6 with the Processor Pack supports 3DNow! and SSE
  by means of macros; after #including the corresponding header, you can
  write e.g. pfadd(mm0, mm1). Note that only very basic memory addressing
  (base register) is implemented.
- NASM 0.98.34 supports the full instruction sets.
- detecting 3DNow!: call cpuid with extended function 0x80000001 and test
  bit 31 of edx (bit 30 indicates the extended 3DNow! instructions).
- porting: in my experience, it's best to convert tiny parts of the
  algorithm to assembly at a time - testing several hundred lines (actually,
  searching for the inevitable bug) is not fun.
- make sure you follow your 3DNow! code with an emms instruction: MMX /
  3DNow! instructions mark the entire x87 register stack as in use, so any
  floating point code executed before an emms will suffer a stack overflow.


5) Optimizing for Size
--------------------------------------------

Why would anyone want to do this in the days of huge hard drives? Granted,
disk space is cheap; however, it gives a good feeling to know that there is
no unnecessary bloat in the executable. Smaller size usually also means
faster startup, and it certainly pays off in the form of reduced bandwidth
bills / download times. Finally, tight programs are more impressive and
professional; this is the very goal in the realm of 'demos'.

A variety of articles already explain how to tweak compiler settings to
achieve slightly smaller executables. We instead focus on asm techniques and
tricks, especially as used in demos:

a) Whole program techniques

Here we can expect large gains of several hundred bytes for a 4 KiB demo.

- Manually import DLL functions by ordinal (see the sketch after this
  list); further, specify the desired functions via an 8-bit offset to the
  next ordinal.
- Build the PE header manually - this allows interleaving the DOS and PE
  headers and embedding useful data inside them.
- Use a dropper + 16-bit COM packer (e.g. APACK). Try to make the resulting
  machine code bytes repetitive to improve the compression ratio.
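To illustrate the first point in this list, here is a minimal sketch of
resolving an export by ordinal using plain Win32 calls; GetProcAddress
accepts an ordinal in place of a name. A real 4 KiB intro would instead walk
the DLL's export table by hand or build a hand-crafted import table, but the
idea is the same. The DLL name and the ordinal value 42 below are
placeholders for the example:

  /* sketch: resolving a DLL export by ordinal instead of by name.
     the ordinal (42) is a placeholder - look up real ordinals with a
     tool such as dumpbin /exports; they can change between DLL versions. */
  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      HMODULE dll = LoadLibraryA("user32.dll");
      if(!dll)
          return 1;

      /* low word = ordinal, high word = 0; no import-by-name string
         needs to be stored in the executable */
      FARPROC func = GetProcAddress(dll, (LPCSTR)(ULONG_PTR)42);
      printf("export #42 resolved to %p\n", (void*)func);

      FreeLibrary(dll);
      return 0;
  }

The bytes saved come from dropping the function-name strings; the further
suggestion of storing only 8-bit deltas between ordinals requires a
hand-rolled loader, which is beyond this sketch.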
b) Micro techniques

Here is a list to get things started:

- Use 1-byte opcodes whenever possible: push/pop, pushad/popad,
  xchg (with eax), the string instructions, inc/dec, cdq.
- Load an imm8 into a register via push byte / pop reg.
- Use e.g. lea edx, [eax+edx*4+123] to perform moderately complex
  calculations in one instruction.
- jecxz is slow, but saves one byte compared to dec ecx.
- When initializing data, access it via a base pointer instead of full
  32-bit addresses.
- With the above DLL import mechanism, indirect calls need only take
  3 bytes (call [ebp+4]).
- For each section of code, write down which values are in which registers
  at entry and exit, and which registers are clobbered. Reuse as many
  values as possible!


6) A few good links
--------------------------------------------

- www.agner.org
- www.azillionmonkeys.com
- www.sandpile.org
- hugi
- 256b
- +fravia

That's all, folks! Hopefully this makes a small contribution to motivating
and aiding in the reduction of wasted CPU cycles. Thanks to my friend
Sebastian Höfer for getting me to write on the subject. If this helped
and/or you'd like to see more info on a particular topic, drop me a line.
Questions / comments / corrections / additions / challenges welcome!
mailto:Jan.Wassenberg@stud.uni-karlsruhe.de

[1]: use an external program, e.g. AMD CodeAnalyst or Intel's VTune, to
     check which parts of the program account for most of the CPU time.
[2]: bogo sort: sorting by repeatedly generating random permutations of the
     data and checking whether the result is sorted.