Here's a simple piece of c-code (try zipped version) for testing how to parallelize code with OpenMP. It compiles with
gcc -fopenmp -lm otest.c
The CPU-load while running looks like this:
Looks like two logical CPUs never get used (two low lines beyond "5" in the chart). It outputs some timing information:
running with 1 threads: runtime = 17.236827 s clock=17.230000
running with 2 threads: runtime = 8.624231 s clock=17.260000
running with 3 threads: runtime = 5.791805 s clock=17.090000
running with 4 threads: runtime = 5.241023 s clock=20.820000
running with 5 threads: runtime = 4.107738 s clock=20.139999
running with 6 threads: runtime = 4.045839 s clock=20.240000
running with 7 threads: runtime = 4.056122 s clock=20.280001
running with 8 threads: runtime = 4.062750 s clock=20.299999
which can be plotted like this:
I'm measuring the clock-cycles spent by the program using clock(), which I hope is some kind of measure of how much work is performed. Note how the amount of work increases due to overheads related to creating threads and communication between them. Another plot shows the speedup:
The i7 uses Hyper Threading to present 8 logical CPUs to the system with only 4 physical cores. Anyone care to run this on a real 8-core machine ? 🙂
Next stop is getting this to work from a Boost Python extension.