Difference between revisions of "Different implementations of a simple Collatz iterator and their performance"

Revision as of 04:42, 4 January 2009

The Collatz conjecture was proposed by Lothar Collatz in 1937. The conjecture is also known as the 3n + 1 conjecture.

The procedure is that if n is divisible by two then divide by two, else multiply by 3 and add 1, iterate until n reaches 1. The unproven conjecture is that for all values of n the procedure will always reach 1.

The benchmark times the iteration of the 2²⁶ first values of n.

1 Visual C++
- 1.1 Vista 64, 2.4 GHz core 2 Q6600
- 1.2 Vista 32, 2.66 GHz core 2 E6750
2 Visual Basic 2008
- 2.1 Vista 64, 2.4 GHz core 2 Q6600
- 2.2 Vista 32, 2.66 GHz core 2 E6750
3 x86 assembly language
- 3.1 Vista 32, 2.66 GHz core 2 E6750
4 CUDA
- 4.1 GeForce 8800 GTS 512 MB (G92) 1.625 GHz
5 BSGP
- 5.1 GeForce 8800 GTS 512 MB (G92) 1.625 GHz
6 Conclusions
- 6.1 CPU performance
- 6.2 GPU performance
7 External links

Visual C++

Both Microsoft Visual C++ and Intel C++ were tested and they gave similar results.

Vista 64, 2.4 GHz core 2 Q6600

22.1s - 64 bit exe, one thread, unrolled 1000 times
30.5s - 64 bit exe, one thread, unrolled 10 times
45.9s - 32 bit exe, one thread, unrolled 1000 times
49.2s - 64 bit exe, one thread, not unrolled
53.3s - 32 bit exe, one thread, unrolled 10 times
76.1s - 32 bit exe, one thread, not unrolled

Vista 32, 2.66 GHz core 2 E6750

41.3 - 32 bit exe, one thread, unrolled 1000 times
47.7 - 32 bit exe, one thread, unrolled 10 times
65.6 - 32 bit exe, one thread, not unrolled

Visual Basic 2008

The code is compiled to common intermediate language that is made executable by the .net just in time compiler. The advantage is that the JIT compiler can optimize the code for the specific CPU at runtime. In this case the slower CPU with 64 bit registers is the fastest because of the optimization. The code should also run unmodified on Linux with Mono installed. The multithreaded version uses Microsoft Parallel Extensions to .NET Framework 3.5 and required almost no change to the program.

Vista 64, 2.4 GHz core 2 Q6600

55.4s - .net exe, one thread
11.3s - .net exe, four threads, unrolled 100 times

Vista 32, 2.66 GHz core 2 E6750

66.4s - .net exe, one thread
42.9s - .net exe, one thread, unrolled 100 times
33.6s - .net exe, two threads
22.9s - .net exe, two threads, unrolled 100 times

x86 assembly language

Because of the suspected low quality of 32 bit code generated by modern C++ compilers a hand crafted assembly version was made for comparison. The performance gain was made by keeping the value in a 64 bit MMX register and exploiting some statistical properties of the algorithm that is not obvious to a normal compiler.

Vista 32, 2.66 GHz core 2 E6750

55.2s - optimized, not unrolled
26.4s - optimized, unrolled 10 times

CUDA

To check the quality of the compiler a handcrafted version was made in PTX assembly, it produced similar results.

GeForce 8800 GTS 512 MB (G92) 1.625 GHz

2.07s - 128 blocks x 256 threads

BSGP

The Bulk-Synchronous GPU Programming compiler is highly advanced and is capable of impressive results but the documentation is severely lacking and the future of the compiler seems unclear.

GeForce 8800 GTS 512 MB (G92) 1.625 GHz

2.04s - 32768 threads

Conclusions

CPU performance

GPU performance

All three versions gave almost identical results which suggests that it is easy to achieve consistently good results if the algorithm and the hardware is well understood. The slight disappointment is that the GPU with 112 cores running at half the CPU frequency did not achieve 112 / 2 = 56 times speedup but only 11 times. The same code running at four CPU cores would reduce the advantage to 2.75 times. This does not mean that the GPU is disappointingly slow, it means that it is better at some things than other, just like a CPU.

External links

@@ Line 6: / Line 6: @@
 == Visual C++ ==
+Both Microsoft Visual C++ and Intel C++ were tested and they gave similar results.
 === Vista 64, 2.4 GHz core 2 Q6600 ===
 *22.1s - 64 bit exe, one thread, unrolled 1000 times
@@ Line 37: / Line 38: @@
 == x86 assembly language ==
-Because of the suspected low quality of 32 bit code generated by modern C++ compilers a hand crafted assembly version was made for comparison. The performance gain was made by keeping the value in a 64 bit MMX register and exploiting some statistical properties of thg ealgorithm that is not obvious to a normal compiler.
+Because of the suspected low quality of 32 bit code generated by modern C++ compilers a hand crafted assembly version was made for comparison. The performance gain was made by keeping the value in a 64 bit MMX register and exploiting some statistical properties of the algorithm that is not obvious to a normal compiler.
 === Vista 32, 2.66 GHz core 2 E6750 ===
@@ Line 45: / Line 46: @@
 == CUDA ==
+To check the quality of the compiler a handcrafted version was made in PTX assembly, it produced similar results.
 === GeForce 8800 GTS 512 MB (G92) 1.625 GHz  ===
 * 2.07s - 128 blocks x 256 threads
@@ Line 50: / Line 52: @@
 == BSGP ==
-The [http://www.kunzhou.net/#BSGP Bulk-Synchronous GPU Programming] compiler is advanced and is capable of impressive results but the documentation is severely lacking and the future of the compiler seems unclear.
+The [http://www.kunzhou.net/#BSGP Bulk-Synchronous GPU Programming] compiler is highly advanced and is capable of impressive results but the documentation is severely lacking and the future of the compiler seems unclear.
 === GeForce 8800 GTS 512 MB (G92) 1.625 GHz  ===
 * 2.04s - 32768 threads
-== Conclusion ==
+== Conclusions ==
+=== CPU performance ===
+=== GPU performance ===
+All three versions gave almost identical results which suggests that it is easy to achieve consistently good results if the algorithm and the hardware is well understood. The slight disappointment is that the GPU with 112 cores running at half the CPU frequency did not achieve 112 / 2 = 56 times speedup but only 11 times. The same code running at four CPU cores would reduce the advantage to 2.75 times. This does not mean that the GPU is disappointingly slow, it means that it is better at some things than other, just like a CPU.

Difference between revisions of "Different implementations of a simple Collatz iterator and their performance"

Revision as of 04:42, 4 January 2009

Contents

Visual C++

Vista 64, 2.4 GHz core 2 Q6600

Vista 32, 2.66 GHz core 2 E6750

Visual Basic 2008

Vista 64, 2.4 GHz core 2 Q6600

Vista 32, 2.66 GHz core 2 E6750

x86 assembly language

Vista 32, 2.66 GHz core 2 E6750

CUDA

GeForce 8800 GTS 512 MB (G92) 1.625 GHz

BSGP

GeForce 8800 GTS 512 MB (G92) 1.625 GHz

Conclusions

CPU performance

GPU performance

External links

Navigation menu

Views

Personal tools

Contents

Science Zero

Search

Tools