Archive for July, 2009

Macro-threading

Friday, July 17th, 2009

One technique that I plan to use is macro-threading: giving different tasks to group of threads.
On CUDA actual implementations, you have 32 logical threads that shares the same 8 Scalar Processor and the same exact execution path. You could launch 32, 64, 96 … threads per MP, so you could follow multiple execution path on a group of 1 MP + 8 SP.

Macro-threading is the idea of launching n>1 groups of 32 threads on a MP, each group having it’s own execution path, and so each one having it’s tasks that may be totally different from the other.

Takes the hash-table lookup on the move generator: a single load from global memory, but this one will block the generator for too long and lost cpu-cycle. Usually 200-400 GPU cycles.
Launching a second group with the same task will help, either, but will be competing for registers and shared memory space (very limited).

So imagine having a second group of 32 threads that perform global memory-related tasks for the first one:
- pre-fetching positions to evaluate ASAP
- doing hash-table look-up to avoid evaluating positions more than once
- write-back new positions to evaluate in global memory

With this kinda macro-threading, exchanging position to evaluate using shared memory between the 2 groups of threads, you may hide totally global memory latency.

In fact, if move generator is too fast, we could balance it dynamically fom the memory-task thread (that is more a programmed cache system!), feeding the move generator with the new position to evaluate that wasn’t written back to memory, ensuring it will consume all available gpu-cycles.

It’s algorithm may be similar to that, finally:
- For each new position to be write-back, do a hash lookup (and if value is known, handle it)
Push into move generator slot, the first new position not in hash-table, while slots are not full
(gives a new positions to evaluate ASAP to evaluator)
- While there’s available move generator slots, pre-fetch positions to evaluate from global memory, and feed them
(please group these reads :-)
- Write-back in queue new positions generated by the move generator that didn’t fit in the slots
(please group these writes!)

It should limit Global Memory IO, while ensuring maximum usage of the Move generator group of 32 threads.

One caveat: you have to be sure that there will be enough registers to allow the MP to launch these threads at once, elsewhere one of these group (or more) may not be launched simultaneously with the other.

CPU vs CUDA GPU memory bandwidth

Wednesday, July 15th, 2009

What is the memory bandwidth of modern CPU versus that of CUDA-enabled GPU?

As far as I figured it out, I thought GPU memory bandwidth was huge, but I thought that memory bandwidth of CPU L1-cache could be effectively better than actual CUDA architecture.

With all the horsepower delivered by CUDA GPU, up to 10X Gigaflops on GTX than current Core i7/Nehalem processors, we all need to be able to feed them with data and unload results as fast as possible in memory (global videocard memory or computer’s main memory).

I found an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, in triple-channel with fast DDR3: L1 cache peaks around 50GB/s reading or writing but could do both at once, peaking at 100GB/s, while main computer memory (triple-channel DDR3) was limited to 16 GB/s. That’s actually astonishing anyway, a 3 years old Athlon X2 3800+ (2×2Hz) L1-cache doesn’t deliver more than actual main memory of today!!!

To compare the L1 cache of a CPU (32KB), we should use CUDA Shared Memory (16KB/8 Scalar Processors), and it delivers around 50GB/s too, a value that is strangely similar.

To compare the main memory of the computer we have the Global Memory and it delivers between 100GB/s and 150GB/s, nearly 8X the computer’s main memory bandwidth, due to multiple 64-bits interface (8 instead 3) and higher clock values.

But when you test a shared memory access or a L1-cache access speed, you have to think there’s 4 core on a core i7, each one with it’s dedicated L1-cache, peaking at 200GB-400GB/s depending on the tasks.

On the other side, with 30 groups of 8 Scalar Processors, the Shared Memory of a CUDA GTX 285 may deliver 1500 GB/s, around 4X the aggregated L1-cache of an overclocked Core i7!

To resume, CUDA-enabled GPU offers up to 8X the speed of main memory and 4X the speed of L1-cache compared to a moderne CPU, and it shows!

The problems I am facing

Sunday, July 12th, 2009

There’s many problems to implement gpu-efficient chess engine are these ones

  • short of Scalar Processor register
  • Global Memory latency (and lack of register if interrelated indirectly)
  • Shared memory coalesced accesses
  • Atomic operations latencies (due to Global Memory latency)

These increase the multi-threaded chess problem (or clustered chess engine), as latencies and moreover atomic operations latencies are blocking threads and loosing gpu-cycles.

More than that, with few Scalar Processor registers available, latencies could not be hidden using 64 threads per MP instead 32.

Anyway I am working on a proof-of-concept using Turochamp algorithm…