The communication bottleneck is the worst thing in a multi-threaded chess program on CUDA (or OpenCL), as there’s few paper on that, especially considering the number of thread involved, compared to multi-CPU multi-core computers.
While Robert Hyatt and others try to have the best efficiency with up to 16 or 32 threads, maybe 64, the number of threads you may deal with on CUDA is one or two order of magnitude larger. And nodes/cpu are decreasing fast on more than 4 threads (using DTS high-performance parallel search), or 2 threads (with PVS algorithm).
A simple integrated graphic chip, like the GeFroce 9400M on laptops, is designed to run 144 threads at-once and might run thousands of threads, to hide memory latency. My desktop development system with GeForce GTX 260 needs at least 5 000 threads to be used at 100%.
The problem is that CUDA GPU needs a maximum number of threads to hide some instruction latencies (such as MULtiplications) or memory access latency (or worse ATOMIC m,emory access latency); and on the other side, actual parallel algorithm are designed for a few number of threads and unable to handle more than few dozen threads where we need to be able to handle thousands!
The problems:
- Feed thousands of running threads with subtrees
- Communicate efficiently between these threads (ATOMIC is a no-go)
- Balance load between threads, back and forth
Robert Hyatt DTS algorithm is an interesting point to start from, because it’s well documented, running on the best chess programs, with evident benefits over previous PVS algorithm.
My key idea wasn’t to make DTS works for 5 000 threads, but instead, to reduce the number of tasks that DTS should balance to the number of SM (each SM as 8 SP and handle from 32 to 768 threads basically), so there will be 2 tasks on an IGP GeForce 9400M and 27 tasks on GeForce GTX 260. Back on track with Robert Hyatt results
On each SM, I used only 32 threads in parallel, from the 192 threads expected to hide MUL latencies, because I don’t care about 1 MULtiplication, but care about register usage and Shared Memory limitations, respectively 8192 registers and 16KB.
Then I applied another reduction, grouping 8-threads into a group (micro-threading), having just 4 group by SM, each one processing one position, to -again- reduce communication needs between threads, using CUDA natural communication and synchronization technology: they share the same Instruction Pointer
So each SM, and it’s threads will be working on a subtree, doing a depth-first search, and splitting internally the work between the different macro-threads, with only 4 related position to examine in parallel. Reducing divergence between threads on the same warp because they work on positions isssued from the same split point, so they usually share a large amount of common representation and thus non-divergent processing.
And how to communicate between these 4 groups into the SM? Naturally using the DTS algorithm, with a simple implementation, as these threads are naturally synchronized by their Instruction Pointer, they will simply store results after each move or leaf and first one will handle split point to allocate work for the 4 into a group buffer in Shared memory.
At the SM level, there’s the advantages:
- More registers available for execution (usually 64)
- More Shared Memory available (approx. 4KB per group)
- Less divergence
- Quick balance of the workload between the 4 groups
- No latency when a branch is cut (done after each move!)
- Efficiency is approximately 75%
At the GPU-level:
- We use the same DTS algorithm well-known and efficient
- We handle between 2 and 31 SP, staying in the scope of DTS algorithm
- Efficiency is between 98% (2 SP) and 62% (27 SP)
On conclusion…
Using DTS algorithm on a high-level with low-level thread reduction and grouping (macro-threading and micro-threading), you could use a GPU with 64 threads at 74% and a GPU with 864 threads (GTX 260 216-core) with 47% efficiency, that is far more than expected using the best algorithms alone, with nVidia recommended number of threads.
PS: notice that the 47% efficiency correspond to only 10 threads using basic PVS algorithm, while we are actually using 864 threads on the GTX260 GPU!