One technique I plan to use is macro-threading: giving different tasks to different groups of threads.
On current CUDA implementations, 32 logical threads (a warp) share the same 8 scalar processors (SPs) and follow exactly the same execution path. You can launch 32, 64, 96 … threads per multiprocessor (MP), so one MP with its 8 SPs can follow several execution paths.
Macro-threading is the idea of launching n>1 groups of 32 threads on an MP, each group having its own execution path and its own tasks, which may be totally different from those of the other groups.
Take the hash-table lookup in the move generator: it is a single load from global memory, but it blocks the generator for a long time, usually 200-400 GPU cycles, and those cycles are lost.
Launching a second group with the same task helps too, but the two groups then compete for registers and shared memory space (both very limited).
So imagine having a second group of 32 threads that performs the global-memory-related tasks for the first one:
- pre-fetching positions to evaluate ASAP
- doing hash-table lookups to avoid evaluating positions more than once
- writing back new positions to evaluate to global memory
With this kind of macro-threading, where the 2 groups of threads exchange positions to evaluate through shared memory, you may totally hide global memory latency.
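Here is a minimal sketch of that two-group exchange, just to make the idea concrete. It is not a real move generator: `evaluate`, `macroThreaded` and `runMacroThreaded` are hypothetical names, `evaluate` is a stand-in for the real per-position work, and I assume a single block of 64 threads and an input size that is a multiple of 32. Group 0 only computes, group 1 only touches global memory, and they swap data through two shared-memory buffers so that the load for the next tile overlaps the work on the current one.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int WARP = 32;

// Hypothetical stand-in for the real per-position work (move generation, evaluation).
__host__ __device__ inline unsigned evaluate(unsigned pos) { return pos * 2654435761u + 1u; }

// One block of 2 groups of 32 threads: group 0 computes, group 1 does all global memory I/O.
__global__ void macroThreaded(const unsigned* in, unsigned* out, int nTiles)
{
    __shared__ unsigned slotIn[2][WARP];   // positions handed to the compute group
    __shared__ unsigned slotOut[2][WARP];  // results handed back to the memory group

    const int warp = threadIdx.x / WARP;
    const int lane = threadIdx.x % WARP;

    // Prologue: the memory group loads the first tile.
    if (warp == 1 && nTiles > 0) slotIn[0][lane] = in[lane];
    __syncthreads();

    for (int t = 0; t < nTiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        if (warp == 0) {
            // Compute group: only touches shared memory, never waits on global memory.
            slotOut[cur][lane] = evaluate(slotIn[cur][lane]);
        } else {
            // Memory group: pre-fetch the next tile and write back the previous results.
            if (t + 1 < nTiles) slotIn[nxt][lane] = in[(t + 1) * WARP + lane];
            if (t > 0)          out[(t - 1) * WARP + lane] = slotOut[nxt][lane];
        }
        __syncthreads();  // both groups reach this barrier, then swap buffers
    }

    // Epilogue: write back the results of the last tile.
    if (warp == 1 && nTiles > 0)
        out[(nTiles - 1) * WARP + lane] = slotOut[(nTiles - 1) & 1][lane];
}

// Host helper (error checking omitted for brevity).
// Assumes positions.size() is a multiple of 32; any remainder is ignored.
std::vector<unsigned> runMacroThreaded(const std::vector<unsigned>& positions)
{
    const int nTiles = static_cast<int>(positions.size()) / WARP;
    const size_t bytes = static_cast<size_t>(nTiles) * WARP * sizeof(unsigned);
    std::vector<unsigned> result(static_cast<size_t>(nTiles) * WARP);
    if (nTiles == 0) return result;

    unsigned *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, positions.data(), bytes, cudaMemcpyHostToDevice);
    macroThreaded<<<1, 2 * WARP>>>(dIn, dOut, nTiles);
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The barrier sits outside the `if`/`else`, so each group follows its own path between two barriers but both always meet at the same `__syncthreads()`. A real version would probably want a finer-grained handshake than a full barrier per tile.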
In fact, if the move generator is too fast, we could balance the load dynamically from the memory-task group (which is really more of a programmed cache system!): it feeds the move generator with new positions to evaluate that haven't been written back to memory yet, ensuring that the generator consumes all available GPU cycles.
In the end, its algorithm may look something like this:
- For each new position to be written back, do a hash lookup (and if the value is known, handle it)
While the slots are not full, push the first new position that is not in the hash table into a move generator slot
(this gives the evaluator a new position to evaluate ASAP)
- While there are available move generator slots, pre-fetch positions to evaluate from global memory, and feed them in
(please group these reads!)
- Write back to the queue the new positions generated by the move generator that didn't fit in the slots
(please group these writes!)
It should limit global memory I/O while ensuring maximum usage of the move generator group of 32 threads.
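One pass of the memory-task group over the steps above could be sketched like this. Everything here is an assumption made for illustration: the names (`memoryTaskStep`, `WorkState`, `bucketOf`, `runStep`), the slot count, the trivial hash table where 0 means "empty", and the fact that known positions are simply dropped. The slots live in global memory only so the host can inspect them; in the scheme described above they would sit in shared memory. It also relies on `__ballot_sync`, `__popc` and `__syncwarp`, so it needs CUDA 9 or later.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int WARP = 32;
constexpr int SLOTS = 8;            // hypothetical number of move generator slots
constexpr unsigned TABLE = 1024;    // hypothetical hash-table size, entry 0 = empty
constexpr int QUEUE_CAP = 256;      // hypothetical capacity of the global work queue

struct WorkState {
    int slotCount;  // filled move generator slots
    int queueHead;  // next queued position to pre-fetch
    int queueTail;  // one past the last queued position
};

__host__ __device__ inline unsigned bucketOf(unsigned pos) { return pos % TABLE; }

// One pass of the memory-task group (launched as exactly one group of 32 threads)
// over up to 32 freshly generated positions.
__global__ void memoryTaskStep(const unsigned* fresh, int nFresh, const unsigned* table,
                               unsigned* slots, unsigned* queue, WorkState* st)
{
    const int lane = threadIdx.x;
    const int used = st->slotCount, head = st->queueHead, tail = st->queueTail;

    // 1. Hash lookup: a position already in the table is known and dropped here.
    const unsigned pos = lane < nFresh ? fresh[lane] : 0u;
    const bool isNew = lane < nFresh && table[bucketOf(pos)] != pos;

    // Rank the new positions so they land in consecutive addresses.
    const unsigned mask = __ballot_sync(0xffffffffu, isNew);
    const int rank = __popc(mask & ((1u << lane) - 1u));
    const int total = __popc(mask);
    const int pushed = min(total, SLOTS - used);

    if (isNew) {
        if (rank < pushed) slots[used + rank] = pos;           // feed the move generator
        else               queue[tail + rank - pushed] = pos;  // 3. grouped write-back
    }

    // 2. Slots still free: pre-fetch older queued positions with one grouped read.
    const int fetched = min(SLOTS - used - pushed, tail - head);
    if (lane < fetched) slots[used + pushed + lane] = queue[head + lane];

    __syncwarp();
    if (lane == 0) {
        st->slotCount = used + pushed + fetched;
        st->queueHead = head + fetched;
        st->queueTail = tail + (total - pushed);
    }
}

struct StepResult { std::vector<unsigned> slots, queue; };

// Host helper (error checking omitted): starts from empty slots and an empty queue.
// Assumes fresh.size() <= 32 and that all positions are non-zero.
StepResult runStep(const std::vector<unsigned>& fresh, const std::vector<unsigned>& known)
{
    std::vector<unsigned> table(TABLE, 0u);
    for (unsigned p : known) table[bucketOf(p)] = p;

    unsigned *dFresh, *dTable, *dSlots, *dQueue;
    WorkState* dSt;
    cudaMalloc(&dFresh, WARP * sizeof(unsigned));
    cudaMalloc(&dTable, TABLE * sizeof(unsigned));
    cudaMalloc(&dSlots, SLOTS * sizeof(unsigned));
    cudaMalloc(&dQueue, QUEUE_CAP * sizeof(unsigned));
    cudaMalloc(&dSt, sizeof(WorkState));
    cudaMemset(dSt, 0, sizeof(WorkState));
    cudaMemcpy(dFresh, fresh.data(), fresh.size() * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dTable, table.data(), TABLE * sizeof(unsigned), cudaMemcpyHostToDevice);

    memoryTaskStep<<<1, WARP>>>(dFresh, static_cast<int>(fresh.size()), dTable, dSlots, dQueue, dSt);

    WorkState st{};
    cudaMemcpy(&st, dSt, sizeof(WorkState), cudaMemcpyDeviceToHost);
    StepResult r;
    r.slots.resize(st.slotCount);
    r.queue.resize(st.queueTail - st.queueHead);
    cudaMemcpy(r.slots.data(), dSlots, r.slots.size() * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaMemcpy(r.queue.data(), dQueue + st.queueHead, r.queue.size() * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(dFresh); cudaFree(dTable); cudaFree(dSlots); cudaFree(dQueue); cudaFree(dSt);
    return r;
}
```

Because the new positions are ranked with a ballot, consecutive lanes write consecutive queue entries, which is what "group these writes" amounts to. The pre-fetch only happens when every new position already fit in the slots, so a single pass never reads and writes the queue at the same time.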
One caveat: you have to be sure there are enough registers for the MP to launch all these threads at once; otherwise one (or more) of these groups may not run simultaneously with the others.
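A quick way to check this is to ask the CUDA runtime how many registers the kernel uses and how many blocks of a given size fit on one MP. The sketch below assumes both groups are packed into a single block of 64 threads; `twoGroupKernel`, `LaunchFit` and `checkFit` are hypothetical names, and the kernel body is just a placeholder with two execution paths.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel with two groups of 32 threads following different paths.
__global__ void twoGroupKernel(float* data)
{
    const int warp = threadIdx.x / 32;
    if (warp == 0) data[threadIdx.x] *= 2.0f;  // stand-in for the compute group
    else           data[threadIdx.x] += 1.0f;  // stand-in for the memory-task group
}

struct LaunchFit {
    int regsPerThread;  // registers used by each thread of the kernel
    int blocksPerMP;    // blocks of this size that can be resident on one MP
    bool fits;          // true if at least one whole block (all groups) is resident
};

// Error checking omitted for brevity.
LaunchFit checkFit(int threadsPerBlock)
{
    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, twoGroupKernel);

    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, twoGroupKernel, threadsPerBlock, 0);

    return {attr.numRegs, blocks, blocks >= 1};
}
```

If the groups were launched as separate blocks instead, the condition to check would be that `blocksPerMP` is at least the number of groups that must run side by side.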