In CUDA GPGPU programming, you will repeatedly find this advice: “launch as many threads as you can, not hundreds, but thousands”.
Launching many threads per MultiProcessor helps in many ways:
- Hide the latency of multiplications and other long operations
- Hide Global Memory & Shared Memory latencies (not totally, but it helps)
- Hide Atomic memory operation latencies
- Thus improve overall performance by keeping the calculation units busy nearly 100% of the time
I chose to do it the other way around: launch a minimal number of threads per MultiProcessor, to gain these advantages:
- A maximum number of MP Registers available per thread (to avoid spilling to “Local Memory”, which actually lives in the video card's main memory)
- Less pressure on atomic operations (e.g. for FIFO queues)
- A maximum amount of Shared Memory per running thread
- Less communication between threads (which must go through Global Memory and Atomic Operations, and whose cost grows steeply with the number of threads!)
In this case, though, my compute threads (see macro-threading) are optimized to *NEVER* perform any Global Memory or Atomic operation, which keeps the computing power in use nearly 100% of the time. So I don’t really need more than 32 threads (one warp) for computing, plus another warp that handles the Memory Operations through shared memory buffering.
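The two-warp arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual macro-threading code: the names `two_warp_kernel`, `heavy_compute` and `run_two_warp_demo` are hypothetical, the tile size of one float per lane is arbitrary, and `heavy_compute` is a stand-in for a real register-heavy algorithm. Warp 0 only touches registers and Shared Memory, while warp 1 prefetches the next tile and writes back the previous result, using two buffers so the warps never work on the same one.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarp = 32;  // threads per warp

// Hypothetical stand-in for a register-only "complex algorithm":
// evaluates 1 + x + x^2 + ... + x^16 with Horner's rule.
__host__ __device__ inline float heavy_compute(float x) {
    float acc = 1.0f;
    for (int i = 0; i < 16; ++i) acc = acc * x + 1.0f;
    return acc;
}

// One block = 2 warps. __launch_bounds__ tells the compiler the block never
// exceeds 64 threads, so it may budget more registers per thread.
__global__ void __launch_bounds__(2 * kWarp)
two_warp_kernel(const float* src, float* dst, int numTiles) {
    __shared__ float in[2][kWarp];   // double-buffered inputs
    __shared__ float out[2][kWarp];  // double-buffered results
    const int lane = threadIdx.x % kWarp;
    const bool isComputeWarp = (threadIdx.x / kWarp) == 0;
    const int stride = gridDim.x;
    int tile = blockIdx.x;

    // Prologue: the memory warp stages the first tile.
    if (!isComputeWarp && tile < numTiles)
        in[0][lane] = src[tile * kWarp + lane];
    __syncthreads();

    int buf = 0;    // buffer the compute warp works on in this iteration
    int prev = -1;  // tile whose result still has to be written back
    for (; tile < numTiles; tile += stride) {
        if (isComputeWarp) {
            // Compute warp: Shared Memory and registers only.
            out[buf][lane] = heavy_compute(in[buf][lane]);
        } else {
            // Memory warp: prefetch the next tile, flush the previous result.
            const int next = tile + stride;
            if (next < numTiles) in[buf ^ 1][lane] = src[next * kWarp + lane];
            if (prev >= 0) dst[prev * kWarp + lane] = out[buf ^ 1][lane];
        }
        __syncthreads();  // both warps swap buffers together
        prev = tile;
        buf ^= 1;
    }
    // Epilogue: flush the result of the last tile.
    if (!isComputeWarp && prev >= 0)
        dst[prev * kWarp + lane] = out[buf ^ 1][lane];
}

// Host helper: returns the number of mismatching elements, or -1 on a CUDA error.
int run_two_warp_demo(int numTiles, int numBlocks) {
    const int n = numTiles * kWarp;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_src(n), h_dst(n, 0.0f);
    for (int i = 0; i < n; ++i) h_src[i] = (i % 100) * 0.001f;

    float *d_src = nullptr, *d_dst = nullptr;
    if (cudaMalloc(&d_src, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_dst, bytes) != cudaSuccess) { cudaFree(d_src); return -1; }
    cudaMemcpy(d_src, h_src.data(), bytes, cudaMemcpyHostToDevice);
    two_warp_kernel<<<numBlocks, 2 * kWarp>>>(d_src, d_dst, numTiles);
    cudaError_t err = cudaMemcpy(h_dst.data(), d_dst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    if (err != cudaSuccess) return -1;

    int bad = 0;  // compare against the same function evaluated on the host
    for (int i = 0; i < n; ++i)
        if (std::fabs(h_dst[i] - heavy_compute(h_src[i])) > 1e-4f) ++bad;
    return bad;
}
```

Calling `run_two_warp_demo(1000, 16)` from `main` should return 0 mismatches on a working device; the block count of 16 is just an example value, and in practice you would probably launch about one block per MultiProcessor. Note that the only synchronization is one `__syncthreads()` per tile, with no atomics at all.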
64 threads (2 warps) per MultiProcessor leaves 128 registers available per thread, enabling me to write much more complex algorithms than with the usual 40+ registers you get when launching 192 threads per MP!
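The register figures above follow from a simple division. Here is a small sketch of that arithmetic, assuming a register file of 8192 registers per MultiProcessor (the size implied by 128 registers × 64 threads; it is not stated explicitly in the text and differs on other GPU generations). The helper name `registers_per_thread` is hypothetical, and the calculation ignores allocation granularity and any per-thread cap imposed by the hardware or compiler.

```cuda
// Host-only arithmetic, no GPU required.

// Registers each thread can get if the MultiProcessor's register file is
// shared evenly between all resident threads (integer division).
constexpr int registers_per_thread(int regsPerMP, int threadsPerMP) {
    return regsPerMP / threadsPerMP;
}

// Assumption: 8192 registers per MultiProcessor, as implied by 128 x 64.
constexpr int kRegsPerMP = 8192;

static_assert(registers_per_thread(kRegsPerMP, 64) == 128,
              "2 warps per MP: 128 registers per thread");
static_assert(registers_per_thread(kRegsPerMP, 192) == 42,
              "6 warps per MP: the usual 40+ registers per thread");
```

On a real device, the register file size can be read from the `regsPerMultiprocessor` field of `cudaDeviceProp` (filled in by `cudaGetDeviceProperties`) instead of being hard-coded.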
A side effect is that these complex algorithms consume much more computing power per memory operation, which lowers the pressure on the video card's memory system, in a virtuous circle.