Archive for August, 2009

The other way around…

Friday, August 21st, 2009

In CUDA GPGPU programming, you will repeatedly find this advice: “launch as many threads as you can, not hundreds, but thousands”.

Launching many threads per MultiProcessor helps in many ways:

  • Hide the latency of multiplications and other long operations
  • Hide Global Memory & Shared Memory latencies (hmm, not totally, but it helps)
  • Hide Atomic memory operation latencies
  • Thus improve overall performance by keeping the calculation units busy nearly 100% of the time

I choose to do it the other way around: launch a minimal number of threads per MultiProcessor, to have these advantages:

  • A maximum number of MP Registers available per thread (to avoid spilling into “Local Memory”, which actually lives in the main videocard memory)
  • Less pressure on atomic operations (e.g. for FIFO queues)
  • A maximum amount of Shared Memory per running thread
  • Less communication between threads (which must go through Global Memory and Atomic Operations, and whose cost grows rapidly with the number of threads!)

In my case, though, the compute threads (see macro-threading) are optimized to *NEVER* perform any Global Memory or Atomic operation, which ensures near 100% use of the computing power. So I don’t really need more than 32 threads (one warp) for computing, plus another warp that handles Memory Operations through Shared Memory buffering.

64 threads (2 warps) per MultiProcessor leaves 128 registers available per thread, enabling me to write much more complex algorithms than with the usual 40+ registers available when launching 192 threads per MP!
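
As a rough check of those numbers, here is a minimal sketch of the register budget in plain host-side C++. It assumes a register file of 8192 32-bit registers per MultiProcessor (the figure for compute capability 1.0/1.1 parts, which is what makes 64 threads come out at 128 registers each); the names `kRegistersPerMP`, `kWarpSize` and `registers_per_thread` are only illustrative. The result is an upper bound: the hardware allocates registers with some granularity, so the real figure can be lower.

```cpp
#include <cassert>

// Assumption: 8192 32-bit registers per MultiProcessor
// (compute capability 1.0/1.1). Not stated in the post itself.
const int kRegistersPerMP = 8192;
const int kWarpSize = 32;

// Upper bound on the registers each thread can get when
// `threads_per_mp` threads are resident on one MultiProcessor.
// Integer division: a fractional register is of no use.
int registers_per_thread(int registers_per_mp, int threads_per_mp) {
    assert(threads_per_mp > 0);
    return registers_per_mp / threads_per_mp;
}
```

Under that assumption, `registers_per_thread(kRegistersPerMP, 2 * kWarpSize)` gives 128 for the two-warp setup, while `registers_per_thread(kRegistersPerMP, 192)` gives 42, the “40+” of the usual configuration.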

One side effect is that these complex algorithms consume much more computing power per memory operation, diminishing pressure on the videocard memory system, in a virtuous circle :-)

Just upgraded my tools

Monday, August 10th, 2009

No, I didn’t reply to those emails asking me to buy pills or have some surgery on my most intimate parts :-)

I upgraded from an aluminium 15″ MacBook Pro to a unibody 17″ MacBook Pro, with a 30% faster CPU, a 25% faster GPU, 2X more GPU memory, and a bigger … screen!

It’s a pleasure to work with such a good computer. Having both the MCP79/GeForce 9400M IGP (which supports Pinned Mapped Memory) and the GeForce 9600M GT GPU, I can optimize for both and take into account the performance level of nVidia’s current IGP, which is anything but negligible, with peak Gflop/s above those of the Macs’ Dual-Core CPU.

ATI’s OpenCL CPU-Only!

Friday, August 7th, 2009

While nVidia currently supports OpenCL on its GPUs but not on the main CPU, ATI offers its own drivers that support the main CPU but not its GPUs! Anyway, ATI’s GPUs are not really designed for GPGPU and will lag far, far behind nVidia’s in real OpenCL implementations!

The purpose of OpenCL is to enable code to run on both CPUs and GPUs (even a mix of ATI and nVidia), not to run only on the CPU (where’s the novelty???) or to be restricted to a proprietary GPU!!!

At this time, CUDA seems to be the technology path to follow before switching to OpenCL in 2010 or 2011…

A great Chess Engine?

Monday, August 3rd, 2009

A great Chess Engine is a mix of different things:

  • great Chess Engine developers
  • Chess developers who play at Master level
  • Grand Master-level consultants
  • an opening book team
  • and means of implementing complex algorithms for selective search

I am not a great chess engine developer, nor a Master-level player (my ELO oscillates between 1200 and 1500, hmm…), I don’t have a Grand Master to help me, and no team for the opening book! :-(

And anyway, CUDA technology is much more efficient on simple algorithms, and due to the limited number of registers and the overall architecture, it seems unrealistic to implement the complex (and beautiful) algorithms found in the best Chess Engines (Rybka, Fritz, Junior, Hiarcs…).

More realistically, I may implement simple generic Chess algorithms without a deep understanding of what the Chess Engine itself is doing while playing. That’s limiting, but anyway, it’s for the fun!