Archive for August, 2009

OpenCL supported in Snow Leopard, but…

Friday, August 28th, 2009

There are two shortcomings…

The first is that you can forget about the ATI HD 2400/2600 in the iMac and the ATI HD 2600 XT in the Mac Pro; in fact, no ATI card is supported except the 4850 & 4870 found in the latest iMac and Mac Pro. Naturally, any GeForce 8000 series, 9000 series or GT/GTS/GTX card is compatible with OpenCL & CUDA.

I thought that ATI had problems implementing OpenCL due to a lack of supporting hardware, and I suspect that even on the 4000 series, GPGPU (General-Purpose computing on GPU) performance will lag behind that of nVidia’s cards due to the lack of “real” Shared Memory.

The second shortcoming is that I haven’t seen any information about support for CPUs and multi-core CPUs in OpenCL. OpenCL is not designed only to make software that runs on a GPU, but to enable software to run on *ANY* modern GPU and *ANY* CPU too, to unleash the full potential of a computer.

I will have to check this and report numbers this weekend :-)

Why is nobody talking about OpenCL in Snow Leopard?

Thursday, August 27th, 2009

As I read the first reviews of Snow Leopard (due out tomorrow :-) ), I don’t see anything about OpenCL, exactly as if this incredible feature had been thrown away or postponed…

Tomorrow, I will wait in line at the Apple Store in Montreal to buy a copy, install it on a little 40GB partition of my MacBook Pro to try it out, and try to make good use of OpenCL: check whether OpenCL can use the dual-core CPU, the GeForce 9600M GT and also the GeForce 9400M and, moreover, all of them at the same time to deliver an impressive level of performance.

Think about it: on flopsBenchmark for _BIG_MAC, the 9400M delivers up to 39 GFlops, the 9600M GT up to 89 GFlops and the CPU may deliver around 20 GFlops, for a grand total of 148 GFlops, about 7X the CPU performance!!!
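As a quick sanity check of that sum, here is a minimal Python sketch. It assumes the three peak figures quoted above simply add up, which is an idealized best case that real workloads will rarely reach; the dictionary keys and function names are purely illustrative.

```python
# Peak figures quoted in the post, in GFlops.
# Assumption: perfect scaling, i.e. the peaks can simply be added together.
peak_gflops = {
    "GeForce 9400M": 39.0,
    "GeForce 9600M GT": 89.0,
    "CPU (dual-core)": 20.0,
}


def combined_peak(devices):
    """Return the sum of the peak throughput of all devices."""
    return sum(devices.values())


def speedup_over(devices, baseline):
    """Return the combined peak divided by the peak of one baseline device."""
    return combined_peak(devices) / devices[baseline]


if __name__ == "__main__":
    total = combined_peak(peak_gflops)
    ratio = speedup_over(peak_gflops, "CPU (dual-core)")
    print(f"combined peak: {total:.0f} GFlops")
    print(f"speedup vs CPU alone: {ratio:.1f}x")
```

With the numbers above, the ratio works out to 148 / 20 = 7.4, which is where the "about 7X" comes from.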

CUDA & OpenCL are not useful…

Wednesday, August 26th, 2009

After the first tests and benchmarks of OpenCL, which show that a GT120 (a renamed GeForce 9500GT), a $60 entry-level card, is equivalent for pure floating-point computing to 2 quad-core Xeons at 2.93GHz (with a beautiful $2000 price tag), critics are writing that CUDA & OpenCL are of limited use.

They see them as useful only for mathematical calculation, financial analysis (which is math too), and some niche markets.

They might have written the same about the appearance of the computer in the 50’s or the microcomputer in the 70’s!

Some uses of OpenCL & CUDA:

  • Any platform – virus detection accelerator
  • MySQL – Sort accelerator/Memory table accelerator
  • Apache & https – SSL accelerator
  • OS – Encryption accelerator (full encrypted HD or communication)
  • Photoshop – Image editing accelerator
  • Web – thumbnail generators
  • Video encoding – accelerator
  • Video decoding – accelerates proprietary formats TOO
  • Sound Mixing/Editing – DSP simulator, hardware accelerated
  • iPhoto – Face detection & face recognition
  • iTunes – Sound Equalizer, reverb, etc to enhance sound quality
  • Games – Physics computations (aka PhysX on nVidia)
  • Web browser – Image decompression, accelerated UI
  • Chess Game – 1 billion nodes/second on a PC or Mac!!!

Most of them are currently being implemented, in studies, in experiments, in open-source code and in proprietary software, and in the coming years, many of these tasks that are heavy for the CPU will rely more and more on the GPU.

Pinned Mapped Memory

Tuesday, August 25th, 2009

As you saw in my last post, I have one GeForce 9600M GT and a 9400M IGP. The latter has one advantage: Pinned Mapped Memory.

Pinned Mapped Memory is host memory that is mapped into the address space of both the host (CPU and software) and the GPU, enabling them to exchange data asynchronously while both are working, without interrupting a kernel or waiting for it to finish executing.

For example, in a chess engine, the CUDA kernel could send information back to the host application to be displayed in real time (such as main continuation, search depth, evaluation, node counts, …). In the other direction, the host could send information to the kernel without interrupting it, and this could even enable kernel-to-kernel communication between different GPUs!!!

This feature is shared by MCP79 (9300M/9400M) and G200 (GTX260/275/280/285 & 295) and is a great way to ease communication between host and kernel.

Both GPUs recognized on the current MacBook Pro

Sunday, August 23rd, 2009

The current line of MacBook Pro has 2 GPUs: a 9400M IGP (MCP79) with 256MB of shared DDR3, and a real GPU, the 9600M GT, with 512MB of video RAM.

But Apple advertised that you can only use one at a time: either the 9400M IGP, in maximum battery life mode, or the 9600M GT, in maximum performance mode, selected in the battery saver preferences…

The reality is quite different…

When you start up using the 9400M IGP, the 9600M GT is disabled and doesn’t appear in CUDA’s deviceQuery. That’s normal: shutting down the 9600M GT reduces energy consumption.

But when you start up using the 9600M GT, in maximum performance mode, the 9400M that is part of the chipset is not disabled: it appears in deviceQuery and, moreover, it can be tested using bandwidthTest -device=1


CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA

Device 0: "GeForce 9600M GT"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536543232 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.78 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 1: "GeForce 9400M"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 266010624 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.40 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Needless to say, this is great news for me: it lets me check real GPU code against the IGP and, moreover, begin using them both to get a GPU load balancer working with asymmetrical SLI.

:-)
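The load-balancing idea can be sketched in a few lines of Python. This is only one possible approach, not a description of any existing balancer: it assumes a static split of independent work items in proportion to each device's measured throughput, and it uses the 39 and 89 GFlops figures from the flopsBenchmark post as example weights. The function name `split_work` is hypothetical.

```python
def split_work(total_items, throughput):
    """Split total_items across devices in proportion to their throughput.

    throughput maps a device name to a positive rate (e.g. GFlops).
    Leftover items from rounding down go to the devices with the largest
    fractional share, so the parts always sum to total_items.
    """
    total_rate = sum(throughput.values())
    if total_rate <= 0:
        raise ValueError("total throughput must be positive")

    # Ideal (fractional) share for each device.
    exact = {name: total_items * rate / total_rate
             for name, rate in throughput.items()}
    # Round every share down, then hand out what is left.
    shares = {name: int(value) for name, value in exact.items()}
    leftover = total_items - sum(shares.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares


if __name__ == "__main__":
    # Example weights: peak GFlops quoted for the two GPUs (assumed, not measured here).
    gpus = {"GeForce 9600M GT": 89.0, "GeForce 9400M": 39.0}
    print(split_work(1024, gpus))
```

A static split like this only works well when every item costs about the same; for uneven workloads, a queue that each GPU pulls from when it goes idle would probably balance better.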