Archive for November, 2009

ATI Radeon 4xxx OpenCL benchmarks

Tuesday, November 10th, 2009

There are some OpenCL benchmarks out there, and on OpenCL Benchmark, which tests real GPGPU computation instead of theoretical raw processing power, the Radeon 4xxx series lags far, far behind nVidia’s current GPUs.

The ATI Radeon 3xxx and 2xxx are not supported at all, whereas nVidia’s GPUs have been supported since the 2006 G80 (GeForce 8800 and any GeForce 8 series or later GPU). The Radeon 4xxx is supported but underperforms, because it lacks shared memory (memory inside each processor core).

Lacking “shared memory” means that for any data access the Radeon 4xxx has to go to the card’s global video memory, which is usually 20X to 30X slower. Worse, memory bandwidth on Radeon graphics cards is 2X to 3X lower than on nVidia’s. This is not a handicap for games, where Radeons are really great graphics cards, but it is for GPGPU and OpenCL.

The result of lacking shared memory and having slower graphics card memory: a Radeon 4870 (around $200 street) cannot compete with the GeForce 9400M IGP (found in the Mac Mini, MacBook Air, MacBook…), and a GeForce 9400M iMac will beat any ATI Radeon 4850 iMac when it’s time to compare OpenCL performance! :-(
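To give a feel for why on-chip shared memory matters so much, here is a rough back-of-the-envelope model (not a measurement, and not taken from any benchmark). It counts global memory reads for a 1D stencil, where each output needs its 2*radius+1 neighbours: first when every work-item reads straight from global memory, then when each work-group stages its tile in shared memory once. The stencil, the sizes and the function names are my own illustrative assumptions, and the model ignores caches entirely.

```python
def global_reads_naive(n_outputs, radius):
    """No shared memory: every work-item fetches its whole
    neighbourhood (2*radius + 1 values) from global memory."""
    return n_outputs * (2 * radius + 1)


def global_reads_tiled(n_outputs, radius, group_size):
    """With shared memory: each work-group loads its tile plus a halo
    of `radius` values on each side once, then reuses it on-chip."""
    n_groups = -(-n_outputs // group_size)  # ceiling division
    return n_groups * (group_size + 2 * radius)


if __name__ == "__main__":
    # Hypothetical problem size, chosen only for illustration.
    n, r, g = 1 << 20, 8, 256
    naive = global_reads_naive(n, r)
    tiled = global_reads_tiled(n, r, g)
    print(f"naive: {naive} global reads")
    print(f"tiled: {tiled} global reads")
    print(f"ratio: {naive / tiled:.1f}x")
```

The wider the neighbourhood each work-item needs, the more traffic a shared-memory tile saves, which is why a GPU without usable shared memory pays the global memory price on every access.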

How to use CUDA for H.264 encoding?

Tuesday, November 3rd, 2009

CUDA is a powerful technology: incredibly powerful GPUs and a superb suite of development, debugging and profiling tools. The x264 project tried to make it work in their excellent H.264 video encoder (which is blazingly fast on the CPU, with great video quality).

They failed or, to put it differently, they chose not to use it and to consider other ways to accelerate encoding instead, such as dedicated hardware accelerators (like the ElGato Turbo.264 HD that I use on my laptop).

There are several ways to use CUDA as an H.264 encoding accelerator:

  • Move a CPU-hungry part of the algorithm to the GPU. This was their first choice, but that part turned out slower in their GPU implementation than its CPU counterpart. FAILED!
  • Move the whole encoding chain to the GPU. But since the most compute-intensive part is actually slower on the GPU (as they tried to implement it), it’s a loss. FAILED!
  • Move the whole encoding chain to the GPU, *BUT* dynamically give it a different part of the movie to encode, so that instead of switching over to the GPU, the CPU and GPU are combined to do the whole encoding.

The third option is a different way to look at the GPU: not as a co-processor in the middle of a CPU algorithm, but as an asymmetric computing resource, able to give a 10% to 30% performance gain on the whole process.

This is the approach I am currently exploring, with the goal of getting a gain in H.264 encoding over pure CPU, and of being able to port it to OpenCL with dedicated algorithms for CPU and GPU :-)
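The scheduling idea behind the third option can be sketched in a few lines. This is only a toy model under my own assumptions, not a working encoder: it assumes the movie can be cut into independently encodable chunks, and `encode_all`, the worker names and the fake encoders are all hypothetical placeholders. Each worker simply pulls the next chunk whenever it is free, so the faster one naturally ends up encoding more of the movie.

```python
import queue
import threading
import time


def encode_all(chunks, encoders):
    """Encode every chunk exactly once; each worker pulls work when free.

    `encoders` maps a worker name to a function chunk -> encoded chunk.
    Returns (encoded chunks in original order, chunk count per worker).
    """
    todo = queue.Queue()
    for index, chunk in enumerate(chunks):
        todo.put((index, chunk))

    results = [None] * len(chunks)
    counts = {name: 0 for name in encoders}

    def worker(name, encode):
        while True:
            try:
                index, chunk = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to encode
            results[index] = encode(chunk)  # each index is written by one worker only
            counts[name] += 1               # each worker touches only its own counter

    threads = [threading.Thread(target=worker, args=item)
               for item in encoders.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, counts


if __name__ == "__main__":
    # Fake encoders: the sleeps stand in for encoding time (made-up numbers).
    def fake_cpu_encode(chunk):
        time.sleep(0.01)
        return f"encoded({chunk})"

    def fake_gpu_encode(chunk):
        time.sleep(0.04)  # pretend the GPU path is slower than the CPU path
        return f"encoded({chunk})"

    encoded, per_worker = encode_all(
        [f"chunk{i}" for i in range(20)],
        {"cpu": fake_cpu_encode, "gpu": fake_gpu_encode},
    )
    print(per_worker)
```

With ideal load balancing, a GPU path running at a fraction r of the CPU’s speed gives an overall throughput of (1 + r) times the CPU alone. A 10% to 30% gain therefore corresponds to r between 0.1 and 0.3: the GPU encoder does not need to beat the CPU to be useful.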