Posts Tagged ‘benchmark’

ATI Radeon 4xxx OpenCL benchmarks

Tuesday, November 10th, 2009

There are some OpenCL benchmarks out there, and on those that test real GPGPU computation, rather than raw processing power on theoretical computation, the Radeon 4xxx series lags far, far behind nVidia’s current GPUs.

ATI Radeon 3xxx and 2xxx are not supported at all, whereas nVidia’s GPUs have been supported since the 2006 G80 (GeForce 8800 and any GeForce 8 series or later GPU). The Radeon 4xxx are supported but underperform because they lack shared memory (memory inside each processor core).

Lacking “shared memory” means that for any data access the Radeon 4xxx has to go to the global video card memory, which is usually 20X to 30X slower. Worse, memory bandwidth on Radeon graphics cards is 2X to 3X lower than on nVidia’s. This is not a handicap for games, where Radeons are really great graphics cards, but it is for GPGPU and OpenCL.

The result of lacking shared memory and having slow graphics card memory: a Radeon 4870 (around $200 street) cannot compete with the GeForce 9400M IGP (found in the Mac Mini, MacBook Air, MacBook…), and a GeForce 9400M iMac will beat any ATI Radeon 4850 iMac when it’s time to compare OpenCL performance! :-(

CPU vs CUDA GPU memory bandwidth

Wednesday, July 15th, 2009

What is the memory bandwidth of a modern CPU versus that of a CUDA-enabled GPU?

I knew GPU memory bandwidth was huge, but I thought that the bandwidth of a CPU’s L1 cache could actually be better than that of the current CUDA architecture.

With all the horsepower delivered by CUDA GPUs, up to 10X the gigaflops on a GTX compared with current Core i7/Nehalem processors, we need to be able to feed them with data and unload results to memory (global video card memory or the computer’s main memory) as fast as possible.

I found an interesting article that benchmarked the cache and memory bandwidth of an overclocked Core i7, in triple-channel with fast DDR3: the L1 cache peaks around 50GB/s reading or writing, but it can do both at once, peaking at 100GB/s, while main memory (triple-channel DDR3) was limited to 16 GB/s. That’s astonishing anyway: the L1 cache of a 3-year-old Athlon X2 3800+ (2×2GHz) doesn’t deliver more than today’s main memory!!!

The CUDA counterpart of a CPU’s L1 cache (32KB) is Shared Memory (16KB per group of 8 Scalar Processors), and it delivers around 50GB/s too, a strangely similar value.

The counterpart of the computer’s main memory is Global Memory, which delivers between 100GB/s and 150GB/s, nearly 8X the computer’s main memory bandwidth, thanks to multiple 64-bit interfaces (8 instead of 3) and higher clock speeds.
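As a quick sanity check on the "nearly 8X" figure, here is a minimal sketch that only divides the bandwidth numbers quoted above. The constant and function names are mine, chosen for illustration; the "midpoint" is simply the middle of the quoted 100GB/s to 150GB/s range, not a measured value.

```python
# Bandwidth figures quoted in the post, in GB/s.
CPU_MAIN_MEMORY_GBS = 16.0              # triple-channel DDR3, as benchmarked
GPU_GLOBAL_MEMORY_GBS = (100.0, 150.0)  # low and high end of the quoted range


def speedup(gpu_gbs, cpu_gbs):
    """Return how many times higher the GPU bandwidth is than the CPU one."""
    return gpu_gbs / cpu_gbs


low = speedup(GPU_GLOBAL_MEMORY_GBS[0], CPU_MAIN_MEMORY_GBS)
high = speedup(GPU_GLOBAL_MEMORY_GBS[1], CPU_MAIN_MEMORY_GBS)
# Middle of the quoted range, used only as a rough single-number summary.
midpoint = speedup(sum(GPU_GLOBAL_MEMORY_GBS) / 2, CPU_MAIN_MEMORY_GBS)

print(f"{low:.2f}X to {high:.2f}X, midpoint {midpoint:.2f}X")
```

The range works out to a bit over 6X at the low end and a bit over 9X at the high end, so "nearly 8X" sits close to the middle.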

But when you compare shared memory and L1-cache access speeds, you have to remember that there are 4 cores on a Core i7, each with its own dedicated L1 cache, so the aggregate peaks at 200GB/s to 400GB/s depending on the task.

On the other side, with 30 groups of 8 Scalar Processors, the Shared Memory of a CUDA GTX 285 may deliver 1500 GB/s, around 4X the aggregated L1-cache bandwidth of an overclocked Core i7!
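The aggregate figures are just per-unit bandwidth multiplied by the number of units. A small sketch of that arithmetic, using only the numbers from this post (the names are hypothetical, and it assumes every core or group can hit its peak at the same time, which is the best case):

```python
# Per-unit figures quoted in the post, in GB/s.
L1_PER_CORE_GBS = (50.0, 100.0)  # read-or-write only vs. both at once
CPU_CORES = 4                    # Core i7
SHARED_PER_GROUP_GBS = 50.0      # one group of 8 Scalar Processors
GPU_GROUPS = 30                  # GTX 285


def aggregate(per_unit_gbs, units):
    """Best-case total bandwidth if all units peak simultaneously."""
    return per_unit_gbs * units


cpu_l1_low = aggregate(L1_PER_CORE_GBS[0], CPU_CORES)
cpu_l1_high = aggregate(L1_PER_CORE_GBS[1], CPU_CORES)
gpu_shared = aggregate(SHARED_PER_GROUP_GBS, GPU_GROUPS)

# Compare the GPU total against the CPU's best case.
ratio = gpu_shared / cpu_l1_high

print(f"CPU L1: {cpu_l1_low:.0f} to {cpu_l1_high:.0f} GB/s")
print(f"GPU shared memory: {gpu_shared:.0f} GB/s ({ratio:.2f}X the CPU best case)")
```

Against the CPU's read-or-write-only figure the gap would be twice as large, so "around 4X" is the conservative comparison.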

To sum up, a CUDA-enabled GPU offers up to 8X the main memory bandwidth and 4X the L1-cache bandwidth of a modern CPU, and it shows!