Posts Tagged ‘benchmark’

Debunking the 100X GPU vs. CPU Myth: an Intel Paper

Monday, June 28th, 2010

Download and read the Intel paper debunking the myth that GPUs are 100X faster than CPUs on some tasks.

This paper is interesting. It has flaws, but I truly think it is honest overall, honest to the point that Intel admits nVidia GPUs are 14X faster than Intel’s CPUs at best, and 2.5X faster on average. Depending on the selected set of tasks, these figures may be accurate or a little conservative. Either way, Intel honestly admits that its CPUs lag far behind GPUs.

They compared a $600 CPU with a $300 GPU, and the GPU was 2.5X faster on average. Depending on whether price or performance matters more to you, you can draw different conclusions…
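To make the price/performance angle concrete, here is a minimal sketch that uses only the two prices and the 2.5X average above, with the CPU as the 1.0 baseline. It assumes the rest of the machine costs the same either way, which is a simplification.

```python
# Price/performance sketch using the figures quoted above:
# a $600 CPU as the baseline (speed 1.0) and a $300 GPU at 2.5X on average.
def perf_per_dollar(relative_speed: float, price_usd: float) -> float:
    """Relative speed delivered per dollar spent."""
    return relative_speed / price_usd

cpu = perf_per_dollar(1.0, 600.0)
gpu = perf_per_dollar(2.5, 300.0)
advantage = gpu / cpu  # how much more speed per dollar the GPU delivers

print(f"GPU performance per dollar: {advantage:.1f}X the CPU's")
```

A 2.5X average speedup at half the price works out to 5X the performance per dollar.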

Upgrading an Old PC

To upgrade an old PCI-Express PC, you could add a $100-$120 graphics card to reach the performance level of a brand-new Core i7 960 PC. The Core i7 960 needs a new socket, so an old PC won’t be compatible: you will have to spend $1000 (or more!) on a new machine. Even if you resell your old PC for $300, the Intel CPU route is still 6X-7X more expensive…
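The 6X-7X figure comes from comparing the net cost of the CPU route (new PC minus resale of the old one) with the price of the graphics card alone. A minimal sketch of that arithmetic, using the prices from the paragraph:

```python
# Cost of each upgrade path, using the prices quoted in the text.
NEW_I7_PC = 1000.0           # new Core i7 960 PC (the text says $1000 or more)
OLD_PC_RESALE = 300.0        # resale value of the old PC
GPU_PRICES = (100.0, 120.0)  # graphics card price range from the text

net_cpu_cost = NEW_I7_PC - OLD_PC_RESALE  # $700 after reselling the old PC

for gpu_price in GPU_PRICES:
    ratio = net_cpu_cost / gpu_price
    print(f"${gpu_price:.0f} card: the CPU route costs {ratio:.1f}X more")
```

This prints 7.0X for the $100 card and 5.8X for the $120 card, so roughly 6X-7X.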

High-Performance PC

If you need a high-performance computer, you could choose, for example, an 8-core 2.93GHz Mac Pro (two quad-core Nehalem CPUs), which is 1.8X faster than the Core i7 960 in pure computing, for a price tag of $5899. Or, for the same average computing power, take a $500 desktop and add a $200 GeForce GTX 260, a $700 investment in total. The Intel CPU option is 8.4X more expensive.
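The same kind of sketch for the high-performance case. The equal computing power of the two systems is the text’s premise, not something the code checks:

```python
# Price ratio for the same average computing power (premise taken from the text).
MAC_PRO_8_CORE = 5899.0  # 8-core 2.93GHz Mac Pro
DESKTOP = 500.0          # basic desktop
GTX_260 = 200.0          # GeForce GTX 260 added to it

gpu_system = DESKTOP + GTX_260  # $700 total
price_ratio = MAC_PRO_8_CORE / gpu_system

print(f"GPU system: ${gpu_system:.0f}; the CPU system costs {price_ratio:.1f}X more")
```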

Desktop Supercomputing

Or compare a high-performance 8-core PC/Mac with what the same $5899 could buy you in performance: a basic $2499 Mac Pro (4×2.66GHz) upgraded with $3400 worth of GPUs and a second power supply, that is $1000 for a superb 1000W power supply, plus 3x GeForce GTX 480 (each one at least 1.5X faster than a GTX 285) and a PCI-Express expander.

On my left, 8×2.93GHz of Intel CPU power; on my right, 3x GeForce GTX 480. For the same high-end budget, you will get 5X to 6X more performance on average, and sometimes 20X to 30X if your task is GPU-oriented. Ouch!
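Here is one way to arrive at a figure like 5X-6X; it is a back-of-the-envelope sketch, not a benchmark. The assumptions are labeled in the comments: a GTX 285 is taken to be about 2.5X a Core i7 960 (the paper’s average GPU advantage), the three GTX 480s are assumed to scale perfectly, and the base Mac Pro’s own CPUs are ignored.

```python
# Back-of-the-envelope "same budget" comparison; the speed factors are assumptions.
BASE_MAC_PRO = 2499.0  # 4x2.66GHz Mac Pro
GPU_UPGRADE = 3400.0   # GPUs, 1000W power supply, PCI-Express expander
budget = BASE_MAC_PRO + GPU_UPGRADE  # same as the $5899 8-core Mac Pro

# Speeds relative to one Core i7 960:
MAC_PRO_8_CORE = 1.8     # from the text
GTX_285 = 2.5            # assumption: the paper's 2.5X average GPU advantage
GTX_480 = 1.5 * GTX_285  # text: each GTX 480 is at least 1.5X a GTX 285
N_GPUS = 3               # assumption: perfectly linear scaling over 3 cards

gpu_rig = N_GPUS * GTX_480  # the base Mac Pro's own CPUs are ignored
print(f"budget ${budget:.0f}: {gpu_rig / MAC_PRO_8_CORE:.2f}X the 8-core Mac Pro")
```

Under these assumptions the GPU rig comes out at about 6.25X the 8-core Mac Pro; any loss of scaling efficiency across the three cards pulls the figure down.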

Conclusions

Whether you want to upgrade an old PC to double its computing power or more on CUDA-enabled tasks, are considering a new high-end PC/Mac, or want the fastest desktop supercomputer, and whether price/performance or raw performance drives you, the GPU has no competition.

Intel is no longer competitive in high-end supercomputing for massively scalable tasks that GPUs handle well. Whatever your metric, nVidia GPUs are far better.

nVidia GTX 480 first OpenCL benchmarks

Saturday, March 27th, 2010

SiSoftware Sandra 2010 includes an OpenCL benchmark (Windows only).

[Figure: SiSoftware Sandra 2010 OpenCL benchmark results]

The GTX 480, based on the new Fermi architecture, posted impressive floating-point results, roughly on par with the Radeon 5970 (a dual-GPU card!) and 70% faster than the single-GPU Radeon 5870. This is the basic “MAD” (or “FMAD”, multiply-add) test, which doesn’t correspond to any real-world use of OpenCL; it is about as meaningful as the Linux kernel’s “BogoMIPS”.
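To show why a MAD score is a BogoMIPS-style number, here is a hypothetical toy version in Python; it is not Sandra’s actual test. It times a dependent chain of multiply-adds and converts the count into GFLOPS, counting each multiply-add as two floating-point operations. Nothing in it touches memory beyond a few variables, which is exactly why such a figure says little about real OpenCL workloads.

```python
import time

def mad_gflops(iterations: int = 1_000_000) -> float:
    """Toy MAD (multiply-add) throughput test: time a chain of a = a * b + c
    and report GFLOPS, counting each multiply-add as 2 floating-point ops."""
    a, b, c = 1.0, 0.999999, 0.000001  # a stays near 1.0, so no overflow
    start = time.perf_counter()
    for _ in range(iterations):
        a = a * b + c
    elapsed = time.perf_counter() - start
    return (2 * iterations) / elapsed / 1e9

print(f"{mad_gflops():.3f} GFLOPS")
```

In Python the result mostly measures interpreter overhead, a reminder that a synthetic score reflects whatever the loop happens to exercise.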

But it shows that Fermi’s potential is really there, and if its caching system, coupled with the unified memory architecture (and unified address space), works as expected, the GTX 480 and 470 will beat everything under the sun :-)

PS: Another OpenCL benchmark of the GTX 480, by AnandTech, shows it up to 10X faster than the GTX 285 (ray-tracing) and 2X to 4X faster than the Radeon 5870!

ATI Radeon 4xxx OpenCL benchmarks

Tuesday, November 10th, 2009

There are some OpenCL benchmarks out there, and on those that test real GPGPU computation rather than raw processing power on theoretical computation, the Radeon 4xxx series lags far behind nVidia’s current GPUs.

ATI Radeon 3xxx and 2xxx cards are not supported at all, whereas nVidia’s GPUs are supported back to the 2006 G80 (GeForce 8800, and any GeForce 8 series or later GPU). And the Radeon 4xxx cards simply underperform, because they lack shared memory (fast memory inside each processor core).

Lacking shared memory means that for every data access, Radeon 4xxx cards have to go to global video card memory, which is usually 20X to 30X slower; worse, memory bandwidth on Radeon graphics cards is 2X to 3X lower than on nVidia’s. This is not a handicap for games, where Radeons are really great graphics cards, but it is for GPGPU and OpenCL.
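Why shared memory matters depends on how often a value is reused. The sketch below is a hypothetical cost model, not a measurement: shared memory costs 1 unit per access and global memory 25 units (an assumed midpoint of the 20X-30X above). With shared memory, a value is staged once from global memory and then read from shared memory; without it, every read goes to global memory.

```python
def shared_memory_speedup(reuses: int, global_cost: float = 25.0) -> float:
    """Hypothetical cost model for a value read `reuses` times.
    Shared memory access = 1 unit, global memory access = `global_cost` units."""
    with_shared = global_cost + reuses * 1.0  # one staging load + shared reads
    without_shared = reuses * global_cost     # every read hits global memory
    return without_shared / with_shared

for reuses in (1, 4, 16, 64):
    print(f"{reuses:>3} reuses: speedup {shared_memory_speedup(reuses):.2f}X")
```

Data read only once gains nothing (0.96X here); the benefit grows with reuse (3.45X, 9.76X, 17.98X) and is capped by the global-to-shared cost ratio, so kernels that reuse data heavily suffer most without shared memory.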

The result of lacking shared memory and having slow graphics card memory: a Radeon 4870 (around $200 street price) cannot compete with the GeForce 9400M IGP (found in the Mac mini, MacBook Air, MacBook…), and a GeForce 9400M iMac will beat any ATI Radeon 4850 iMac when it comes to OpenCL performance! :-(

CPU vs CUDA GPU memory bandwidth

Wednesday, July 15th, 2009

What is the memory bandwidth of modern CPUs versus that of CUDA-enabled GPUs?

I knew GPU memory bandwidth was huge, but I thought that the bandwidth of a CPU’s L1 cache could actually be better than that of the current CUDA architecture.

With all the horsepower delivered by CUDA GPUs, a GTX offering up to 10X the gigaflops of current Core i7/Nehalem processors, we need to be able to feed them data and unload results to memory (global video card memory or the computer’s main memory) as fast as possible.
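The “up to 10X” figure is consistent with theoretical single-precision peaks. The sketch below multiplies execution units by clock and by flops per clock; the specific parts (GTX 285, Core i7 965) and their per-clock figures are assumptions here, not from the text, and real code rarely reaches these peaks.

```python
def peak_gflops(units: int, clock_ghz: float, flops_per_unit_per_clock: int) -> float:
    """Theoretical peak: execution units x clock (GHz) x flops per unit per clock."""
    return units * clock_ghz * flops_per_unit_per_clock

# Assumption: GTX 285 with 240 scalar processors at 1.476GHz,
# dual-issue MAD + MUL counted as 3 flops per clock.
gtx_285 = peak_gflops(240, 1.476, 3)
# Assumption: Core i7 965 at 3.2GHz, 4 cores, one 4-wide SSE add plus one
# 4-wide SSE multiply per clock = 8 single-precision flops per clock per core.
core_i7 = peak_gflops(4, 3.2, 8)

print(f"GTX 285: {gtx_285:.0f} GFLOPS, Core i7: {core_i7:.0f} GFLOPS, "
      f"ratio {gtx_285 / core_i7:.1f}X")
```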

I found an interesting article that benchmarked the cache and memory bandwidth of an overclocked Core i7 with fast triple-channel DDR3: the L1 cache peaks around 50GB/s for reading or writing, but can do both at once, peaking at 100GB/s, while main memory (triple-channel DDR3) was limited to 16GB/s. That’s astonishing anyway: the L1 cache of a 3-year-old Athlon X2 3800+ (2×2GHz) doesn’t deliver more than today’s main memory!!!

The CUDA counterpart of a CPU’s L1 cache (32KB per core) is Shared Memory (16KB per group of 8 Scalar Processors), and it too delivers around 50GB/s, a strangely similar value.

The counterpart of the computer’s main memory is Global Memory, which delivers between 100GB/s and 150GB/s, nearly 8X the computer’s main memory bandwidth, thanks to more 64-bit memory interfaces (8 instead of 3) and higher clock rates.
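The two factors, more 64-bit interfaces and higher clocks, can be separated with a simple peak-bandwidth formula: bus width in bytes times transfers per second. The memory configurations below (a GTX 285 with GDDR3 at 2484 MT/s, a Core i7 with DDR3-1066) are assumptions for illustration:

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers_per_sec: float) -> float:
    """Theoretical memory bandwidth in GB/s: bytes per transfer x transfers per second."""
    return bus_width_bits / 8 * mega_transfers_per_sec * 1e6 / 1e9

# Assumption: GTX 285 with 8 x 64-bit = 512-bit GDDR3 bus at 2484 MT/s.
gpu = peak_bandwidth_gbs(8 * 64, 2484)
# Assumption: Core i7 with triple-channel DDR3-1066, 3 x 64-bit = 192-bit bus.
cpu = peak_bandwidth_gbs(3 * 64, 1066)

print(f"GPU {gpu:.1f} GB/s, CPU {cpu:.1f} GB/s")
print(f"interfaces: {8 / 3:.2f}X, transfer rate: {2484 / 1066:.2f}X, "
      f"total: {gpu / cpu:.1f}X")
```

These are theoretical peaks; measured figures such as the 16GB/s quoted above come in lower, so ratios computed from measurements can differ from this peak-to-peak estimate.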

But when you measure shared memory or L1 cache speed, keep in mind that a Core i7 has 4 cores, each with its own dedicated L1 cache, so the aggregate peaks at 200GB/s-400GB/s depending on the task.

On the other side, with 30 groups of 8 Scalar Processors, the Shared Memory of a GTX 285 may deliver 1500GB/s in aggregate, around 4X the aggregate L1 cache bandwidth of an overclocked Core i7!
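The aggregate comparison is simply per-unit bandwidth times the number of units, using the figures from the text:

```python
# Aggregate on-chip bandwidth, using the per-unit figures from the text.
I7_CORES = 4
L1_PER_CORE_GBS = (50.0, 100.0)  # read or write alone, vs. read and write at once
GTX285_MULTIPROCESSORS = 30      # 30 groups of 8 scalar processors
SHARED_PER_MP_GBS = 50.0

cpu_low, cpu_high = (I7_CORES * bw for bw in L1_PER_CORE_GBS)
gpu_total = GTX285_MULTIPROCESSORS * SHARED_PER_MP_GBS

print(f"Core i7 aggregate L1: {cpu_low:.0f}-{cpu_high:.0f} GB/s")
print(f"GTX 285 aggregate shared memory: {gpu_total:.0f} GB/s")
print(f"GPU advantage: {gpu_total / cpu_high:.2f}X to {gpu_total / cpu_low:.1f}X")
```

The “around 4X” corresponds to the upper 400GB/s CPU figure; against the lower 200GB/s figure the gap is 7.5X.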

To sum up, compared to a modern CPU, a CUDA-enabled GPU offers up to 8X the main memory bandwidth and 4X the L1 cache bandwidth, and it shows!