The new generation of developers is in the habit of counting in GB (gigabyte = 1 billion bytes) for main memory and in TB (1,000 billion bytes!) for storage space!
I began programming in the 70s, so my generation was in the habit of counting memory in bytes. We did not think much about external storage as a way to avoid memory overload; it was just a way to access data and save its state, always keeping in mind that this data should be as close to the CPU or execution unit as possible!
The first programmable computer I owned was a TI-57, the one with the LED display, 50 program steps and 8 registers (1 dedicated to decrementing loops, 1 to comparisons). There were no useless instructions, execution was really slow too, and you could not afford such luxuries as external storage or non-optimized code. A great lesson in using each resource to its fullest.
Chess programmers of the 70s had to deal with these kinds of limitations. Imagine a full chess engine in '72, on a 4-bit microcontroller (that is, a 4-bit CPU plus peripherals on one chip), with 2 KB of ROM and 80 bytes of memory (yes, 80, organized as 160 x 4 bits). David Levy and his team did that! Incredible to me!
Today, most developers of the new generation think that these limits of the past, and the know-how that old developers (like me) acquired to live with them and produce useful applications with such limited resources, are useless and should be put in a museum…
But if you look at chess on CUDA, you will discover that these limits are still there, and you'll have to cope with them. You had better not waste a single byte of storage, 'cause you may regret it:
On each SM (streaming multiprocessor), you have 8 SPs (scalar processors), which need at least 32 threads to be kept fully busy on basic instructions, and only 16 KB of shared RAM. Yes, that is 512 bytes of RAM per thread, in a world where you usually allocate megabytes to any thread just to get it started! You could use the video card's main memory, but you will be limited by its total bandwidth and will suffer scary latency. You will even have to launch more threads to hide that latency and use your GPU's processing power, ending up with memory as a total bottleneck: the more threads you launch, the less shared memory each one has, and the more each thread will use main memory. An exponential problem!
So you will have to cope with 512 bytes of memory per thread if you want to use every GPU cycle efficiently on basic instructions. And it's the same whether you consider a 2 SM/16 SP GeForce 9400M IGP or a 16 SM/128 SP GeForce 9800! The problem scales perfectly, although main memory bandwidth doesn't on high-end cards!
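To make the arithmetic above concrete, here is a minimal sketch in plain C++ (host-side only, no CUDA calls). It only restates the figures given in the text, 16 KB of shared memory per SM and 32 threads per SM; the function names are mine and purely illustrative.

```cpp
// Figure from the text: 16 KB of shared memory on each SM.
constexpr unsigned kSharedBytesPerSM = 16 * 1024;

// Shared memory left for each thread when `threadsPerSM` threads share one SM.
constexpr unsigned sharedBytesPerThread(unsigned threadsPerSM) {
    return kSharedBytesPerSM / threadsPerSM;
}

// Same ratio computed over a whole chip with `smCount` SMs: both the total
// shared memory and the total thread count grow with the number of SMs,
// so the per-thread share does not change.
constexpr unsigned chipSharedBytesPerThread(unsigned smCount, unsigned threadsPerSM) {
    return (smCount * kSharedBytesPerSM) / (smCount * threadsPerSM);
}

// 32 threads per SM gives the 512 bytes quoted in the text.
static_assert(sharedBytesPerThread(32) == 512, "16 KB / 32 threads");

// Doubling the threads to hide latency halves each thread's share.
static_assert(sharedBytesPerThread(64) == 256, "16 KB / 64 threads");

// A 2 SM part and a 16 SM part give each thread the same share.
static_assert(chipSharedBytesPerThread(2, 32) == chipSharedBytesPerThread(16, 32),
              "per-thread share is independent of the SM count");
```

The second assertion is the squeeze described above: every extra thread launched to hide main memory latency takes shared memory away from the others.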
Now be prepared to code like David Levy's team did, like Dan & Kathe Spracklen did, and like some other famous chess developers of the 70s: your resources are so limited that you may struggle just to hold the list of moves for a given position. 64 bytes for the chess board, 218 possible moves at worst, 2 bytes per move (packed): you are at 500 bytes for your thread, with just 12 bytes (three 32-bit words) left! Ouch!
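The budget above can be checked at compile time. This is only a sketch in plain C++: the sizes (64-byte board, 218 moves, 2 bytes per move, 512 bytes per thread) come from the text, but the bit layout of a packed move (6 bits for the origin square, 6 bits for the destination, 4 bits of flags) and all the names are my own assumptions, just one plausible way to fit a move in 2 bytes.

```cpp
#include <cstdint>

// Budget figures from the text.
constexpr unsigned kBudgetBytes = 512;  // shared memory per thread
constexpr unsigned kBoardBytes  = 64;   // one byte per square
constexpr unsigned kMaxMoves    = 218;  // worst-case number of moves in a position
constexpr unsigned kMoveBytes   = 2;    // size of one packed move

// ASSUMED layout (not from the text): bits 0-5 origin square,
// bits 6-11 destination square, bits 12-15 flags (promotion, capture, ...).
using PackedMove = std::uint16_t;

constexpr PackedMove packMove(unsigned from, unsigned to, unsigned flags) {
    return static_cast<PackedMove>((from & 63u) | ((to & 63u) << 6) | ((flags & 15u) << 12));
}
constexpr unsigned moveFrom(PackedMove m)  { return m & 63u; }
constexpr unsigned moveTo(PackedMove m)    { return (m >> 6) & 63u; }
constexpr unsigned moveFlags(PackedMove m) { return (m >> 12) & 15u; }

// Hypothetical per-thread working set: the board plus the full move list.
struct ThreadState {
    std::uint8_t board[kBoardBytes];
    PackedMove   moves[kMaxMoves];
};

constexpr unsigned kUsedBytes = sizeof(ThreadState);
constexpr unsigned kFreeBytes = kBudgetBytes - kUsedBytes;

static_assert(sizeof(PackedMove) == kMoveBytes, "a packed move must fit in 2 bytes");
static_assert(kUsedBytes == 500, "64 + 218 * 2 bytes");
static_assert(kFreeBytes == 12, "only three 32-bit words remain");
```

With this layout, a move from square 12 to square 28 round-trips through `packMove`, `moveFrom` and `moveTo`, and the compiler refuses to build if anyone adds a field that pushes the structure past the 512-byte budget.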
So how do we overcome these limitations? And avoid using the video card's main memory?