Building my GPU rig. Still haven't settled on a graphics card just yet. It seems the "standard" GPU of choice is an R9 280X or similar, and the usual claim is that Radeon GPUs get better mining performance overall than NVIDIA-based cards. For a rough comparison, I did some googling to see how the card I'm eyeing (a Tesla M2090) stacks up against the 280X:
512 CUDA cores vs 2048 stream processors (I realize they don't directly compare)
1.3 GHz vs 1 GHz core clock
6 GB GDDR5 vs 3 GB GDDR5 memory
177 GB/s vs 288 GB/s memory bandwidth
1.85 GHz vs 6 GHz memory clock
225 W vs 250 W power consumption
It's a strange comparison: the M2090 has more memory, but it's slower memory with less bandwidth, while the core clock is higher.
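The bandwidth numbers at least check out if you do the arithmetic (this assumes both cards have a 384-bit memory bus, which I believe is the case for the M2090 and the 280X):

    #include <stdio.h>

    /* Rough sanity check of the bandwidth figures above.
       Assumes a 384-bit memory bus on both cards (my assumption). */
    int main(void) {
        double bus_bits = 384.0;
        /* Effective data rates in Gbps: the M2090's 1.85 GHz GDDR5 is
           3.7 Gbps effective, while the 280X's "6 GHz" figure is already
           the effective rate. */
        double m2090_gbps = 3.7, r9_280x_gbps = 6.0;

        printf("M2090: ~%.0f GB/s\n", m2090_gbps * bus_bits / 8.0);   /* ~178 GB/s */
        printf("280X:  ~%.0f GB/s\n", r9_280x_gbps * bus_bits / 8.0); /* ~288 GB/s */
        return 0;
    }

So the gap really does come down to memory data rate, not bus width.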
https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU#Why_are_AMD_GPUs_faster_than_Nvidia_GPUs.3F

Scrolling down, the explanation of why AMD mines faster than Nvidia is that the SHA-256 algorithm uses a lot of shift/rotate instructions. Since ethash makes use of Keccak instead, I decided to check what it mainly uses, which is XOR. There are still shift/rotate operations though, some of which cost multiple shift instructions without a native rotate. AMD fires back with a single bit-align instruction that can rotate by an arbitrary amount. Perhaps the extra 0.3 GHz could make up for this?
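To make that concrete, here's the operation in question as a tiny C sketch (this is just the textbook rotate, not code from any actual miner):

    #include <stdint.h>

    /* 64-bit left-rotate, as used throughout Keccak (for 0 < n < 64).
       Without a native rotate instruction this compiles to two shifts
       plus an OR; hardware with a one-instruction rotate (e.g. AMD's
       bit-align op) does the same work in a single operation. */
    static inline uint64_t rotl64(uint64_t x, unsigned n) {
        return (x << n) | (x >> (64 - n));
    }

So the per-rotate penalty on NVIDIA is roughly 3 instructions vs 1, which is where the "extra 0.3 GHz" question comes from.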
I'm rather unsure whether the M2090 is worth the hashes/$ at $145 on eBay, when I could spend about $20 more and get something that averages around 20 MH/s.
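Quick break-even math on that (simple arithmetic on the rough figures above, not a benchmark of anything):

    #include <stdio.h>

    /* Hashes-per-dollar break-even: at what hashrate does the $145 M2090
       match a ~$165 card doing ~20 MH/s? Prices and rates are just the
       ballpark figures from this post. */
    int main(void) {
        double alt_price = 165.0, alt_mhs = 20.0;   /* the "$20 more" option */
        double m2090_price = 145.0;
        double breakeven = alt_mhs * m2090_price / alt_price;
        printf("M2090 needs > %.1f MH/s to win on MH/s per dollar\n", breakeven);
        return 0;
    }

That works out to roughly 17.6 MH/s as the bar the M2090 would have to clear.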
...Then again, I read someone's benchmarks of their Quadro 4200, which has 448 CUDA cores @ 1.6 GHz and 4 GB of GDDR5 @ 173 GB/s, and it managed to pull around 19 MH/s. Both devices are, of course, compute capability 2.0, Fermi-based parts, so it's logical to conclude that if the 4200 can run the mining software, so can the M2090.
So... compared to that Quadro, the M2090 is slightly slower-clocked but has more cores, more memory, slightly faster memory, and lower power consumption, so it *MIGHT* prove advantageous over an R9 280X.
I could experiment and see just what it can do. The Tesla can be returned within 30 days if it doesn't work out.

What do you miners think? Deal or no deal?
Comments
I would have liked to go the Nvidia route due to the lower power use, but their higher-end cards are just too expensive.
It's all about density: if you can run six or so of the Nvidias on one motherboard and use less power, that would be better than running the power-hungry R9s for a few more MH/s.
Plus the R9s run hot and will throttle down anyway. I went the R9 route because I already had a bunch of them, and I don't want to manage 3-4 motherboards' worth of Nvidia cards when I can get the same hashrate from AMD cards on only 2 motherboards.
When I was spreadsheeting GPU efficiencies I ended up dropping all the Teslas and Quadros because their performance/cost was so low.
As for the algorithm: the Keccak/SHA-3 stages of ethash are relatively small compared to the dagger stage, and that stage is a lot of random memory accesses of tiny bits of data, surrounded by a lot of XORs and MULs. So by design the hashrate should scale with memory bandwidth.
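To give a feel for the access pattern, here's a rough C sketch of the shape of that dagger stage. The names and constants are my own placeholders, not code from any actual miner, but the pattern of pseudo-random 128-byte DAG reads glued together with multiply-and-XOR mixing is the point:

    #include <stdint.h>
    #include <stddef.h>

    #define FNV_PRIME 0x01000193u
    #define MIX_WORDS 32              /* 128-byte mix = 32 x 32-bit words */
    #define ACCESSES  64              /* main-loop iterations per hash    */

    /* FNV-style mixing: one MUL and one XOR per word. */
    static inline uint32_t fnv(uint32_t a, uint32_t b) {
        return (a * FNV_PRIME) ^ b;
    }

    /* Rough shape of the dagger stage: each iteration picks a
       pseudo-random 128-byte page of the DAG and folds it into the mix.
       'dag' and 'dag_pages' are placeholders, not a real miner's API. */
    static void dagger_loop(uint32_t mix[MIX_WORDS],
                            const uint32_t *dag, uint32_t dag_pages) {
        for (int i = 0; i < ACCESSES; i++) {
            /* the index depends on the current mix, so the reads are
               unpredictable and uncacheable in any useful way */
            uint32_t page = fnv((uint32_t)i ^ mix[0], mix[i % MIX_WORDS]) % dag_pages;
            const uint32_t *p = dag + (size_t)page * MIX_WORDS;
            for (int w = 0; w < MIX_WORDS; w++)
                mix[w] = fnv(mix[w], p[w]);  /* MULs and XORs around the loads */
        }
    }

Almost all the work is fetching those scattered pages; the ALU part is just enough to keep the reads dependent on each other.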
My guess is that AMD still wins over NVidia because of faster integer ops, since cards with similar memory bandwidth perform quite a bit worse. By comparison, on the HD7850 the OpenCL kernel was developed on, the achieved bandwidth is 81% of the theoretical maximum. On my GTX 780 with the native CUDA kernel, I only get 47%. This means the NVidia cards cannot fully hide memory latency because of slower integer ops. Maybe it can be improved by increasing thread occupancy, which is about 50%. I don't have a comparison figure for AMD.
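For reference, hashrate and DAG bandwidth convert directly into each other if you assume the usual ethash parameters of 64 accesses of 128 bytes per hash (my figures, worth double-checking):

    #include <stdio.h>

    /* Each ethash hash reads roughly 64 pages x 128 bytes = 8 KB from the
       DAG (assumed parameters), so memory bandwidth sets a hard ceiling. */
    int main(void) {
        double bytes_per_hash = 64.0 * 128.0;            /* ~8192 B */
        double peak_gbs = 288.0;                         /* 280X theoretical */
        double ceiling_mhs = peak_gbs * 1e9 / bytes_per_hash / 1e6;
        printf("bandwidth-limited ceiling: ~%.0f MH/s\n", ceiling_mhs);

        double measured_mhs = 20.0;                      /* example hashrate */
        printf("%.0f MH/s would be ~%.0f%% of theoretical bandwidth\n",
               measured_mhs, 100.0 * measured_mhs / ceiling_mhs);
        return 0;
    }

That puts the 280X's bandwidth-limited ceiling around 35 MH/s, which is why the achieved-bandwidth percentages above matter so much.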
I would like to try out ethashing on a Tesla K80 though. 100% thread occupancy should be possible there because of the doubled register file. My request to NVidia for a test drive was denied, however.
Here are some graphs from Nsight to illustrate the bandwidth dependency:
And this gives an idea of the distribution of ALU workload. XORs fall within the logical ops. Shuffle has nothing to do with shifting, by the way; it's a CUDA-specific thing I tried to take advantage of, to no avail.
There's Genoil's Cudaminer - is that not usable for Frontier?
Also, wouldn't you think that, while the 280X has considerably more memory bandwidth than the M2090, there'd be some advantage in the M2090 having twice the memory?
I'm still most curious as to why the Quadro 4200 performs so well. Perhaps it's the extra GB of memory compared to the 280X?