WTB custom ETHMINER for R9 Fury X

Ilidanion Member Posts: 11
Anyone selling custom miner for this GPU? Please PM me with price and how much MHs is to be expected.

Comments

  • dlehenky Member Posts: 2,249 ✭✭✭✭
    I have a Nano, which has the same compute resources as the Fury X (32 int CUs). I've studied and experimented with the OpenCL code quite a bit. The GPU hardware version is simply not going to have much impact on your hashing rate. I could give you a laundry list as to why, but, frankly, I don't have the time to do so. Suffice it to say that the ethash algorithm is "very fu*cking hard", as intended, and imposes a number of roadblocks to increasing the hash rate. Not only is the inner loop hog-tied by more or less random accesses to the DAG in global memory, but the loop is bookended by SHA3s, which are notoriously resistant to GPU optimization. The current implementation is very good, as it stands. I've made a number of attempts to better it, and I always end up with the same, or less, performance. Then there's the limitations of OpenCL itself, the main one being that, believe it or not, YOU CAN'T STOP A RUNNING KERNEL. This means the global work size will dictate how long the cl_kernel runs, regardless of whether you have found a block or a new block comes in (someone else found the block). This requires that the global work size be kept small, otherwise you'll continue working on the last block long after the next block arrives. This, in turn, imposes a VERY high OpenCL runtime overhead on the host side, since it constantly has to map the kernel output buffer to check for results and requeue the cl_kernel again for the next batch (global work size). In round numbers, the cl_kernel is only actually running about 50% of the time; the other 50% is consumed by the runtime overhead mentioned above. So, even if you improved the cl_kernel performance by 100%, you'd only see a 50% improvement in your hash rate.

    Believe me, I know this is not what you want to hear, but the Ethereum team had a design objective to make the mining algorithm difficult, and virtually impossible to be implemented on an ASIC, to avoid what has happened to Bitcoin mining. The ethash algorithm, as designed, provides miners with a very level playing field that does not favor deep-pocket consortiums that can throw money at FPGAs and ASICs that blow the average Joe out of the game. So, if you want more hash rate, buy more GPUs/systems, and incur the recurring operating costs (power) that go along with that capital investment. As you can see, Ethereum has created a mining environment that favors a large number of small miners, thereby promoting the whole distributed, decentralized ethos that is fundamental to the network's security and success. I, for one, accept their vision and applaud them for their efforts.

    If anyone else out there has had other experiences with optimizing the GPU miner, I'd love to hear it. I'm only relating my own humble attempts and may well be utterly full of shit :)
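The diminishing-returns point above is easy to check with a bit of Amdahl's-law arithmetic. A minimal sketch, assuming the ~50% kernel duty cycle quoted in the post (the function name is illustrative, not from any miner's code):

```python
def overall_speedup(duty, kernel_speedup):
    """Wall-clock speedup when only the kernel fraction of runtime gets faster.

    duty: fraction of wall time the kernel is actually running (0..1).
    kernel_speedup: factor by which the kernel itself is accelerated.
    """
    # Normalized runtime before = 1.0; after, the kernel portion shrinks
    # while the host-side overhead (mapping buffers, requeueing) stays fixed.
    return 1.0 / (duty / kernel_speedup + (1.0 - duty))

# Kernel running ~50% of the time, kernel made 2x faster (a "100%" improvement):
print(round(overall_speedup(0.5, 2.0), 2))  # well under 2x overall
```

Under these assumptions a doubled kernel yields roughly a 1.33x overall rate, the same ballpark of capped gains the post describes: the fixed host-side overhead dominates once the kernel is fast.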
  • farware Member Posts: 116
    dlehenky said:

    So, if you want more hash rate, buy more GPUs/systems, and incur the recurring operating costs (power) that go along with that capital investment. As you can see, Ethereum has created a mining environment that favors a large number of small miners, thereby promoting the whole distributed, decentralized ethos that is fundamental to the network's security and success. I, for one, accept their vision and applaud them for their efforts.

    I second that. Although we do have some large mining farms, this is great for us little guys. You can build rigs and actually participate. Ignore the noise and the kids. Ethereum is here to stay.
  • Maren85 Member Posts: 41
    Well, it keeps things interesting; who knows what the future will hold? The Ethereum team is burning through the money and will soon run out and have to downsize, PoS is a year or two away and will finish off the miners, big companies are taking keen notice of Ethereum, anything could happen, exciting times...
  • Ilidanion Member Posts: 11
    dlehenky said:

    I have a Nano, which has the same compute resources as the Fury X (32 int CUs). I've studied and experimented with the OpenCL code quite a bit. The GPU hardware version is simply not going to have much impact on your hashing rate. I could give you a laundry list as to why, but, frankly, I don't have the time to do so. Suffice it to say that the ethash algorithm is "very fu*cking hard", as intended, and imposes a number of roadblocks to increasing the hash rate. Not only is the inner loop hog-tied by more or less random accesses to the DAG in global memory, but the loop is bookended by SHA3s, which are notoriously resistant to GPU optimization. The current implementation is very good, as it stands. I've made a number of attempts to better it, and I always end up with the same, or less, performance. Then there's the limitations of OpenCL itself, the main one being that, believe it or not, YOU CAN'T STOP A RUNNING KERNEL. This means the global work size will dictate how long the cl_kernel runs, regardless of whether you have found a block or a new block comes in (someone else found the block). This requires that the global work size be kept small, otherwise you'll continue working on the last block long after the next block arrives. This, in turn, imposes a VERY high OpenCL runtime overhead on the host side, since it constantly has to map the kernel output buffer to check for results and requeue the cl_kernel again for the next batch (global work size). In round numbers, the cl_kernel is only actually running about 50% of the time; the other 50% is consumed by the runtime overhead mentioned above. So, even if you improved the cl_kernel performance by 100%, you'd only see a 50% improvement in your hash rate.

    Believe me, I know this is not what you want to hear, but the Ethereum team had a design objective to make the mining algorithm difficult, and virtually impossible to be implemented on an ASIC, to avoid what has happened to Bitcoin mining. The ethash algorithm, as designed, provides miners with a very level playing field that does not favor deep-pocket consortiums that can throw money at FPGAs and ASICs that blow the average Joe out of the game. So, if you want more hash rate, buy more GPUs/systems, and incur the recurring operating costs (power) that go along with that capital investment. As you can see, Ethereum has created a mining environment that favors a large number of small miners, thereby promoting the whole distributed, decentralized ethos that is fundamental to the network's security and success. I, for one, accept their vision and applaud them for their efforts.

    If anyone else out there has had other experiences with optimizing the GPU miner, I'd love to hear it. I'm only relating my own humble attempts and may well be utterly full of shit :)

    You killed my eth boner
  • Ilidanion Member Posts: 11
    Maren85 said:

    Well, it keeps things interesting; who knows what the future will hold? The Ethereum team is burning through the money and will soon run out and have to downsize, PoS is a year or two away and will finish off the miners, big companies are taking keen notice of Ethereum, anything could happen, exciting times...

    They are not being very smart when they prevent ASICs... They should allow ASICs and the price would rise to Saturn and their funding problems would be gone for the next 1000 years, but NO...

    I should be the f*ing President of Ethereum and show how stuff gets done.
  • dlehenky Member Posts: 2,249 ✭✭✭✭
    Ilidanion said:

    Maren85 said:

    Well, it keeps things interesting; who knows what the future will hold? The Ethereum team is burning through the money and will soon run out and have to downsize, PoS is a year or two away and will finish off the miners, big companies are taking keen notice of Ethereum, anything could happen, exciting times...

    They are not being very smart when they prevent ASICs... They should allow ASICs and the price would rise to Saturn and their funding problems would be gone for the next 1000 years, but NO...

    I should be the f*ing President of Ethereum and show how stuff gets done.
    Well, first of all, ether isn't a currency, per se, it's "gas" to fuel the transaction engine. Ether, by its very nature, will be spent on a very large number of relatively small transaction fees; it is not intended to be used to directly purchase a car, for example. If you want Bitcoin, it's easy to find :) Second, perhaps you didn't read the blog entry recently about Microsoft *sponsoring* DevCon1 and attending it. Do you really think the Ethereum Foundation is going to have a funding problem going forward? I think not. It's clear that Ethereum is drawing some serious attention from large entities that see the real potential of it for practical applications. It is chump change for those entities to see to it that the core Ethereum team remains intact. Just my take on it.
  • Genoil 0xeb9310b185455f863f526dab3d245809f6854b4d Member Posts: 769 ✭✭✭
    edited November 2015
    @dlehenky I have some optimized Keccak for OpenCL loitering around, but it doesn't do that much, indeed. 1% at best. Perhaps a bit more is possible if you port parts of my CUDA code back to OpenCL, although in CUDA those improvements are largely cosmetic and drown in the dagger loop.
  • dlehenky Member Posts: 2,249 ✭✭✭✭
    Genoil said:

    @dlehenky I have some optimized Keccak for OpenCL loitering around, but it doesn't do that much, indeed. 1% at best. Perhaps a bit more is possible if you port parts of my CUDA code back to OpenCL, although in CUDA those improvements are largely cosmetic and drown in the dagger loop.

    Hey Genoil. I think I might have seen a post of yours on a different forum when I was poking around for keccak/opencl bits. Did you mention you had undone the unions in the CL code when you ported to CUDA? Anyway, on AMD OpenCL, at least, the real bear with the SHA3 is the bitselect instructions: they absolutely inhale VGPRs and thereby limit the number of wavefronts you're going to get. Can't do much about that. Perhaps the next gen cards at 14-16 nm will use dedicated registers for the bitselects or just double the number of VGPRs. Going from 28 nm, they'll have a ton of real estate to work with, for sure. AMD seems to be very aware of the upswing in the use of graphics cards for compute-only applications (read: mining). It would be great to see cards with just GPU and memory, less all the video circuitry that is obviously a complete waste, at a somewhat lower price point and power consumption.
  • Genoil 0xeb9310b185455f863f526dab3d245809f6854b4d Member Posts: 769 ✭✭✭
    edited November 2015
    @dlehenky it may have been the ccminer thread at bitcointalk. Lurking about there until sp_ finally beats me at ethminer.

    Anyway I understand what you mean with the register pressure. And apparently the AMD compiler isn't very cooperative either.

    The elimination of unions can make a small difference. By maintaining state in a single uint2[25] you can get rid of a lot of useless conversion and mov instructions.

    Then in SHA3/Keccak you can precompute about 1/2 of the instructions in the first round on the host and for each stage eliminate about 1/3 to 1/2 of the instructions in the last round.

    In the dagger stage there's some low-hanging fruit like replacing some of the modulos with &, but I got the most out of the CUDA warp shuffle. Apparently AMD hardware can do it too, but you'll have to go to GCN assembly, which can't be inlined in OpenCL easily.

    Btw the improvement you could make with immediate termination of the kernel is marginal. Say you hash at 25MH with 16384 * 64 worksize, that's roughly 25 batches / second, 40ms. If you could exit immediately, you'd be only 20ms faster on average.

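The modulo-to-& replacement Genoil mentions only works when the divisor is a power of two, which lets the compiler turn an integer division into a single AND. A minimal, GPU-agnostic sketch in Python (the function name is illustrative, not taken from ethminer):

```python
def mod_pow2(i, n):
    """Compute i % n with a bitwise AND; valid only when n is a power of two."""
    # n is a power of two exactly when it has a single bit set.
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    return i & (n - 1)

# Spot-check the identity against the plain modulo operator.
for i in (0, 1, 63, 64, 1000003):
    assert mod_pow2(i, 64) == i % 64
```

On GPUs the payoff is that an AND is a single cheap instruction, while integer division/modulo expands into a much longer sequence; the trick simply does not apply to divisors that are not powers of two (such as the DAG's full page count).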
  • dlehenky Member Posts: 2,249 ✭✭✭✭
    @Genoil Interesting. Thanks for the info! I'm an old fart, and all this is great fun for me :) The immediate termination improvement is not just about latency, but about constantly requeueing the kernel and remapping the result buffer. If you run with a global work size any bigger than the 16384 (* 64) you mentioned, you can see the delay in the logs getting to the next block, and it's a lot more than 20ms. What you'd really like to do is run the kernel for a second or two at a shot. Blocks are rarely solved in less time than that, and it avoids a ton of host-side/runtime overhead, so the kernel can just run flat out. The limitation of not being able to get the results in real-time and, with low overhead, terminate the kernel, is bigger than you think, I believe. Of course, the same thing applies to receiving a new block without ever getting a result, which is, by far, most of the time. (I realize I'm not telling you anything you don't already know.) If you profile the kernel with CodeXL, even with a larger global work size, as above, there's an awful lot of dead time, when the kernel isn't running at all. The host side spends more than 90% of its API time mapping the result buffer to check it, 99.99999%+ of the time for nothing, before queueing the next kernel. In the end, all I can say is: if the kernel isn't running, you're not hashing, and about 50% of the time, it's not running. Most of the shortcomings of OpenCL I see can be attributed to the whole mantra of portability, in other words: designing to the lowest common denominator. Portability is fine, but the spec should recognize the need for an HPC platform to be optimized for a specific hardware environment when needed. Can't wait for Vulkan; that should be quite an improvement, from what I read. Thanks for the conversation!
  • Genoil 0xeb9310b185455f863f526dab3d245809f6854b4d Member Posts: 769 ✭✭✭
    @dlehenky I'm not very familiar with CodeXL, but if it is a bit like CUDA's nvprof/NSight, it probably looks like the host side is busy mapping the result buffer, while actually it is just waiting for the GPU to complete its task. The clEnqueueNDRangeKernel call is asynchronous, so the host moves on to the call that copies back the results, but there it just stalls waiting for the kernel to complete. In the CUDA port this bit is optimized by @RoBiK in order to save CPU load, but it doesn't do much for performance.

    What I meant with the 20ms (on average) is this. If a kernel takes 40ms to run, exiting it immediately would happen somewhere between 0ms and 40ms, so 20ms on average. That's nothing on a 15 second block time.
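The batch figures in this exchange are easy to reproduce; a back-of-the-envelope sketch in Python using the numbers quoted above (25 MH/s and a 16384 * 64 global work size are the thread's example values, not measurements):

```python
# Example figures from the thread.
hash_rate = 25e6             # hashes per second
global_work = 16384 * 64     # hashes per kernel launch (one batch)

batch_time = global_work / hash_rate     # ~0.042 s per batch
batches_per_sec = 1.0 / batch_time       # ~24 batches per second

# If a new block arrives at a uniformly random point within a batch,
# an immediately-terminable kernel would save half a batch on average.
avg_saving_ms = batch_time / 2.0 * 1e3   # ~21 ms

print(f"{batches_per_sec:.1f} batches/s, {batch_time * 1e3:.0f} ms/batch, "
      f"~{avg_saving_ms:.0f} ms average saving per new block")
```

Against a ~15 s average block time, that ~21 ms saving is indeed marginal, which is Genoil's point; dlehenky's counterpoint is that the per-batch host overhead (map, check, requeue), not the exit latency, is where the real time goes.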
  • dlehenky Member Posts: 2,249 ✭✭✭✭
    @Genoil Yes, I see what you are saying about the buffer mapping. It occurred to me that that might be the case, since the time spent on the host in the buffer map call is very similar to the kernel execution time :) However, there remain the obvious and measurable breaks in the time-line trace between kernel executions. Perhaps it's simply the enqueue overhead? Still digging. I'll probably get to China before I see daylight :) Your feedback is much appreciated. The Khronos and AMD documentation sucks, as far as I'm concerned. Linux man pages provide much more detailed information than the OpenCL crap. I take it from what you said above re: buffer mapping that the buffer can't be mapped while the kernel is running? That is the type of information that is just not contained in the OpenCL documentation, at least not that I've been able to locate.
  • Genoil 0xeb9310b185455f863f526dab3d245809f6854b4d Member Posts: 769 ✭✭✭
    dlehenky said:

    I take it from what you said above re: buffer mapping that the buffer can't be mapped while the kernel is running? That is the type of information that is just not contained in the OpenCL documentation, at least not that I've been able to locate.

    @dlehenky I'm not sure about OpenCL, but I'm doing that in CUDA using CUDA streams. These basically allow overlapping of kernels and host<->device transfers.