Random GPU Hangs (Any GPU, Any Slot, Any Time)

Rigboy786 Member Posts: 45
edited March 2018 in Mining
Hi Everyone

Firstly, I just want to say this forum is by far the most informative and helpful I've come across. I feel bad that I've recently opened a few discussions for help but haven't had the time to give any help back, so from today I will try to be more active as well. I do try to keep all my discussions updated with the latest resolution in case someone else with the same issue lands on them :smile:

Before I dive right into the problem, I want to make this thread as in-depth as possible because from searching day & night on the net I've seen a lot of people are having a similar issue and struggling to find a solution. Hopefully, this will give me the best chance at finding a solution and helping others.

My Build:
MSI Z170A Tomahawk Motherboard (Purchased New)
Intel Pentium G4400 Skylake LGA 1151 Processor (Purchased New)
4x 4GB (16GB) DDR4 DIMM RAM 2300MHz (Purchased Used)
3x R9 290 & 3x R9 290x GPUs (Purchased Used)
2x 1000W Corsair PSUs (2kW) (Purchased 1 New + 1 Used)
Currently running on Windows 10 Pro 64-Bit
Powered Risers (Latest Ver 008S)

Windows Settings:
Windows Update > Disabled
AMD Overdrive > Disabled
Windows Defender > Apps & Files > Disabled
Virtual Memory > Increased to 20,000MB (20GB)
Battery & Performance > High Performance (Never Sleep, Never Hibernate, Never Screen Off, PCIE Never Off)

BIOS Settings:
Updated to latest MSI BIOS from MSI website > Yes (Successful, no errors)
PCIE Speed> Auto
DMI Link Speed > Auto
4G Decoding / Multi GPU Mining > Enabled
Primary Boot Display > PEG
VT-d > Disabled
HD Audio Controller > Disabled
Serial Port > Disabled
Parallel Port > Disabled
Power Management > Power On after AC Power Loss > Enabled
Boot Sequence > USB then SSD

***THE PROBLEM***
I started with Simplemining.net (smOS) as my preferred OS. I could get 3 GPUs detected and start mining perfectly, but after a short time (10min-5hrs) I would see one GPU stop hashing. It can be any GPU, on any slot, at any time. I then changed all my risers to the new Ver 008S risers and managed to get 5 GPUs detected, but the problem was still there: one random GPU would hang (any GPU, any slot, any time).

When a GPU stops hashing, the others continue for a while, but after a minute or two the miner reloads and the GPU starts hashing again. Then it will hang again, or another might hang. Sometimes 2 can hang and the third might drop from 26Mh/s to something like 15...8...1Mh/s, and then the whole system reboots. Sometimes it forces a reboot straight after the first GPU hangs as well. It's very random.

I have since tried HiveOS and Perfectmining.io but all three give the same issue - I then assumed it was an Ubuntu issue as some said changing to Windows resolved it for them. I also took the time to open all my GPUs and clean them.. cleaned the board, the fan, intercooler and heatsink & I applied new thermal paste.

Yesterday I installed Windows 10 Pro, updated it and then switched off further updates. I installed the Adrenalin drivers from AMD (Latest AMDGPU Pro) and began hashing on all 5 GPUs using Claymore V10.2, but the same thing started to happen. I then downloaded Claymore V11.1 and again the same thing happened.

If I noticed that GPU2 for example would hang like 3x in a row, I would shut down; disconnect GPU2 and then reboot but then it would be another GPU that hangs, eventually I'd be left with 1 GPU connected and still it can hang randomly.

This morning I uninstalled the latest AMD drivers using DDU and then installed the 15.12 drivers, as recommended on a lot of sites for older cards. I noticed fan control conflicts between the miner, MSI Afterburner and AMD Overdrive, so I disabled AMD Overdrive and added -tt 1 to the miner config; this has resolved the fan control issues.

I tried again with Claymore V11.1 but it hung pretty much straight away with just 1 GPU connected. So I'm currently testing with Claymore V10.0, and with 1 GPU connected it has so far been running for about an hour (Dual Mining ETH + DCR): ETH = 42 shares so far and DCR = 40 shares so far, with 1 rejection on DCR and 0 on ETH. I'm thinking of letting this run overnight before considering connecting a 2nd GPU. It's currently 3:30pm here, so that would be a decent 15hr test I think, which would be the longest I've managed to get it to run lol.

Some people suggest clocking back or reflashing the GPU BIOS, but all my GPUs are showing STOCK OC on MSI. I did not OC any of them and I haven't messed with flashing any GPU BIOS. I also don't believe the previous owners would have flashed them because, from what I've read online, there is no point doing that with the R9s.



I've been having this problem for about 5-6 weeks now and I don't know what else I can try. I checked my PSUs and with 5 cards connected they are taking 550W + 740W from the wall. That's well under the 1000W capacity of each, and in any case the problem happens with even 1-2 GPUs connected, so it's not a PSU issue.
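
As a rough cross-check of the PSU figures above, the sketch below converts each wall reading into an estimated DC load and headroom. The 90% efficiency value is an assumption (the post does not state the PSUs' efficiency rating), and `psu_load` is a hypothetical helper name. Since a PSU's rating refers to its DC output, the wall reading slightly overstates the actual load.

```python
# Wall readings from the post (watts), taken with 5 cards connected.
WALL_DRAW_W = {"PSU1": 550, "PSU2": 740}
RATED_OUTPUT_W = 1000        # rated output of each PSU, per the build list
ASSUMED_EFFICIENCY = 0.90    # ASSUMPTION: not given in the post


def psu_load(wall_w, rated_w=RATED_OUTPUT_W, efficiency=ASSUMED_EFFICIENCY):
    """Estimate DC load, load percentage and headroom from a wall reading."""
    dc_w = wall_w * efficiency  # power actually delivered to the components
    return {
        "dc_w": round(dc_w, 1),
        "load_pct": round(100 * dc_w / rated_w, 1),
        "headroom_w": round(rated_w - dc_w, 1),
    }


report = {name: psu_load(watts) for name, watts in WALL_DRAW_W.items()}

for name, r in report.items():
    print(f"{name}: ~{r['dc_w']} W DC, {r['load_pct']}% of rating, "
          f"{r['headroom_w']} W headroom")
```

Under that assumption both units sit well below their rating with 5 cards connected, which is consistent with the conclusion in the post.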

Can someone please help me and guide me through how to find the problem and rectify it? :disappointed:

Comments

  • mchu3599 Member Posts: 29
    What temperatures are you seeing? Simplify everything: change Claymore to single mine instead of dual mine, and use only 1 GPU. Do not run MSI AB concurrently with Claymore; I have seen too many problems with that. Set your -tt to 60, -fanmin to 50 and -fanmax to 100 (you can fine-tune this later).
  • Rigboy786 Member Posts: 45
    mchu3599 said:

    What temperatures are you seeing? Simplify everything: change Claymore to single mine instead of dual mine, and use only 1 GPU. Do not run MSI AB concurrently with Claymore; I have seen too many problems with that. Set your -tt to 60, -fanmin to 50 and -fanmax to 100 (you can fine-tune this later).

    Temps stay around 78-85c (These cards like to run hot and are rated to 95c)
    Currently my 1 connected GPU is at 81c with fan at 93%

    This problem was happening on single-mine mode originally (I had -mode 1 in config file) but dual mining has made no difference. Also, it was happening without MSI AB so I don't think that will be connected.
  • mchu3599 Member Posts: 29
    You said this was happening with different miners/OS? If that is the case then it is no longer a SW issue. Is it possible to try your GPUs on a different M/B using a bootable miner like simplemining?
  • Rigboy786 Member Posts: 45
    mchu3599 said:

    You said this was happening with different miners/OS? If that is the case then it is no longer a SW issue. Is it possible to try your GPUs on a different M/B using a bootable miner like simplemining?

    Yep, smOS, HiveOS and pmOS essentially all using Ubuntu.. so that was different miners on the same OS. Now I'm on Windows 10 but I read online that my R9 290's are older cards and have instability issues with the latest AMD drivers and the latest claymore... it's also mentioned in the Claymore V11.1 Readme. - So for this reason I've not yet changed the mobo, instead I rolled my drivers back to the recommended 15.12 and using Claymore V10.0 right now.

    With 1 GPU Connected Stats Are:

    Eth - 66 Shares @ 25Mh/s (0 Rejected)
    Dcr - 51 Shares @ 750Mh/s (2 Rejected)
    Temp: 81c - Fan 94%

    I think the longest I've run was 7 hours, and that was with 1 GPU before the crash occurred.
  • Rigboy786 Member Posts: 45
    Update:

    I just caught my GPU trying to hang... my hashrate dropped to 6Mh/s and I heard the fans drop real low to idle... luckily I caught it on my scope graphics: the CPU usage went to near 100%, and when I opened Task Manager to check the processes... look what I found (see images)... the top 3 tasks were fluctuating and taking a considerable amount of CPU... when they stopped, everything resumed back to normal.

    - Obviously these processes are not occurring in Ubuntu... and I'm not sure how safe it would be to turn them off. Does anyone know a way to disable them? If I find a way I will update.


  • mchu3599 Member Posts: 29
    I assume you have Windows auto update and anti-virus turned off, right?
  • Rigboy786 Member Posts: 45
    Yes but Antimalware works still so I've disabled that and Windows Search Indexer... my CPU usage is now flatline zero and has been for the past 1 hour :smile: I've tried uploading new screenshot of the result and links to how to disable it in case anyone else wishes to, but my comment is in moderation queue. - Still might leave this running overnight though before I shutdown and connect a second GPU.
  • iamnoobplzhelp Member Posts: 239 ✭✭
    For me, hangs always have to do with power or clocks.

    Just for a test, try underclocking it as much as possible and overvolting (voltage + like 25mV and power limit + like 5%). Run that for 24 hours and see if that helps.
  • Jukebox Member Posts: 640 ✭✭✭
    edited March 2018
    :/
    PCIE Speed> Auto
    PCIE Speed> PCI-E Gen. 1.
  • lablett Member Posts: 333 ✭✭
    Mining with multiple GPUs can put a lot of stress on your system, as miners pause, then mine, etc. Hence I think there is a component failing somewhere. Since the card failing is random, I suspect an issue with your memory or one of the PSUs. Since you only require 8GB, can you please try different memory sticks with only 8GB? If this does not isolate the problem, then reduce the number of cards one by one. Also, can you install GPU-Z and tell us the ASIC quality of all the cards?
  • Rigboy786 Member Posts: 45
    @iamnoobplzhelp I didn't try this yet as all cards are stock, I just assumed they should work as stock cards and play up when you OC without knowledge.

    @Jukebox Sorry I forgot to mention I did test on Auto, Gen1, Gen2 and Gen3 made no difference to the situation.

    @lablett Hmm, I didn't try this yet (testing each RAM stick). I'll run tests now, removing one at a time and then refitting them in reverse order so each one is tested individually. I've not installed GPU-Z but I will do and report back, thank you.
  • Jukebox Member Posts: 640 ✭✭✭
    Rigboy786 said:


    @Jukebox Sorry I forgot to mention I did test on Auto, Gen1, Gen2 and Gen3 made no difference to the situation.

    Just use Gen1.
    99% sure that the high CPU load is because of this.
  • Rigboy786 Member Posts: 45
    Quick Update:
    I ran a successful test on 1 GPU with zero hangs for 19 hours!
    I shut down and connected a 2nd GPU, and the system has become unstable. What I have noticed is that the GPU that is now hanging is the one that ran for 19hrs and not the new one I've just connected!!! It runs fine at the start, and then after a while I see the screen blink (black screen), and when it comes back on after 1 second, one GPU is down and the only task added to Task Manager is AMD ACP Binaries. So maybe a faulty riser? Or is a RAM fault still possible with this type of failure?

    If the second GPU had a faulty riser, could that cause it to knock the 1st GPU out?
    This is so so frustrating :disappointed:


  • Jukebox Member Posts: 640 ✭✭✭
    AMD ACP = AMD Audio Co-Processor

    Turn off AMD HDMI audio devices in Device Manager - problem should disappear.
  • Rigboy786 Member Posts: 45
    Jukebox said:

    AMD ACP = AMD Audio Co-Processor

    Turn off AMD HDMI audio devices in Device Manager - problem should disappear.

    Disabled AMD ACP from Task Manager > GPU0 hangs
    Switched off MSI AB > GPU0 hangs
    Changed Riser on GPU0 > GPU0 hangs
    Changed Riser on GPU1 > GPU0 hangs
    Removed 4GB from Ram Slot 4 > GPU0 hangs
    Removed 4GB from Ram Slot 3 > GPU0 hangs
    Removed 4GB from Ram Slot 2 > GPU0 hangs
    Removed 4GB from Ram Slot 1 & Replaced with the one from Ram Slot 4 > GPU0 hangs

    Will now begin fitting back in reverse order.

    Note: My video output is coming through HDMI from GPU0 and GPU0 is the one that is hanging but that's also the one that passed a successful test of mining for 19 hours without issue + was also managing my video output then as well. It's only hanging now that I've installed a second GPU.
  • mchu3599 Member Posts: 29
    You can't just plug another GPU in and run.

    Use DDU, uninstall all video drivers, shutdown.
    Install second GPU.
    Startup, re-install drivers, restart. Verify all GPUs are in the device manager.
    Start miner, test.

    This needs to be repeated for each GPU you want to add to the system.
  • Rigboy786 Member Posts: 45
    mchu3599 said:

    You can't just plug another GPU in and run.

    Use DDU, uninstall all video drivers, shutdown.
    Install second GPU.
    Startup, re-install drivers, restart. Verify all GPUs are in the device manager.
    Start miner, test.

    This needs to be repeated for each GPU you want to add to the system.

    OMG I didn't know... all my GPU's are running on the same driver, I assumed it would assign the driver automatically??

    It's currently running well on the 2 since I removed the 4th RAM stick and replaced it, so there's just 1 RAM stick installed now from the 4 I had originally, and it has so far hashed 20 shares on ETH...

    If I have to DDU and reinstall the driver each time I test different GPUs, that is a nightmare lol
  • mchu3599 Member Posts: 29
    The specific situations where a driver re-install is absolutely necessary would take too long to explain. It is just easier to repeat the process for every new GPU install. Driver conflict is one of the most common sources of system instability.
  • Rigboy786 Member Posts: 45
    mchu3599 said:

    The specific situations where a driver re-install is absolutely necessary would take too long to explain. It is just easier to repeat the process for every new GPU install. Driver conflict is one of the most common sources of system instability.

    I should have done this before testing each GPU, I think before adding the next one I will connect all 6 and install the driver to get that bit out the way first. Thank you!

    UPDATE:
    After removing my 4th Ram stick and replacing it with the first one I took out, the system is not crashing. It's so far running 3hrs on two GPU's mining Eth + Dcr in Claymore V10.0 Dual Mode.

    I will let it continue until the morning to give it around 20hrs before confirming the current config is okay and moving on.

    One step closer to fixing this :smile: Thank you everyone!
  • Rigboy786 Member Posts: 45
    edited March 2018
    Forgot to mention:

    Although I recently replaced my risers with the latest Version 008S I noticed today one of the contacts was half cut (missing) so I opened a new pack from another supplier and behold, the quality is much better!

    I have replaced the risers, but beware of cheaper copies. Although they were priced the same, there is a clear difference in quality. The print on the cheaper one is blotchy, half a contact is missing and the soldering on the flip side is a lot messier. Also, the newer ones have less movement in the slot.

    With the new risers, my two GPUs currently on test are achieving an extra 0.5 - 1 Mh/s each, and that's been stable for the last 3.5hrs now.

    See images below, the top one is the old riser and the bottom is the newer riser. Check your risers! :smile:



  • lablett Member Posts: 333 ✭✭
    You have to watch the risers as I have binned many. I even had some that were too big!

    Always inspect for missing contacts. Let us know how you get on.
  • iamnoobplzhelp Member Posts: 239 ✭✭
    Just FYI, after BIOS modding many of my cards, I have them run below stock clocks.

    I have some PowerColor Red Dragon RX 480s with Samsung Memory flashed with UberMix 3.1. At stock clocks, lots of errors. So many of them are running 1920-1980 MHz (2000 is stock). They all produce ~28.5 Mh/s Eth + ~850 Mh/s Decred that way.
  • eth_shifty Member Posts: 52
    so it wound up being RAM? That's pretty odd.

    I would have thought it was excessive overclocking (81c is toastier than I would run anything)

    or excessive power draw on a single cable:
    even though you may have a big single rail on your psu, I have seen some people use a single pcie power cable to power three risers! Rule of thumb is if a cable is getting hot to the touch, it's going to bite you in the ***.

    In other news, that's a pretty good dual mining speed. Have you tried just mining eth and dropping the core clock speed down? Often that can save a lot of power = less heat.
  • Rigboy786 Member Posts: 45
    @iamnoobplzhelp I didn't mod my cards, but I've read online that there is no benefit from modding the R9 290 and R9 290x for mining, so I've assumed they are on stock BIOS anyway.

    @eth_shifty I'm still testing mate, currently mining with 2 GPUs on 1 RAM stick for the last 5hrs. I'm going to let it run overnight (5.30pm here now in UK), so in the morning if it's still going then I'll move on to adding the next GPU. I always use a separate cable for each riser and each card, and I never split between two PSUs... if the GPU is powered by PSU1 then I make sure the riser is also powered by PSU1.

    Right now with both cards mining in dual mode I'm doing about 550-600W at the wall, everything powered on 1 PSU. Once I'm stable with all 6 cards running, I'll look at tweaking the clocks.
  • Rigboy786 Member Posts: 45
    @mchu3599 I did set my -tt to 60, -fanmin to 50 and -fanmax to 100 in the config and did away with MSI AB for this current test, and it's been running 6hrs+ now. I will switch it off in the morning if it manages to stay on without hanging, but my fans are both running at 100% and have been since it started mining... the reason is that -tt 60 is too low for these cards. They run at 75-80c normally.

    Do you think it's safe to leave fans 100% overnight? I can then adjust the temp to -tt 75 on the next test tomorrow, is that ok you think? I'm worried my fans might pack up lol
  • lablett Member Posts: 333 ✭✭
    edited March 2018
    If you want your cards to last then run them at around 75c, but you also need to reduce your fan speed to 75%. You might also just need to increase the airflow around your cards and/or increase the distance between them. Can you please take a picture of your rig? I would not run the fans at 100%; it would be safe, but you may damage them in the long term. Mining is a long-term thing, remember this!
  • mchu3599 Member Posts: 29
    Rigboy786 said:

    @mchu3599 I did set my -tt to 60, -fanmin to 50 and -fanmax to 100 in the config and did away with MSI AB for this current test, and it's been running 6hrs+ now. I will switch it off in the morning if it manages to stay on without hanging, but my fans are both running at 100% and have been since it started mining... the reason is that -tt 60 is too low for these cards. They run at 75-80c normally.

    Do you think it's safe to leave fans 100% overnight? I can then adjust the temp to -tt 75 on the next test tomorrow, is that ok you think? I'm worried my fans might pack up lol

    If your cards are running that hot at stock, the only way to lower it is to undervolt and underclock the core, and undervolt the memory as well. If set up correctly you should be able to run the cards below 70 with 60-70% fan.
  • Rigboy786 Member Posts: 45
    @lablett I agree 75c is what I think is ideal and anything lower is a bonus. See image below, I've got it set up in my kitchen right now just so that I can correct all the stability issues and then it will be in my workshop in an unused office which is very airy. In the morning I'll definitely adjust the config to -tt 75 but do you think running it for another 12hrs will do damage at 100% fan? It's been running 10hrs straight at 100% fan speed so far... I want to run it the full 22hrs before switching off as this will confirm to me it won't fail. It would usually hang anywhere between 10min - 10hrs.

    @mchu3599 These cards always run at around 75ish and are happy to go up to 95, although I personally don't want my cards running at 95. My workshop is very cold and I will have AC set up there so the current temps are just temporary :smile:

    Since talking to you lot I've got hope again that this problem can be fixed... two days straight very stable hashing... tomorrow I will connect the third GPU :smiley:


  • Ericjh801 Utah, USA Member Posts: 370 ✭✭
    Personally I run all my cards set at 65c. My rig setup allows outside air to come in so that helps, but mine are never > 50% fans. I have 2 ASICs sitting there too, which doesn't help the inside air, but I have an exhaust system to keep air out. I'll have to post pictures sometime when I'm done. :) The lower the temp and the lower you can keep fan speeds, the more life you will get out of your cards.
  • mchu3599 Member Posts: 29
    Sounds like you are on your way to stable mining, congrats. Once you get all the cards added and mining stable, start optimizing the power and mem clock to get the best hash/watt.
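
For the hash/watt tuning mentioned above, here is a minimal sketch of the arithmetic, using figures reported earlier in the thread: roughly 25 Mh/s ETH per card and 550-600W at the wall with two cards in dual mode. It assumes the second card hashes at about the same rate as the first, and the wall figure includes the whole system plus the Decred load, so treat the result as a baseline for comparison rather than a per-card efficiency. `mhs_per_watt` is a hypothetical helper name.

```python
def mhs_per_watt(hashrate_mhs, wall_watts):
    """Return hashing efficiency in Mh/s per watt drawn at the wall."""
    if wall_watts <= 0:
        raise ValueError("wall_watts must be positive")
    return hashrate_mhs / wall_watts


CARDS = 2
ETH_MHS_PER_CARD = 25          # ASSUMPTION: both cards match the 1-GPU figure
WALL_RANGE_W = (550, 600)      # reported range at the wall, dual mode

total_mhs = CARDS * ETH_MHS_PER_CARD

for watts in WALL_RANGE_W:
    eff = mhs_per_watt(total_mhs, watts)
    print(f"{total_mhs} Mh/s at {watts} W -> {eff:.3f} Mh/s per W")
```

Re-running the same calculation after each clock or voltage change shows whether a tweak actually improved efficiency or just moved the hashrate.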