Rig shutting down after few hours, worked great for months

asusrig Member Posts: 141
edited February 2018 in Mining
I have a 13 GPU rig that has worked well for months now in my home office. Rig Specs:

7 x 1070 TI EVGA SC mining ZEC. CHIP OC 200 MHZ, MEMORY OC 700 MHZ
6 x RX580 SAPPHIRE NITRO+ 8GB mining ETH. BIOS mod.
2 x 1000W EVGA Gold rated PSU
2 x 750W EVGA Gold rated PSU
Asus Mining Expert MB w/ 512 GB SSD and 8 GB DDR4
Windows 10

I have two APC UPS backups connected to this rig. Each UPS is rated for up to 865W. I have the PSUs paired in two groups, with a 1000W + 750W pair powering the Nvidia cards and a 1000W + 750W pair powering the AMD cards. The Nvidia cards and system are using about 865 watts and the AMD cards are also using about 865 watts.
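
As a rough check of those numbers, here is a minimal Python sketch of the headroom on each UPS. The 865 W rating and the two ~865 W loads come from the post; the 80% target is only an assumed rule of thumb, not an APC figure, and the group names are made up for illustration.

```python
# Rough headroom check for each UPS, using the figures quoted in the post.
# The 80% target is an assumed rule of thumb, not a number from APC.

UPS_RATED_WATTS = 865          # per-UPS rating stated in the post
TARGET_LOAD_FRACTION = 0.80    # assumption: leave about 20% headroom

# Hypothetical labels for the two PSU groups described in the post.
loads_watts = {
    "nvidia_group": 865,  # 7 x 1070 Ti plus system, as reported
    "amd_group": 865,     # 6 x RX580, as reported
}

def headroom_report(load_watts, rated_watts=UPS_RATED_WATTS):
    """Return (load fraction, spare watts, watts above the assumed target)."""
    fraction = load_watts / rated_watts
    spare = rated_watts - load_watts
    over_target = max(0.0, load_watts - TARGET_LOAD_FRACTION * rated_watts)
    return fraction, spare, over_target

for name, load in loads_watts.items():
    fraction, spare, over = headroom_report(load)
    print(f"{name}: {fraction:.0%} of rating, {spare} W spare, "
          f"{over:.0f} W above the assumed 80% target")
```

With the reported loads, both groups sit at 100% of the rating with no spare watts, so any spike in draw would probably push a UPS past its limit.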

Earlier today I noticed the LED of the UPS powering the Nvidia cards come on, with a timer below it showing how many minutes were left, indicating power loss. The minutes kept increasing and I realized the cards were no longer drawing as much power. I could not RDP into the rig and Claymore Manager confirmed that the AMD cards were also offline. The fans of the AMD cards were still spinning but the fans of the Nvidia cards had stopped spinning. All case fans and CPU fans were spinning. After about a minute the rig shut down on its own.

I turned the rig back on and it was able to mine again for about 2 hours before the same process repeated. I turned it on again and this time reduced the CHIP OC of the Nvidia cards to 150 MHZ and the memory OC to 650 MHZ. However, as mentioned above, the rig was working fine for months with the higher overclocks. Reducing the values did not impact performance much with the ZEC miner, so I might do this with my other rigs to reduce stress on the cards.

Another thing I did this time was disable ECO mode on all the power supplies (stabbing in the dark, not sure if that is the culprit). There is an ECO switch on each PSU. If it is set to on, the PSU fan will not come on until the temperature of the PSU has reached, I believe, 55 C.

So far the system has been running fine for about half an hour since these changes, but only time will tell.

Anyone have an idea why the rig is displaying this behavior? I looked at System Events but really could not figure out the reason for the crash and eventual shutdown, since a monitor is not hooked up to the system. Temperatures appear to be fine in the room and the cards are not running hot.

Comments

  • cjclm7 Member Posts: 77
    just a suggestion: can you run your rig for a few days without the two APC UPS?
  • theonek Member Posts: 94
    And the other things you have to check: risers and melted PSU cables...
  • asusrig Member Posts: 141
    edited March 2018
    Thanks for the suggestions. I'll check the cables. The rig has not shut down since I made the changes above. However, today there was a new problem with this rig, which I fixed.

    I was sitting next to the rig and suddenly both UPS devices came on, as if there was a power outage. Each UPS is on a different 15A circuit, one of which is grounded (1070 Ti cards) and one of which is not (RX580 cards). This part of the house has older 15A-rated wiring, so the rig is powered from two circuits to reduce load. I have a Zero Surge device plugged into the ungrounded circuit to provide surge protection. Without it the UPS would not be able to provide surge protection, since grounding is required for this.

    https://zerosurge.com/plug-in-products-solutions/

    The rig stayed on but the fans started spinning very fast and were loud. I think it was the RX580 fans but I am not positive, because HWinfo did not report higher RPMs. Maybe it was the PSU fans. I logged on to the rig and Claymore was still mining with the RX580 cards.

    The 1070 Ti cards were also mining ZEC with the DSTM miner; however, three of the cards had failed in the miner, and when I tried to check their stats with HWinfo they were not listed. They did show up in Device Manager and ASUS Mining Manager. I checked Afterburner and those three cards would not report any fan speeds / power / memory settings, and they were unlinked from the other four cards. I rebooted the machine and they were still appearing in Device Manager, but Afterburner and HWinfo would still not recognize their sensors.

    I decided to download the latest Nvidia driver and did a clean installation, which first removed everything. After a reboot none of the Nvidia cards were listed in Device Manager. I did another clean installation and after it rebooted everything was working properly again.

    I really don't know what caused this latest incident. The UPS Backups are maxed out in terms of load since they can only provide protection for up to 865 watts each and I am using every bit of that. According to the APC monitoring software the rigs can stay powered on for 3 minutes when mining which will be enough for mini power outages of split seconds or a few seconds.
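
For the two 15A circuits mentioned in this post, the branch-circuit load can be sanity-checked with a short Python sketch. The ~865 W per circuit is the figure reported earlier in the thread; the 120 V nominal supply and the 80% continuous-load guideline are assumptions, not values given by the poster. The draw at the wall will also be somewhat higher than the UPS-reported load because of UPS losses and battery charging.

```python
# Assumptions (not from the thread): 120 V nominal supply and an 80%
# continuous-load guideline for a 15 A branch circuit.
NOMINAL_VOLTS = 120
BREAKER_AMPS = 15
CONTINUOUS_FRACTION = 0.80

def circuit_budget(load_watts, volts=NOMINAL_VOLTS, amps=BREAKER_AMPS):
    """Return (current drawn in amps, continuous limit in watts, spare watts)."""
    current = load_watts / volts
    limit_watts = volts * amps * CONTINUOUS_FRACTION
    return current, limit_watts, limit_watts - load_watts

current, limit, spare = circuit_budget(865)  # ~865 W per circuit, as reported
print(f"{current:.1f} A drawn, {limit:.0f} W continuous limit, {spare:.0f} W spare")
```

Under those assumptions each circuit carries roughly 7.2 A, well under the breaker rating, which suggests the wiring load itself is an unlikely cause as long as nothing else heavy shares the circuits.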
  • eth_shifty Member Posts: 52
    So maybe I'm just paranoid, but I wouldn't use the cheap UPS devices. If yours are not the cheap-o's, ignore my rant :) Why? They are not "cleaned" power always running off of the battery. What you have instead is a transfer switch that, although pretty fast, is not instantaneous. Basically, it's a dual-duty trickle charger and surge suppressor that will relay to the battery if a failure event occurs. Better backups will be on "clean" battery power all of the time. They are the only ones I trust, but of course, probably aren't worth it. I've just seen the cheaper backup units fry some pretty expensive stuff.

    Also, I would never run a non-grounded power supply, especially with a properly grounded one in the mix. Your ground should be a reference for all of your 0's and also a drain wire contingency. If you have different potentials to ground on two sources, you will typically find a constant flow on your ground wire, which is not good.

    Clean power is the basis for any well operating machine!
  • Ericjh801 Utah, USA Member Posts: 371 ✭✭✭
    APC is 100% not a cheap-o brand. I've been a reseller for them for many years and their products are pretty top of the line. I agree with eth_shifty, clean power is a good thing...though the APC units should keep it pretty clean for you. Which model of APC are you using?
  • Jukebox Member Posts: 640 ✭✭✭
    edited March 2018
    Too many PSUs. 2 x 750 W + 1 x 1000 W is more than enough for those 13 cards.
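
The capacity claim above can be checked with a few lines of Python. The PSU wattages are from the thread, and the ~1730 W total is simply the two reported ~865 W groups added together. That total is what the UPS units see, so the DC load on the PSUs is somewhat lower. The even split of load across PSUs is a simplifying assumption.

```python
# PSU capacity versus reported load, using wattages from the thread.
# The ~1730 W total is the two reported ~865 W groups added together.

REPORTED_LOAD_WATTS = 865 + 865

configs = {
    "current (2 x 1000 W + 2 x 750 W)": [1000, 1000, 750, 750],
    "suggested (2 x 750 W + 1 x 1000 W)": [750, 750, 1000],
}

def utilization(psu_watts, load_watts=REPORTED_LOAD_WATTS):
    """Fraction of combined PSU rating used, assuming an even split of load."""
    return load_watts / sum(psu_watts)

for label, psus in configs.items():
    print(f"{label}: {sum(psus)} W total, {utilization(psus):.0%} utilized")
```

That works out to roughly 49% utilization for the current four PSUs and roughly 69% for the suggested three, so the smaller setup would still leave some margin on paper.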
  • Ericjh801 Utah, USA Member Posts: 371 ✭✭✭
    Too many usually isn't an issue though. Too much is better than not enough. :)
  • Jukebox Member Posts: 640 ✭✭✭
    Ericjh801 said:

    Too many usually isn't an issue though. Too much is better than not enough. :)

    Simple things are more reliable.
    This thread proves that you have issues. :)
    And maybe the reason is a faulty PSU.

    Start by checking all power cable connectors where they connect to the risers and PSUs. Some of the 12V wires can be damaged by overheating. Be very careful: EVGA cables are all black, so it's often not easy to spot a melted or damaged wire at first sight. If you are using SATA-to-Molex cables, check them too. The yellow wire is the 12V rail.
  • lablett Member Posts: 333 ✭✭
    edited March 2018
    I think your issue is either the PSU or a cable not connecting well. I have just had an issue with a rig that had been running fine for months, and yesterday it started shutting down. So I replaced the PSU and she is working fine again.

    I have experienced a similar issue in the past and just removed all the cables and reattached them. In that case everything worked fine.

    A good PSU tends to shut down if there is a bad connection, which is good.

    First step, and this will be a pain, remove all the cables and reattach them.


  • asusrig Member Posts: 141
    edited March 2018

    eth_shifty said:

    So maybe I'm just paranoid, but I wouldn't use the cheap UPS devices. If yours are not the cheap-o's, ignore my rant :) Why? They are not "cleaned" power always running off of the battery. What you have instead is a transfer switch that, although pretty fast, is not instantaneous. Basically, it's a dual-duty trickle charger and surge suppressor that will relay to the battery if a failure event occurs. Better backups will be on "clean" battery power all of the time. They are the only ones I trust, but of course, probably aren't worth it. I've just seen the cheaper backup units fry some pretty expensive stuff.

    Also, I would never run a non-grounded power supply, especially with a properly grounded one in the mix. Your ground should be a reference for all of your 0's and also a drain wire contingency. If you have different potentials to ground on two sources, you will typically find a constant flow on your ground wire, which is not good.

    Clean power is the basis for any well operating machine!

    I agree that ideally all circuits would be grounded, and if it was feasible I would add a ground to the older circuit, but it would mean ripping open walls etc., and there is no basement under that part of the house either. It is slab on grade.

    So far it is working out though and the rig has not shut down again.

    For the two rigs in the basement I installed two 20A grounded circuits in surface-mounted conduit a few months ago. These have GFCI breakers (a code requirement for unfinished basements). Those have also worked fine; however, high humidity will trip the breakers, so I have dehumidifiers on the side where the rigs are.

    I have six of these UPS Backups connected to three rigs:

    http://www.apc.com/shop/us/en/products/APC-Power-Saving-Back-UPS-Pro-1500/P-BR1500G

    Two days ago the two power supplies that power the RX580 cards for one rig (on a grounded 20A circuit) in the basement were off and so were the cards. For that rig I have these two power supplies plugged into an APC 1500 VA UPS and a slightly smaller capacity Cyberpower UPS. In the lower case of the rig I have seven 1070 Ti cards powered by two more power supplies connected to another 1500 VA APC UPS. These cards were still on. So I rebooted the rig and all rigs have been working fine since.
  • asusrig Member Posts: 141
    lablett said:

    I think your issue is either the PSU or a cable not connecting well. I have just had an issue with a rig that had been running fine for months, and yesterday it started shutting down. So I replaced the PSU and she is working fine again.

    I have experienced a similar issue in the past and just removed all the cables and reattached them. In that case everything worked fine.

    A good PSU tends to shut down if there is a bad connection, which is good.

    First step, and this will be a pain, remove all the cables and reattach them.


    Not sure it is that. I have good quality PSUs and the rig has not shut down again. I am really not sure yet what the cause could have been, and there could be so many reasons. We had a lot of rain and perhaps water got into the transformer outside and the voltage dropped, causing issues in the house? Who knows?
  • mchu3599 Member Posts: 29
    asusrig said:



    Not sure it is that. I have good quality PSUs and the rig has not shut down again. I am really not sure yet what the cause could have been, and there could be so many reasons. We had a lot of rain and perhaps water got into the transformer outside and the voltage dropped, causing issues in the house? Who knows?

    You can find out easily enough: get a Kill A Watt meter and check your voltage stability at the wall.
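
If the voltage readings are noted down by hand over a day or two, a minimal Python sketch like the one below could flag the ones that look like sags or surges. The 120 V nominal value and the 5% tolerance band are assumptions, the function name is made up, and the sample readings are hypothetical values for illustration only, not measurements from this thread.

```python
NOMINAL_VOLTS = 120.0   # assumption: nominal supply voltage
TOLERANCE = 0.05        # assumption: accept readings within 5% of nominal

def out_of_band(readings, nominal=NOMINAL_VOLTS, tolerance=TOLERANCE):
    """Return (index, volts) pairs for readings outside the tolerance band."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    return [(i, v) for i, v in enumerate(readings) if not low <= v <= high]

# Hypothetical readings for illustration only, not data from the thread.
sample = [119.8, 120.4, 112.9, 121.0, 127.3]
for index, volts in out_of_band(sample):
    print(f"reading {index}: {volts} V is outside the assumed band")
```

Repeated low readings during the times the rig dropped out would point toward the supply rather than the PSUs or cables.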

  • Ericjh801 Utah, USA Member Posts: 371 ✭✭✭
    I'm pretty sure those APC units can tell him how much they are drawing.
  • asusrig Member Posts: 141
    Thanks for all the replies. I think you guys are right, it may be the UPS units. Two days ago the rig in the basement was off again, specifically the upper tower powering the RX580 cards as described above. The alarm on the smaller Cyberpower UPS (an old UPS, btw, whose battery has never been replaced) was going off and it indicated 0 V. The 1070 Ti cards were still on but I could not remote into the rig.

    I shut it off and went to the electrical panel and flipped the two GFCI breakers for the two rigs in the basement. The UPS backups for the rig that was still running kicked in; however, the one powering the RX580 cards of that rig was humming and would not stop once I flipped the breaker back on again. After a minute the entire rig also shut down, leaving both rigs in the basement off.

    Since the RX580 cards are drawing around 865 watts, which is the max of the APC units, the backups cannot really be effective in case of a small power outage. So I may add a few more backups to each rig to spread the load. A Tesla wall would be great but too expensive.

    We don't get enough power outages here to justify going crazy with batteries and generators etc.

    The rigs are all running at the moment without issues.
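
For the idea of adding more backups to spread the load, a small Python sketch can estimate how many identical UPS units a PSU group would need. The 865 W load and rating are from the thread; the target load fraction is an assumed design choice, and the function name is made up for illustration. It also assumes the load can be split evenly, whereas in practice it can only be divided per PSU.

```python
import math

def ups_units_needed(load_watts, rated_watts=865, target_fraction=0.80):
    """Smallest number of identical UPS units keeping each at or below target.

    target_fraction is an assumed headroom goal, not a manufacturer figure.
    """
    usable_watts = rated_watts * target_fraction
    return math.ceil(load_watts / usable_watts)

# ~865 W per PSU group, as reported in the thread.
print(ups_units_needed(865))
```

With an 80% target, each ~865 W group would need two such units instead of one.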