I'm posting a postmortem of my EthOS 1.3.1 10-GPU rig, which started having random lockups/crashes of the entire rig.
I hope this helps someone else. The EthOS support guys were extremely helpful.
So the rig is a 10-GPU setup with RX580 8GB cards, a mix of PowerColor and MSI. It had been running fine since February, but about 5 days ago the rig started crashing randomly. No SSH and no GUI. At first I just rebooted and it would last maybe an hour. Other times only a few minutes.
I started troubleshooting, thinking it was a PCI-E riser. It seemed like 3 were bad, so I ordered 12 risers to replace them all and have 2 spare.
It did not help. Still crashed.
Then I thought it could be the USB bus, so I moved the USB drive to a different port. Still crashed.
Then I thought the USB media had gone bad and set up an SSD. Still crashed.
I tried booting to each GPU one at a time. Still crashed.
I read somewhere that it could be the PSU, but I could not see how that could be the case here. I'm running one 1300W and one 750W.
I ended up contacting EthOS support on an unrelated issue: the system would not boot the SSD on the 1.3.1 OS image. I was told to put it in legacy mode.
Then the GUI would not load. We worked on that for a bit and then just fell back to using SSH to configure.
After we had it set up and hashing, the support tech was not pleased with my hash rate and provided instructions on BIOS modding. After that they reviewed the configuration and gave me some new settings for mem (memory overclocking) and vlt (voltage for each GPU). I made their suggested changes. The system was running better and went from 272 MH/s to 300 MH/s with the new settings. However, the system was still crashing randomly.
The tech had me lower the vlt more, to 950 or slightly below for each card. He was thinking the PSU was not able to support the power needs.
It looks like he was right. After changing the setting to 'vlt hostname 940 920 920 950 950 940 940 950 920 950', the power needs seem to have dropped to where the rig has been running for over 21 hours without a crash, and my total wattage is down to 1,169W from about 1,499W.
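To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures from this post. The "combined PSU capacity" is a simplification on my part: it just adds the two nameplate ratings, while in a real dual-PSU rig each PSU feeds its own set of cards, so one of them can be overloaded even when the total looks fine. It also assumes the wattage readings are comparable before and after.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# Figures taken from the post.
watts_before = 1499.0   # approximate total draw before lowering vlt
watts_after = 1169.0    # total draw after lowering vlt
hash_before = 272.0     # MH/s before the BIOS mod and new settings
hash_after = 300.0      # MH/s after

# Assumption: simply adding nameplate ratings; load is NOT really shared evenly.
psu_capacity = 1300.0 + 750.0

power_change = percent_change(watts_before, watts_after)
hash_change = percent_change(hash_before, hash_after)
load_before = watts_before / psu_capacity * 100.0
load_after = watts_after / psu_capacity * 100.0

print(f"Power draw change: {power_change:.1f}%")
print(f"Hash rate change:  {hash_change:.1f}%")
print(f"Share of combined PSU rating: {load_before:.0f}% -> {load_after:.0f}%")
print(f"Efficiency now: {hash_after / watts_after:.3f} MH/s per watt")
```

So the rig draws roughly a fifth less power while hashing about a tenth faster, which is consistent with the idea that the PSUs were the weak point.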
The moral of the story is that EthOS support is great and taking time out to tune your rig can have far-reaching benefits.
Here is a copy of my EthOS configuration for anyone who's interested. Your configuration may require tuning to work.
flags --cl-global-work 8192 --farm-recheck 200
vlt [hostname] 940 920 920 950 950 940 940 950 920 950
mem 80e886 2000 2100 2050 2050 2125 2125 2050 2125 2100 2050
You'd need to set your own wallet and change [hostname] in the vlt line to your rig's hostname.
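One easy mistake when editing per-GPU lines like vlt and mem is ending up with too few or too many values for the number of cards. Below is a small Python sketch that counts the values on each line. It is not an EthOS tool. The function names are made up for illustration, and it assumes each line has the form "key hostname value value ..." as in my config above.

```python
CONFIG = """\
flags --cl-global-work 8192 --farm-recheck 200
vlt [hostname] 940 920 920 950 950 940 940 950 920 950
mem 80e886 2000 2100 2050 2050 2125 2125 2050 2125 2100 2050
"""

def per_gpu_values(config_text, keys=("vlt", "mem")):
    """Return {key: [values]} for lines shaped like 'key hostname v1 v2 ...'."""
    result = {}
    for line in config_text.splitlines():
        tokens = line.split()
        # Skip lines that are not one of the per-GPU keys we care about.
        if len(tokens) < 3 or tokens[0] not in keys:
            continue
        # tokens[1] is the hostname; everything after it is one value per GPU.
        result[tokens[0]] = [int(v) for v in tokens[2:]]
    return result

def find_mismatches(config_text, gpu_count):
    """Return {key: value_count} for lines whose count differs from gpu_count."""
    values = per_gpu_values(config_text)
    return {k: len(v) for k, v in values.items() if len(v) != gpu_count}

if __name__ == "__main__":
    problems = find_mismatches(CONFIG, gpu_count=10)
    print("OK" if not problems else f"Value count mismatch: {problems}")
```

For a 10-GPU rig, both lines in my config have 10 values, so this reports no mismatches.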