Tangent Unscheduled Downtime - Hardware Failure
Date Posted: July 11th, 2016
Tangent was restored to service on July 15th. Jobs that were idle in the batch queue before the hardware issue are now running, and users can submit new jobs.
Due to a hardware failure on the Tangent hardware, we have turned off the resource manager. As a result, any Slurm scheduler command will time out and return an “Unable to contact slurm controller (connect failure)” error. Jobs that are currently running are unaffected and will finish unless the nodes have to be rebooted to fix the problem.
Once the hardware issue has been resolved, we will restart the resource manager and Slurm.