CHPC OUTAGE: No Remote Access to CHPC Resources - UPDATED
Date Published: August 23, 2023
8/24/2023 at 7:30pm
The /scratch/ucgd/lustre file system is back online and is mounted on all redwood compute and interactive
nodes.
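If you use this space, a quick way to confirm the mount from a node is a standard df query (a minimal sketch; the path is as listed above):

    df -h /scratch/ucgd/lustre

If the file system is mounted, df reports its size and usage; if the mount is absent, it returns an error or reports the parent file system instead.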
In addition, rw[079,106,114,119,171,020,022,132,202] have been returned to service. The redwood nodes that remain down are rw[010,187], which have memory issues, and rw[072,134,183].
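For users who want to check the current state of these nodes themselves, here is a minimal sketch, assuming redwood is scheduled with Slurm as CHPC's other clusters are:

    sinfo --nodes=rw[010,072,134,183,187] --format="%N %T %E"

where %T prints each node's state (e.g. down, drained) and %E prints the reason recorded for its unavailability.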
8/23/2023 at 5:45pm
Redwood is back in service:
- All of the interactive nodes are up.
- /scratch/ucgd/lustre is still down. DDN Support has been engaged to work on the issues; for now, the mount of this space has been removed from all nodes.
- There are a number of compute nodes that are still being worked on:
- rw[010,106,114,119,171,187,020,022,132,202] - memory issues are being reported, so we are doing additional testing overnight
- rw[079,072,134,183] - these nodes are either not responding or have other issues that will require further work to diagnose.
8/23/2023 at 10:00am
Most, but not all, of the CHPC resources are once again accessible. The notable resources
that are not ready for use are the redwood cluster, including the interactive nodes,
and the PE (Protected Environment) /scratch/ucgd/lustre file system.
At this time CHPC staff is identifying any additional resources that are not accessible and
working to bring them back online. Once this process is complete, we will send out a notification listing the resources that
need additional work.
If you notice any other CHPC resource that is not accessible, please send a report
to helpdesk@chpc.utah.edu.
8/22/2023 at 6:45pm
At about 3pm there was a widespread disruption of campus IT services that is being
attributed to humidity issues in the datacenter. You can monitor the current status
at https://uofu.status.io/. At this time there is no estimated time for resolution.
These issues resulted in an outage of remote access to CHPC resources. The outage will continue until the campus-level event has been addressed. CHPC staff
has been actively working to identify the impact on CHPC hardware, and so far we have
identified issues with some network equipment in the PE. We are working with support
to get those addressed.
Our current view is that once the campus issues are fully resolved, the general
environment should be in good shape; bringing the PE fully back, however, depends on
addressing the issues we have found.