[SLURM] AI Cluster Maintenance February 24 - March 31
Colin Hudler
chudler at cs.uchicago.edu
Wed Feb 19 09:07:38 CST 2025
Hello,
CS Techstaff will perform rolling maintenance on each node in the AI Cluster starting next week. We will work on one host at a time, testing as we go, and we expect minimal disruption to job scheduling. For each host, the work involves draining it, modifying it physically, and returning it to the cluster within 30 minutes. We plan to do a few hosts per day for several weeks and then increase the frequency until completion. We expect to be done by the last week of March.
If all is well, on or around March 15 we will declare a total cluster outage for 4 hours in the early morning while we perform maintenance on the core systems (file server, database, control node).
We want to avoid disrupting research, so I would like to hear from you if our schedule poses an unacceptable risk, for example because it falls near a deadline or overlaps with work you have planned or in progress. Once this maintenance is over, the cluster network will be faster and more efficient.
Sincerely,
Colin Hudler