[SLURM] [AI Cluster] emergency maintenance
Colin Hudler
chudler at cs.uchicago.edu
Thu Sep 16 09:46:40 CDT 2021
One of the GPU nodes (a003) has developed a fault in one of its 2080 cards. The faulty card could not function and has been removed; the cause is still under investigation. Until the card is restored, a003 will run with 3 GPUs. Because of the fault, two jobs were terminated and re-queued on both Tuesday and Wednesday. We do not expect further problems, but please contact us if you notice anything unusual.