[SLURM] [AI Cluster] emergency maintenance

Colin Hudler chudler at cs.uchicago.edu
Thu Sep 16 09:46:40 CDT 2021


One of the 2080 cards in GPU node a003 has developed a fault. The faulty card has been removed because it no longer functions, and the cause is still being investigated. Until the removed card is restored, a003 will run with 3 GPUs. Because of the fault, two jobs were terminated and re-queued on both Tuesday and Wednesday. We do not expect any further problems, but please contact us if you notice anything unusual.

