[SLURM] Re: Outage Thursday May 3 | Adding GPU servers!!!

Phil Kauffman via Slurm slurm at mailman.cs.uchicago.edu
Thu May 3 16:52:40 CDT 2018


I am at a point where everything should be functional again. I'm sure something will crop up in the next couple of days that will need to be addressed, so let me know if you find anything.

I will be posting more extensive documentation over on how.cs.uchicago.edu once I get some bugs ironed out.

There is one potential bug I am already aware of and will try to track down tomorrow. I'll keep this list updated.

For now here is the TL;DR.

The '--gres' option (see 'man srun') is required if you want to make use of a GPU.

  --gres=gpu:N   # where 'N' is the number of GPUs requested.
                 # Please try to limit yourself to one GPU per person.


Example when using tensorflow:

Given the file 'f', which depends on:
    pip3 install --user tensorflow-gpu
    export PATH=$HOME/.local/bin:$PATH
  <snip>
    #!/usr/bin/env python3
    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
  </snip>


  # Here we can see that no GPU was allocated to us because we did not specify the '--gres' option
  kauffman3 at bulldozer:~$ srun -p titan --pty /bin/bash
  kauffman3 at gpu3:~$ ./f 2>&1 | grep physical_device_desc
  kauffman3 at gpu3:~$
 
  # If we request only 1 GPU
  kauffman3 at bulldozer:~$ srun -p titan --pty --gres=gpu:1 /bin/bash
  kauffman3 at gpu3:~$ ./f 2>&1 | grep physical_device_desc
  physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1"

  # If we request 2 GPUs
  kauffman3 at bulldozer:~$ srun -p titan --pty --gres=gpu:2 /bin/bash
  kauffman3 at gpu3:~$ ./f 2>&1 | grep physical_device_desc
  physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1"
  physical_device_desc: "device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:1a:00.0, compute capability: 6.1"

  # If we request more GPUs than are available:
  kauffman3 at bulldozer:~$ srun -p titan --pty --gres=gpu:5 /bin/bash
  srun: error: Unable to allocate resources: Requested node configuration is not available
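
  If you do not have tensorflow installed, a lighter check is possible. This is only a sketch, and it assumes Slurm's GPU gres plugin exports CUDA_VISIBLE_DEVICES inside the job, which is the usual behaviour but has not been confirmed for this cluster. The function name 'visible_gpus' is made up for illustration.

```python
#!/usr/bin/env python3
import os


def visible_gpus(env=os.environ):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES, as strings.

    Assumption: Slurm sets CUDA_VISIBLE_DEVICES to a comma-separated list
    of device indices when '--gres=gpu:N' is given. Anything that is not
    a plain index (an empty value, or a placeholder such as 'NoDevFiles')
    is ignored.
    """
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [tok.strip() for tok in raw.split(",") if tok.strip().isdigit()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("GPUs visible to this job:", ", ".join(gpus))
    else:
        print("No GPU visible; did you pass --gres=gpu:N to srun?")
```

  Run it from inside the shell that srun gives you, the same way as './f' above. It only reports what the environment claims, so the tensorflow check remains the more thorough test.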


#GPU2
The pascal partition / gpu2 should work as per usual since it is not shared and gives you exclusive access. You will be required to specify the '--gres' option, however.



-- 
Phil Kauffman
Systems Administrator
Dept. of Computer Science
University of Chicago
kauffman at cs.uchicago.edu
773-702-3913


