**SLURM**

Kalyani Gadgil

Cluster management and job scheduling system

# Introduction

* Open source, highly scalable
* Three key functions:
  1. Allocates access to resources to users for some duration of time so they can perform work
  2. Provides a framework for starting, executing, and monitoring work on the set of allocated nodes
  3. Arbitrates contention for resources by managing a queue of pending work

# Slurm components

- `slurmctld` is the centralized manager
- Each node has a `slurmd` daemon, which is like a remote shell: it waits for work, executes work, returns status, and waits for more work
- An administrative tool called `sacctmgr` is used to manage the database, e.g. to identify clusters and valid users.

![Slurm Components](slurm_components.png)

# Slurm entities

- compute resources are nodes
- partitions group nodes into logical sets
- jobs, or allocations of resources assigned to a user for a specified amount of time
- job steps are sets of tasks within a job

![Slurm entities](slurm_entities.png)

# Super Quick Start

* Install MUNGE, an authentication service for creating and validating credentials.
* `tar --bzip -x -f slurm.bz2`
* `./configure --prefix= --sysconfdir=` in directory
* `make` then `make install`
* Build a configuration file using your favorite web browser and `doc/html/configurator.html`
* `ldconfig -n ` so that the Slurm libraries can be found by applications that intend to use Slurm APIs directly
* Install the configuration file in /slurm.conf on ALL NODES OF CLUSTER

# Daemons

- `slurmctld` orchestrates Slurm

# Authentication

* All communications between Slurm components are authenticated
* Currently only MUNGE is supported, which requires installing the MUNGE package
* When using MUNGE, all nodes in the cluster must be configured with the same munge.key file.

# Compute Nodes

* 1 config file
* It contains the node definitions: these are the nodes, these are the IP addresses, memory, GPUs.
* Questions: how do we make these partitions when there are multiple radios on 1 network interface?
* Are VMs resources to Slurm? We don't want a static list of radios per node.
* At what level do we define resources?

# Features

* QTR has 4 radios
* QTD
* trackable resources

# Interesting things

* shared node policy
* cgroups - scheduled only for users /proc/sys/cgroup
* sessions? - I'm a user, I'm in a group, the system knows I'm in a group, and a job is running on such-and-such radios. Can other members of the group join/control/view?

# Questions

* What are resources? Radios or VMs? VMs
* How to club compute nodes together
* ESX servers physically have 4 Ethernet ports, which are then made available to VMs. Radios are available via fast switches, not physically connected.
* The prolog script edits the firewall script, and then go
* Pool of VMs. Min and max.

# NOTES

>"SLURM questions:
>how are jobs set up
>will slurm accept ip addr of radios and how will it group them
>what interface will it need?"

* 9 HPC computers. ESXi VMware.
* 2 things, scheduler and resource manager, converged with Slurm.
* 24 ports out of 96 are for radios. Each computer has 4 network interfaces and 3 for radios.
* Difficulty locking down resources?
* Dynamic shared object modules, similar to GNU Radio.
* Prolog and epilog scripts: ask for a resource allocation; once the resource is given, a shell script can be run, which can be dynamic. Can work with MQTT, but does this compromise security?
* Prolog and epilog scripts are system scripts. Leftover code is removed by the epilog.
* What are resources? Features can be written and extended: write rules, env var or something in prolog/epilog; these features are then requested.
* Slurm needs to be installed on each VM. `slurmd` is needed for the client, `slurmctld` on a specialized controller.
* 2 radios, 1 VM, 1 network interface: the routing table on the computer says, for a given destination network, use this kind of interface.
* In cascades, a node has multiple network interfaces, Ethernet and InfiniBand. Then you can write an IP address per network interface.

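
The notes mention a prolog script that edits firewall rules once resources are allocated. A minimal dry-run sketch of that idea follows. It only prints the rules it would apply. The `radio_rules` helper, the `RADIO_IPS` variable, the placeholder addresses, and the shape of the `iptables` rule are all assumptions for illustration, not something taken from the cluster setup; `SLURM_JOB_USER` is one of the variables Slurm sets in the prolog environment.

```shell
#!/bin/sh
# Hypothetical Slurm prolog sketch (dry run): prints the firewall rules
# it would apply instead of applying them.
# RADIO_IPS and the iptables rule shape are assumptions, not site config.

# Print one iptables command per radio address for the given user.
radio_rules() {
    user="$1"
    shift
    for ip in "$@"; do
        echo "iptables -A OUTPUT -d $ip -m owner --uid-owner $user -j ACCEPT"
    done
}

# Slurm sets SLURM_JOB_USER for the prolog; fall back to a dummy user
# so the sketch can be run by hand outside of Slurm.
job_user="${SLURM_JOB_USER:-nobody}"

# Placeholder radio addresses; a real site would look these up per job.
RADIO_IPS="${RADIO_IPS:-192.0.2.10 192.0.2.11}"

# Word splitting of RADIO_IPS is intentional: one argument per address.
radio_rules "$job_user" $RADIO_IPS
```

A matching epilog would print (or run) the corresponding `iptables -D` commands so that the access is removed when the job ends, in line with the note that leftover state is cleaned up by the epilog.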
[ESXi VMware hypervisor](https://www.vmware.com/products/esxi-and-esx.html)
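
The single config file described under the nodes section could look roughly like the fragment below. This is a sketch under assumptions: the cluster name, host names, addresses, CPU and memory sizes, and the choice to model radios as a generic resource (GRES) named `radio` are all placeholders, and the notes leave open whether radios or VMs should be the resource. The keywords themselves (`NodeName`, `NodeAddr`, `Gres`, `PartitionName`, and so on) are standard `slurm.conf` parameters.

```
# Hypothetical slurm.conf fragment; every name, address, and size is a placeholder
ClusterName=radiolab
SlurmctldHost=controller
AuthType=auth/munge
GresTypes=radio

# One line per VM: name, address, CPUs, memory in MB, radios attached
NodeName=vm01 NodeAddr=10.0.0.11 CPUs=4 RealMemory=8000 Gres=radio:2 State=UNKNOWN
NodeName=vm02 NodeAddr=10.0.0.12 CPUs=4 RealMemory=8000 Gres=radio:2 State=UNKNOWN

# A partition groups nodes into a logical set
PartitionName=radios Nodes=vm01,vm02 Default=YES MaxTime=01:00:00 State=UP
```

With a layout like this, a job would request radios with something like `srun --gres=radio:1 ...`, and each node would also need a `gres.conf` entry describing the `radio` resource.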