Common HPC operations

Getting partition and node information

sinfo

PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*         up 1-00:00:00      1  alloc compute29
chen           up   infinite      2    mix chen[3-4]
chen           up   infinite      1  alloc chen1
chen           up   infinite      2   idle chen[2,5]
argon          up   infinite      1    mix argon9
argon          up   infinite      1  alloc argon3
argon          up   infinite     21   idle argon[1-2,4-8,10-22],argpu1
parallel       up 20-00:00:0      2  inval compute[44,48]
parallel       up 20-00:00:0      2    mix compute[25,28]
parallel       up 20-00:00:0      6  alloc compute[29-30,32,37-38,50]
parallel       up 20-00:00:0     19   idle compute[24,26-27,31,33-36,39-43,45-47,49,51-52]
gpu            up 10-00:00:0      1    mix gpu7
gpu            up 10-00:00:0      1   idle gpu6
chem           up   infinite      2  alloc gpu[8-9]
sci            up   infinite      1    mix gpu10
sci            up   infinite      1  alloc gpu11
felix          up   infinite      3   idle felix[1-3]
aquila         up 7-00:00:00     12    mix agpu[2,4-5,7-9],aquila[1-6]
aquila         up 7-00:00:00      3   idle agpu[1,3,6]
chemcourses    up 7-00:00:00      1  down* gpu5
chemcourses    up 7-00:00:00      1  alloc gpu4
hetao          up   infinite      1   idle hetao1
bioclass       up 2-00:00:00      2   idle compute27,hetao1
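
When the partition list is long, it can help to filter the default sinfo output down to the rows that need attention. The following is a minimal sketch, assuming the default six-column layout shown above; the function name unhealthy_nodes is hypothetical, not part of Slurm. It treats any state other than idle, mix, or alloc as worth a look.

```shell
# unhealthy_nodes: read default `sinfo` output on stdin and print
# "PARTITION STATE NODELIST" for every row whose state is not
# idle, mix, or alloc (for example down*, inval, drain).
# Assumption: the default column order PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
# States carrying a suffix flag (such as a trailing *) are reported too.
unhealthy_nodes() {
  awk 'NF >= 6 && $1 != "PARTITION" &&
       $5 != "idle" && $5 != "mix" && $5 != "alloc" { print $1, $5, $6 }'
}

# Usage on a cluster login node:
#   sinfo | unhealthy_nodes
```

Slurm also ships a built-in report for this: sinfo -R lists the nodes that are down or drained together with the recorded reason.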

Getting information about a node

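Notice that the gpu5 node above is shown as down (the trailing * in down* means the node is not responding). The following command shows what is wrong with it: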

scontrol show node gpu5

NodeName=gpu5 CoresPerSocket=6 
   CPUAlloc=0 CPUEfctv=12 CPUTot=12 CPULoad=N/A
   AvailableFeatures=gpu,2643v4,1080TI
   ActiveFeatures=gpu,2643v4,1080TI
   Gres=gpu:10
   NodeAddr=10.214.97.44 NodeHostName=gpu5 
   RealMemory=386000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=DOWN+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=0 Weight=300 Owner=N/A MCS_label=N/A
   Partitions=chemcourses 
   BootTime=None SlurmdStartTime=None
   LastBusyTime=2023-04-17T19:07:50
   CfgTRES=cpu=12,mem=386000M,billing=12,gres/gpu=10
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2023-02-07T13:54:48]

The reason shown is that the node is not responding.
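
The full scontrol record is verbose, and usually only the State and Reason fields matter when triaging a node. As a rough sketch (the helper name node_state_and_reason is hypothetical, and it assumes the key=value layout shown above), the two fields can be pulled out with awk:

```shell
# node_state_and_reason: read `scontrol show node <name>` output on stdin
# and print only the State and Reason values.
# Assumption: State= appears as a whitespace-separated token and
# Reason= starts its own line, as in the output above.
node_state_and_reason() {
  awk '
    {
      # State is a single token such as State=DOWN+NOT_RESPONDING
      for (i = 1; i <= NF; i++)
        if ($i ~ /^State=/) state = substr($i, 7)
    }
    # Reason may contain spaces, so keep the rest of the line
    /^ *Reason=/ { sub(/^ *Reason=/, ""); reason = $0 }
    END { print "State: " state; print "Reason: " reason }'
}

# Usage on a cluster login node:
#   scontrol show node gpu5 | node_state_and_reason
```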

At this point, try logging in to the gpu5 node and checking whether the slurmd service is running normally. If necessary, restart the service:

systemctl restart slurmd
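
The check-then-restart step can be wrapped in a small function so that slurmd is only restarted when systemd reports it as inactive. This is a sketch under assumptions: the function name ensure_slurmd is hypothetical, the service unit is assumed to be called slurmd, and it is meant to be run as root on the affected compute node.

```shell
# ensure_slurmd: restart slurmd only if systemd says it is not active.
# Assumptions: run as root on the compute node (e.g. gpu5); the unit
# name is "slurmd".
ensure_slurmd() {
  if systemctl is-active --quiet slurmd; then
    echo "slurmd is already active"
  else
    echo "slurmd is not active, restarting"
    systemctl restart slurmd
  fi
}

# Usage, after logging in to the node:
#   ensure_slurmd
#
# If the node still shows as down once slurmd is back, an administrator
# may need to return it to service explicitly, for example with
#   scontrol update NodeName=gpu5 State=RESUME
# Whether this is required depends on the site's ReturnToService setting.
```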

Getting hardware information

sinfo -Nel

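Here S:C:T stands for sockets, cores per socket, and threads per core. Taking agpu1 as an example, it shows 2:10:1, which means it has 2 CPUs (sockets), each CPU has 10 cores, and each core runs a single thread, giving the 20 CPUs reported in the CPUS column.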

Tue Apr 18 14:21:15 2023
NODELIST   NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
agpu1          1      aquila        idle 20     2:10:1 128490        0    300 gpu,2630 none                
agpu2          1      aquila       mixed 20     2:10:1 257500        0    300 gpu,2630 none                
agpu3          1      aquila        idle 20     2:10:1 257500        0    300 gpu,2630 none                
agpu4          1      aquila       mixed 40     2:20:1 385000        0    400 gpu,g623 none                
agpu5          1      aquila       mixed 40     2:20:1 385000        0    400 gpu,g623 none                
agpu6          1      aquila        idle 40     2:20:1 191000        0    400 gpu,g623 none                
agpu7          1      aquila       mixed 64     2:16:2 385000        0    400 gpu,g622 none                
agpu8          1      aquila       mixed 32     2:16:1 385000        0    400 gpu,g622 none                
agpu9          1      aquila       mixed 112    2:28:2 510000        0    400 gpu,g633 none                
aquila1        1      aquila       mixed 40     4:10:1 206380        0    200 cpu,4820 none                
aquila2        1      aquila       mixed 28     2:14:1 256550        0    100 cpu,2680 none                
aquila3        1      aquila       mixed 28     2:14:1 256550        0    100 cpu,2680 none                
aquila4        1      aquila       mixed 28     2:14:1 256550        0    100 cpu,2680 none                
aquila5        1      aquila       mixed 40     2:20:1 286550        0    100 cpu,g614 none                
aquila6        1      aquila       mixed 40     2:20:1 191000        0    100 cpu,g614 none                
argon1         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon2         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon3         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon4         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon5         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon6         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon7         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon8         1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon9         1       argon       mixed 24     2:12:1 191000        0    100 cpu,g614 none                
argon10        1       argon        idle 24     2:12:1 191000        0    100 cpu,g614 none                
argon11        1       argon        idle 24     2:12:1 191000        0    100 cpu,g614 none                
argon12        1       argon        idle 24     2:12:1 191000        0    100 cpu,g614 none                
argon13        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon14        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon15        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon16        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon17        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon18        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon19        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon20        1       argon        idle 28     2:14:1 191000        0    400 cpu,g613 none                
argon21        1       argon        idle 96     4:24:1 384936        0    400 cpu,9242 none                
argon22        1       argon        idle 96     4:24:1 384936        0    400 cpu,9242 none                
argpu1         1       argon        idle 40     2:20:1 239000        0    400 gpu,s431 none                
chen1          1        chen        idle 40     2:20:1 384963        0    200 cpu,g614 none                
chen2          1        chen   allocated 80     4:20:1 772000        0   1000 cpu,g614 none                
chen3          1        chen       mixed 96     4:24:1 772000        0    100 cpu,g633 none                
chen4          1        chen       mixed 96     4:24:1 772000        0    400 gpu,g633 none                
chen5          1        chen   allocated 96     4:24:1 772000        0    400 gpu,g633 none                
compute24      1    parallel        idle 24     2:12:1 257500        0     10 cpu,2650 none                
compute25      1    parallel       mixed 24     2:12:1 257500        0     10 cpu,2650 none                
compute26      1    parallel        idle 24     2:12:1 257500        0     10 cpu,2650 none                
compute27      1    bioclass        idle 24     2:12:1 257500        0     10 cpu,2650 none                
compute27      1    parallel        idle 24     2:12:1 257500        0     10 cpu,2650 none                
compute28      1    parallel       mixed 40     2:20:1 191000        0    100 cpu,g614 none                
compute29      1      debug*   allocated 40     2:20:1 385000        0    100 cpu,g614 none                
compute29      1    parallel   allocated 40     2:20:1 385000        0    100 cpu,g614 none                
compute30      1    parallel   allocated 40     2:20:1 385000        0    100 cpu,g614 none                
compute31      1    parallel   allocated 40     2:20:1 385000        0    100 cpu,g614 none                
compute32      1    parallel   allocated 40     2:20:1 385000        0    100 cpu,g614 none                
compute33      1    parallel        idle 28     2:14:1 191000        0    400 cpu,g613 none                
compute34      1    parallel        idle 28     2:14:1 191000        0    400 cpu,g613 none                
compute35      1    parallel        idle 28     2:14:1 191000        0    400 cpu,g613 none                
compute36      1    parallel        idle 28     2:14:1 191000        0    400 cpu,g613 none                
compute37      1    parallel   allocated 40     2:20:1 770600        0    400 cpu,g624 none                
compute38      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute39      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute40      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute41      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute42      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute43      1    parallel        idle 40     2:20:1 708676        0    400 cpu,g624 none                
compute44      1    parallel       inval 40     2:20:1 770600        0    400 cpu,g624 Low RealMemory (repo
compute45      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute46      1    parallel   allocated 40     2:20:1 770600        0    400 cpu,g624 none                
compute47      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute48      1    parallel       inval 40     2:20:1 770600        0    400 cpu,g624 Not responding      
compute49      1    parallel        idle 40     2:20:1 770600        0    400 cpu,g624 none                
compute50      1    parallel   allocated 40     2:20:1 770600        0    400 cpu,g624 none                
compute51      1    parallel        idle 32     2:16:1 384936        0    400 cpu,6226 none                
compute52      1    parallel        idle 32     2:16:1 384936        0    400 cpu,6226 none                
felix1         1       felix        idle 96     4:24:1 770600        0    400 cpu,6330 none                
felix2         1       felix        idle 96     4:24:1 770600        0    400 cpu,6330 none                
felix3         1       felix        idle 96     4:24:1 770600        0    400 cpu,6330 none                
gpu4           1 chemcourses   allocated 12      2:6:1 386000        0    300 gpu,2643 none                
gpu5           1 chemcourses       down* 12      2:6:1 386000        0    300 gpu,2643 Not responding      
gpu6           1         gpu        idle 20     2:10:1 128490        0    300 gpu,2630 none                
gpu7           1         gpu       mixed 40     2:20:1 770000        0    400 gpu,g623 none                
gpu8           1        chem   allocated 32     2:16:1 128300        0    400 gpu,g521 none                
gpu9           1        chem   allocated 32     2:16:1 128300        0    400 gpu,g521 none                
gpu10          1         sci       mixed 52     2:26:1 256500        0    400 gpu,g532 none                
gpu11          1         sci       mixed 56     2:28:1 510000        0    400 gpu,g633 none                
hetao1         1       hetao        idle 28     2:14:1 127229        0    600 cpu,2660 none                
hetao1         1    bioclass        idle 28     2:14:1 127229        0    600 cpu,2660 none 
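
The relationship between the CPUS and S:C:T columns (CPUS = sockets x cores x threads) can be checked mechanically. Below is a minimal sketch, assuming the sinfo -Nel column order shown above; the function name check_sct is hypothetical. Nodes that belong to several partitions appear on several rows, so each node is reported only once.

```shell
# check_sct: read `sinfo -Nel` output on stdin and, for each node,
# print the S:C:T product next to the reported CPUS value.
# Assumption: column 5 is CPUS and column 6 is S:C:T, as in the
# listing above. The date line and the header are skipped because
# their sixth field is not of the form N:N:N.
check_sct() {
  awk '
    $6 ~ /^[0-9]+:[0-9]+:[0-9]+$/ && !seen[$1]++ {
      split($6, sct, ":")
      expected = sct[1] * sct[2] * sct[3]
      status = ($5 == expected) ? "ok" : "MISMATCH"
      printf "%s %sx%sx%s=%d %s\n", $1, sct[1], sct[2], sct[3], expected, status
    }'
}

# Usage on a cluster login node:
#   sinfo -Nel | check_sct
```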