常见的hpc操作
获取分区和节点信息
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up 1-00:00:00 1 alloc compute29
chen up infinite 2 mix chen[3-4]
chen up infinite 1 alloc chen1
chen up infinite 2 idle chen[2,5]
argon up infinite 1 mix argon9
argon up infinite 1 alloc argon3
argon up infinite 21 idle argon[1-2,4-8,10-22],argpu1
parallel up 20-00:00:0 2 inval compute[44,48]
parallel up 20-00:00:0 2 mix compute[25,28]
parallel up 20-00:00:0 6 alloc compute[29-30,32,37-38,50]
parallel up 20-00:00:0 19 idle compute[24,26-27,31,33-36,39-43,45-47,49,51-52]
gpu up 10-00:00:0 1 mix gpu7
gpu up 10-00:00:0 1 idle gpu6
chem up infinite 2 alloc gpu[8-9]
sci up infinite 1 mix gpu10
sci up infinite 1 alloc gpu11
felix up infinite 3 idle felix[1-3]
aquila up 7-00:00:00 12 mix agpu[2,4-5,7-9],aquila[1-6]
aquila up 7-00:00:00 3 idle agpu[1,3,6]
chemcourses up 7-00:00:00 1 down* gpu5
chemcourses up 7-00:00:00 1 alloc gpu4
hetao up infinite 1 idle hetao1
bioclass up 2-00:00:00 2 idle compute27,hetao1
获取节点的信息
注意到上面有个gpu5的节点显示为down,可以利用下面的命令查看下到底怎么了
scontrol show node gpu5
NodeName=gpu5 CoresPerSocket=6
CPUAlloc=0 CPUEfctv=12 CPUTot=12 CPULoad=N/A
AvailableFeatures=gpu,2643v4,1080TI
ActiveFeatures=gpu,2643v4,1080TI
Gres=gpu:10
NodeAddr=10.214.97.44 NodeHostName=gpu5
RealMemory=386000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=DOWN+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=0 Weight=300 Owner=N/A MCS_label=N/A
Partitions=chemcourses
BootTime=None SlurmdStartTime=None
LastBusyTime=2023-04-17T19:07:50
CfgTRES=cpu=12,mem=386000M,billing=12,gres/gpu=10
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm@2023-02-07T13:54:48]
显示原因是没有反应。
这个时候,可以尝试登录到gpu5这个节点,然后检查下slurmd服务是否正常。 有必要的话,重启该服务。 systemctl restart slurmd
获取硬件信息
sinfo -Nel
这里的S:C:T分别表示socket,cores,threads。 以agpu1为例,它显示2:10:1,表示它有2个CPU,每个CPU有10个核心,每个核心都是单线程的。
Tue Apr 18 14:21:15 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
agpu1 1 aquila idle 20 2:10:1 128490 0 300 gpu,2630 none
agpu2 1 aquila mixed 20 2:10:1 257500 0 300 gpu,2630 none
agpu3 1 aquila idle 20 2:10:1 257500 0 300 gpu,2630 none
agpu4 1 aquila mixed 40 2:20:1 385000 0 400 gpu,g623 none
agpu5 1 aquila mixed 40 2:20:1 385000 0 400 gpu,g623 none
agpu6 1 aquila idle 40 2:20:1 191000 0 400 gpu,g623 none
agpu7 1 aquila mixed 64 2:16:2 385000 0 400 gpu,g622 none
agpu8 1 aquila mixed 32 2:16:1 385000 0 400 gpu,g622 none
agpu9 1 aquila mixed 112 2:28:2 510000 0 400 gpu,g633 none
aquila1 1 aquila mixed 40 4:10:1 206380 0 200 cpu,4820 none
aquila2 1 aquila mixed 28 2:14:1 256550 0 100 cpu,2680 none
aquila3 1 aquila mixed 28 2:14:1 256550 0 100 cpu,2680 none
aquila4 1 aquila mixed 28 2:14:1 256550 0 100 cpu,2680 none
aquila5 1 aquila mixed 40 2:20:1 286550 0 100 cpu,g614 none
aquila6 1 aquila mixed 40 2:20:1 191000 0 100 cpu,g614 none
argon1 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon2 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon3 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon4 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon5 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon6 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon7 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon8 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon9 1 argon mixed 24 2:12:1 191000 0 100 cpu,g614 none
argon10 1 argon idle 24 2:12:1 191000 0 100 cpu,g614 none
argon11 1 argon idle 24 2:12:1 191000 0 100 cpu,g614 none
argon12 1 argon idle 24 2:12:1 191000 0 100 cpu,g614 none
argon13 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon14 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon15 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon16 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon17 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon18 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon19 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon20 1 argon idle 28 2:14:1 191000 0 400 cpu,g613 none
argon21 1 argon idle 96 4:24:1 384936 0 400 cpu,9242 none
argon22 1 argon idle 96 4:24:1 384936 0 400 cpu,9242 none
argpu1 1 argon idle 40 2:20:1 239000 0 400 gpu,s431 none
chen1 1 chen idle 40 2:20:1 384963 0 200 cpu,g614 none
chen2 1 chen allocated 80 4:20:1 772000 0 1000 cpu,g614 none
chen3 1 chen mixed 96 4:24:1 772000 0 100 cpu,g633 none
chen4 1 chen mixed 96 4:24:1 772000 0 400 gpu,g633 none
chen5 1 chen allocated 96 4:24:1 772000 0 400 gpu,g633 none
compute24 1 parallel idle 24 2:12:1 257500 0 10 cpu,2650 none
compute25 1 parallel mixed 24 2:12:1 257500 0 10 cpu,2650 none
compute26 1 parallel idle 24 2:12:1 257500 0 10 cpu,2650 none
compute27 1 bioclass idle 24 2:12:1 257500 0 10 cpu,2650 none
compute27 1 parallel idle 24 2:12:1 257500 0 10 cpu,2650 none
compute28 1 parallel mixed 40 2:20:1 191000 0 100 cpu,g614 none
compute29 1 debug* allocated 40 2:20:1 385000 0 100 cpu,g614 none
compute29 1 parallel allocated 40 2:20:1 385000 0 100 cpu,g614 none
compute30 1 parallel allocated 40 2:20:1 385000 0 100 cpu,g614 none
compute31 1 parallel allocated 40 2:20:1 385000 0 100 cpu,g614 none
compute32 1 parallel allocated 40 2:20:1 385000 0 100 cpu,g614 none
compute33 1 parallel idle 28 2:14:1 191000 0 400 cpu,g613 none
compute34 1 parallel idle 28 2:14:1 191000 0 400 cpu,g613 none
compute35 1 parallel idle 28 2:14:1 191000 0 400 cpu,g613 none
compute36 1 parallel idle 28 2:14:1 191000 0 400 cpu,g613 none
compute37 1 parallel allocated 40 2:20:1 770600 0 400 cpu,g624 none
compute38 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute39 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute40 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute41 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute42 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute43 1 parallel idle 40 2:20:1 708676 0 400 cpu,g624 none
compute44 1 parallel inval 40 2:20:1 770600 0 400 cpu,g624 Low RealMemory (repo
compute45 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute46 1 parallel allocated 40 2:20:1 770600 0 400 cpu,g624 none
compute47 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute48 1 parallel inval 40 2:20:1 770600 0 400 cpu,g624 Not responding
compute49 1 parallel idle 40 2:20:1 770600 0 400 cpu,g624 none
compute50 1 parallel allocated 40 2:20:1 770600 0 400 cpu,g624 none
compute51 1 parallel idle 32 2:16:1 384936 0 400 cpu,6226 none
compute52 1 parallel idle 32 2:16:1 384936 0 400 cpu,6226 none
felix1 1 felix idle 96 4:24:1 770600 0 400 cpu,6330 none
felix2 1 felix idle 96 4:24:1 770600 0 400 cpu,6330 none
felix3 1 felix idle 96 4:24:1 770600 0 400 cpu,6330 none
gpu4 1 chemcourses allocated 12 2:6:1 386000 0 300 gpu,2643 none
gpu5 1 chemcourses down* 12 2:6:1 386000 0 300 gpu,2643 Not responding
gpu6 1 gpu idle 20 2:10:1 128490 0 300 gpu,2630 none
gpu7 1 gpu mixed 40 2:20:1 770000 0 400 gpu,g623 none
gpu8 1 chem allocated 32 2:16:1 128300 0 400 gpu,g521 none
gpu9 1 chem allocated 32 2:16:1 128300 0 400 gpu,g521 none
gpu10 1 sci mixed 52 2:26:1 256500 0 400 gpu,g532 none
gpu11 1 sci mixed 56 2:28:1 510000 0 400 gpu,g633 none
hetao1 1 hetao idle 28 2:14:1 127229 0 600 cpu,2660 none
hetao1 1 bioclass idle 28 2:14:1 127229 0 600 cpu,2660 none