SLURM: srun_fastsurfer.sh

Usage

$ ./srun_fastsurfer.sh --help
Script to orchestrate resource-optimized FastSurfer runs on SLURM clusters.

Usage:
srun_fastsurfer.sh [--data <directory to search images>]
    [--sd <output directory>] [--work <work directory>]
    (--pattern <search pattern for images>|--subject_list <path to subject_list file>
                                           [--subject_list_delim <delimiter>]
                                           [--subject_list_awk_code_sid <subject_id code>]
                                           [--subject_list_awk_code_t1 <image_path code>])
    [--singularity_image <path to fastsurfer singularity image>]
    [--extra_singularity_options [(seg|surf)=]<singularity option string>] [--num_cases_per_task <number>]
    [--num_cpus_per_task <number of cpus to allocate for seg>] [--cpu_only] [--time (surf|seg)=<timelimit>]
    [--partition [(surf|seg)=]<slurm partition>] [--slurm_jobarray <jobarray specification>] [--skip_cleanup]
    [--email <email address>] [--debug] [--dry] [--help]
    [<additional fastsurfer options>]

Author:   David Kügler, david.kuegler@dzne.de
Date:     Nov 3, 2023
Version:  1.0
License:  Apache License, Version 2.0

Documentation of Options:
General options:
--dry: performs all operations, but does not actually submit jobs to slurm
--debug: Additional debug output.
--help: print this help.

Data- and subject-related options:
--sd: output directory will have N+1 subdirectories (default: $(pwd)/processed):
  - one directory per case plus
  - slurm with two subdirectories:
    - logs for slurm logs,
    - scripts for intermediate slurm scripts for debugging.
  Note: files will be copied here only after all jobs have finished, so most IO happens on
  a work directory, which can use IO-optimized cluster storage (see --work).
--work: directory with fast filesystem on cluster
  (default: $HPCWORK/fastsurfer-processing/20240701-190056)
  NOTE: THIS SCRIPT considers this directory to be owned by this script and job!
  No modifications should be made to the directory after the job is started until it is
  finished (if the job fails, cleanup of this directory may be necessary) and it should be
  empty!
--data: (root) directory to search in for t1 files (default: current work directory).
--pattern: glob string to find image files in 'data directory' (default: *.{nii,nii.gz,mgz}),
   for example --data /data/ --pattern \*/\*/mri/t1.nii.gz
   will find all images of format /data/<somefolder>/<otherfolder>/mri/t1.nii.gz
--subject_list: alternative way to define cases to process, files are of format:
  subject_id1=/path/to/t1.mgz
  ...
  This option invalidates the --pattern option.
  May also add additional parameters like:
  subject_id1=/path/to/t1.mgz --vox_size 1.0
--subject_list_delim: alternative delimiter in the file (default: "="). For example, if you
  pass --subject_list_delim "," the subject_list file is parsed as a comma-delimited csv file.
--subject_list_awk_code_sid <subject_id code>: alternative way to construct the subject_id
  from the row in the subject_list (default: '$1').
--subject_list_awk_code_args <t1_path code>: alternative way to construct the image_path and
  additional parameters from the row in the subject_list (default: '$2'), other examples:
  '$2/$1/mri/orig.mgz', where the first field (of the subject_list file) is the subject_id
  and the second field is the containing folder, e.g. the study.
  Example for additional parameters:
  --subject_list_delim "," --subject_list_awk_code_args '$2 " --vox_size " $4'
  to implement from the subject_list line
  subject-101,raw/T1w-101A.nii.gz,study-1,0.9
  to (additional arguments must be comma-separated)
  --sid subject-101 --t1 <data-path>/raw/T1w-101A.nii.gz --vox_size 0.9

FastSurfer options:
--fs_license: path to the freesurfer license (either absolute path or relative to pwd)
--seg_only: only run the segmentation pipeline
--surf_only: only run the surface pipeline (--sd must contain previous --seg_only processing)
--***: also standard FastSurfer options can be passed, like --3T, --no_cereb, etc.

Singularity-related options:
--singularity_image: Path to the singularity image to use for segmentation and surface
  reconstruction (default: $HOME/singularity-images/fastsurfer.sif).
--extra_singularity_options: Extra options for singularity, needs to be double quoted to allow quoted strings,
  e.g. --extra_singularity_options "-B /$(echo \"/path-to-weights\"):/fastsurfer/checkpoints".
  Supports two formats similar to --partition: --extra_singularity_options <option string> and
  --extra_singularity_options seg=<option string> and --extra_singularity_options surf=<option string>.

SLURM-related options:
--cpu_only: Do not request gpus for segmentation (only affects segmentation, default: request gpus).
--num_cpus_per_task: number of cpus to request for segmentation pipeline of FastSurfer (--seg_only),
  (default: 16).
--num_cases_per_task: number of cases batched into one job (slurm jobarray); cases will be
  processed in parallel jobs, if num_cases_per_task is smaller than the total number of cases
  (default: 16).
--skip_cleanup: Do not schedule step 3, cleanup (which moves the data from --work to --sd, etc.,
  default: do the cleanup).
--slurm_jobarray: a slurm-compatible list of jobs to run, this can be used to rerun segmentation cases
  that have failed, for example '--slurm_jobarray 4,7' would only run the cases associated with a
  (previous) run of srun_fastsurfer.sh, where log files '<sd>/logs/seg_*_{4,7}.log' indicate failure.
--partition: (comma-separated list of) partition(s), supports 2 formats (and their combination):
   --partition seg=<slurm partition>,<other slurm partition>: will schedule the segmentation job on
     listed slurm partitions. It is recommended to select nodes/partitions with GPUs here.
   --partition surf=<partitions>: will schedule surface reconstruction jobs on listed partitions
   --partition <slurm partition>: default partition to use, if a specific partition is not given
     (by one of the above).
  default: slurm default partition
--time: a per-subject time limit for individual steps, must be a number in minutes:
   --time seg=<timelimit>: time limit for the segmentation pipeline (per subject), default: seg=5 (5min).
   --time surf=<timelimit>: time limit for the surface reconstruction (per subject), default: surf=180 (180min).
--email: email address to send slurm status updates.

Accepts additional FastSurfer options, such as --seg_only and --surf_only, and then only performs
the respective pipeline.
This script will start three slurm jobs:
1. a segmentation job (alternatively, if --surf_only, this copies previous segmentation data from
  the subject directory (--sd))
2. a surface reconstruction job (skipped, if --seg_only)
3. a cleanup job, that moves the data from the work directory (--work) to the subject directory
  (--sd).

Jobs will be grouped into slurm job_arrays with serial segmentation and parallel surface
reconstruction (via job arrays and job steps). This way, segmentation can be scheduled on machines
with GPUs and surface reconstruction on machines without, while efficiently assigning cpus and gpus,
see --partition flag.
Note that a surface reconstruction job will request up to a total of '<num_cases_per_task> * 2'
cpus and '<num_cases_per_task> * 10G' memory per job. However, these can be distributed across
'<num_cases_per_task>' nodes in parallel job steps.

This tool requires functions in stools.sh and the brun_fastsurfer.sh scripts (expected in same
folder as this script) in addition to the fastsurfer singularity image.
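
For orientation, a pattern-based call could look like the sketch below; all paths, the partition
name "gpu", the time limit and the email address are placeholders that have to be adapted to your
cluster (surface jobs fall back to the slurm default partition in this example):

$ ./srun_fastsurfer.sh --data /data/study-1 --pattern \*/mri/t1.nii.gz \
    --sd /data/study-1/processed --work $HPCWORK/fastsurfer-processing/run-01 \
    --singularity_image $HOME/singularity-images/fastsurfer.sif \
    --fs_license $HOME/freesurfer_license.txt \
    --partition seg=gpu --time surf=300 \
    --num_cases_per_task 16 --email user@example.com --3T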

Debugging SLURM runs

  1. Did the run succeed?

    1. Check whether all jobs are done (specifically the copy job).

      $ squeue -u $USER --Format JobArrayID,Name,State,Dependency
      1750814_3           FastSurfer-Seg-kueglRUNNING             (null)
      1750815_3           FastSurfer-Surf-kuegPENDING             aftercorr:1750814_*(
      1750816             FastSurfer-Cleanup-kPENDING             afterany:1750815_*(u
      1750815_1           FastSurfer-Surf-kuegRUNNING             (null)
      1750815_2           FastSurfer-Surf-kuegRUNNING             (null)
      

      Here, the jobs are not finished yet. The FastSurfer-Cleanup-$USER job moves the data to the subject directory (--sd).

    2. Check whether there are subject folders in the subject directory and log files in <subject directory>/slurm/logs.
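
      For example (assuming <subject directory> is the directory passed to --sd):

      $ ls <subject directory>              # one folder per subject plus a slurm folder
      $ ls <subject directory>/slurm/logs   # slurm log files, e.g. seg_*.log and surf_*.log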

    3. Check the subject_success file in <subject directory>/slurm/scripts. It should have a line for each subject for both parts of the FastSurfer pipeline, e.g. "<subject id>: Finished --seg_only successfully" or "<subject id>: Finished --surf_only successfully!". If one of these is missing, the job was likely killed by slurm (e.g. because of the time or the memory limit).
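
      For example, to count the subjects that finished each part (the grep patterns follow the log lines quoted above):

      $ grep -c "Finished --seg_only" <subject directory>/slurm/scripts/subject_success
      $ grep -c "Finished --surf_only" <subject directory>/slurm/scripts/subject_success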

    4. For subjects that were unsuccessful (the subject_success file will say so), check <subject directory>/<subject id>/scripts/deep-seg.log and <subject directory>/<subject id>/scripts/recon-surf.log to see what failed. These subjects can be found by looking for ": Failed <--seg_only/--surf_only> with exit code " in <subject directory>/slurm/scripts/subject_success.
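
      For example, assuming the failure lines follow the format quoted above, this lists all failed subjects:

      $ grep ": Failed --" <subject directory>/slurm/scripts/subject_success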

    5. For subjects that were terminated (missing in subject_success), find which job is associated with the subject id via grep "<subject id>" slurm/logs/surf_*.log, then look at the end of the job and job-step logs (surf_XXX_YY.log and surf_XXX_YY_ZZ.log). If slurm terminated the job, it will say so there. You can increase the time and memory budget of srun_fastsurfer.sh with the --time and --mem flags. The following bash code snippet can help identify terminated runs.

      cd <subject directory>
      for sub in *
      do
        # skip the slurm bookkeeping folder, it is not a subject
        if [[ "$sub" == "slurm" ]]; then continue; fi
        # no "Finished --surf" line means the surface job never finished for this subject
        # (for --seg_only runs, check for "Finished --seg" instead)
        if ! grep -q "$sub: Finished --surf" slurm/scripts/subject_success
        then
          echo "$sub was terminated externally"
        fi
      done
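
      Once the failing array task IDs are known (e.g. the YY in the surf_XXX_YY.log file names mentioned above), the
      corresponding cases can be resubmitted with the --slurm_jobarray option described above, for example with the
      placeholder task IDs 4 and 7 (all other options as in the original run):

      $ ./srun_fastsurfer.sh --slurm_jobarray 4,7 <options of the original run>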