HPC systems

shark can run not only on desktop computers and laptops but also on High Performance Computing (HPC) clusters. Some special considerations need to be taken into account when using shark in these environments; they are detailed in this section.

In this section we first briefly explain some basic HPC concepts for people who are not familiar with HPC systems (feel free to skip otherwise), and then describe how to build and run shark in these environments.

Brief introduction to HPC

Note

Feel free to skip this section if you know your way around HPC systems already

Modules

One of the main differences between personal computers and HPC systems is that in these environments dependencies (sometimes including the compiler itself) are usually installed in the form of modules. Modules are software installations that can be loaded and unloaded interactively, making the software they contain available or unavailable at will. For a full description of how modules work see the modules documentation.
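
For example, a typical interactive session with modules might look like the following (the module names and versions are only illustrative and will differ between clusters):

$> module avail gcc            # list the gcc modules installed on the cluster
$> module load gcc/9.3.0       # make that compiler available in the current session
$> module list                 # show which modules are currently loaded
$> module unload gcc/9.3.0     # make it unavailable again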

Please refer to your HPC centre’s documentation if you need more details on installed modules, or if you are missing a required dependency.

Queues

Note

Feel free to skip this section if you know what queues are and how they operate

HPC centres usually do not allow executing programs on nodes interactively. Instead, users need to submit a job to a queue. A job is simply a script containing all the steps needed to run your work automatically. When submitting a job, users need to specify how many resources it will need (e.g., how many CPUs or nodes, and how much memory). Jobs are then eventually taken out of the queue and executed on the cluster nodes. When a job is executed depends on a few factors, like how many resources it is requesting, the amount of currently available resources, and your priority, among others.

Several queueing systems exist, with SLURM and PBS/TORQUE being the two most popular ones. Although from a high-level perspective they both offer similar functionality, the particulars on how to use them are quite different. For example, submitting jobs to the queue is done with different commands on each system, and resource requirements are specified differently.
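
As an illustration, a minimal SLURM job script could look like the following (the resource values and program name are arbitrary). A PBS/TORQUE script would express the same requests with #PBS directives instead:

#!/bin/bash
#SBATCH --job-name=my-job      # name shown in the queue
#SBATCH --ntasks=1             # a single task...
#SBATCH --cpus-per-task=4      # ...using 4 CPUs
#SBATCH --mem=8G               # total memory for the job
#SBATCH --time=02:00:00        # maximum wall-clock time

./my_program                   # the actual work to run

The script would be submitted with sbatch under SLURM, whereas PBS/TORQUE uses qsub.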

Building

Building shark on HPC systems follows the same procedure used to build it on personal computers and laptops, with the additional task of loading the necessary modules containing shark's build requirements. To do this, use module avail xyz to look for them (e.g., module avail boost), then load them with module load xyz (e.g., module load boost). If all goes well, cmake will be able to find the requirements.
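
For instance, a build session on a cluster could look roughly like the following. The exact module names, versions and the set of dependencies you need to load are assumptions here; check what your cluster actually provides:

$> module load gcc cmake boost hdf5 gsl
$> mkdir build && cd build
$> cmake ..
$> make -j4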

If you are missing a module you have a few alternatives:

  • Contact your HPC cluster support and ask them to install the missing module
  • Build the software yourself (and optionally install it as a personal module; see the sketch below)
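
For the second option, most module systems let you expose your own installations as personal modules. A minimal sketch, assuming a hypothetical package mylib installed under your home directory, might be:

$> ./configure --prefix=$HOME/apps/mylib/1.0 && make && make install
$> mkdir -p $HOME/modulefiles/mylib

Then create a modulefile at $HOME/modulefiles/mylib/1.0 with contents like:

#%Module1.0
prepend-path PATH            $env(HOME)/apps/mylib/1.0/bin
prepend-path LD_LIBRARY_PATH $env(HOME)/apps/mylib/1.0/lib

and make it visible to the module command:

$> module use $HOME/modulefiles
$> module load mylib/1.0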

Intel compiler

Depending on your cmake version, OpenMP support in the Intel compiler can be difficult to detect. cmake versions older than 3.9 are not able to identify OpenMP support for newer Intel compilers; simply using a version of cmake >= 3.9 solves the issue. If you find yourself in this situation, a big warning message will appear when running cmake to alert you and guide you on what to do.

Running with shark-submit

As stated in Scalability, shark scales thanks to its input data being split into independent sub-volumes, and thanks to its multithreading support via OpenMP. When running on an HPC cluster one should take advantage of these two aspects to efficiently execute shark across the available resources.

To this end shark ships with a shark-submit script that can be used to easily submit jobs that run shark over a list of sub-volumes in an HPC cluster. Internally, the script spawns an independent shark instance for each of the requested sub-volumes, and instructs each instance to use multiple CPUs.

shark-submit not only eases this submission process; it also abstracts away the differences between queueing systems, making it easy for users to specify exactly what they want without having to remember particular commands and formats. It also creates the submissions in such a way that all artifacts resulting from a submission (i.e., log files, environment information, even plots) end up in separate, well-defined locations, making it easy to inspect them and share them if needed.

Note

Only SLURM is currently supported; support for TORQUE is planned, and more systems might be added with time.

The shark-submit script lives under the hpc subdirectory of the shark repository. Its most basic usage looks like this:

$> hpc/shark-submit -V 0 <config_file>

That will submit an execution of shark for sub-volume 0 only, using the config_file configuration file.
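
The -V flag also accepts ranges, so a single submission can cover several sub-volumes at once; shark-submit then spawns one shark instance per requested sub-volume, as described above. For example (the configuration file name here is just a placeholder):

$> hpc/shark-submit -V 0-3 my_model.cfg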

Running with different parameter sets

As explained above, the default mode of execution of shark-submit parallelizes shark executions by sub-volume, using the same configuration. However, a second mode is supported, where users can specify different parameter sets to be evaluated against the same inputs. This is useful, for instance, when one is sampling a parameter search space to optimize shark against certain constraints, or during other exploratory exercises.

This mode is triggered by using the -E file flag. file must contain the command-line flags that will be given to each shark instance, one set of flags per line. For example:

-o "reincorporation.tau_reinc=9.789522070051014" -o "reincorporation.mhalo_norm=53167281575.647736" -o "reincorporation.halo_mass_power=-1.662049864221243"
-o "reincorporation.tau_reinc=4.433571656151598" -o "reincorporation.mhalo_norm=344951728442.8235" -o "reincorporation.halo_mass_power=-2.197944428980997"
-o "reincorporation.tau_reinc=8.744659838237162" -o "reincorporation.mhalo_norm=106081569566.5114" -o "reincorporation.halo_mass_power=-2.2146743876637798"
-o "reincorporation.tau_reinc=5.568250109183069" -o "reincorporation.mhalo_norm=36854502778.199234" -o "reincorporation.halo_mass_power=-1.909464612909543"
-o "reincorporation.tau_reinc=2.9521943986079364" -o "reincorporation.mhalo_norm=1916185243.0129645" -o "reincorporation.halo_mass_power=-2.797548035509215"

When this mode is used, the -V flag that indicates sub-volumes applies equally to all shark instances. For example, if the user specifies -V 0-3 -E simple.txt, and simple.txt contains three lines, then only three shark instances will be spawned, each processing sub-volumes 0 to 3.
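
Putting both flags together, a submission evaluating the parameter sets above over sub-volumes 0 to 3 could look like this (the configuration file is the same positional argument shown earlier):

$> hpc/shark-submit -V 0-3 -E simple.txt <config_file>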

Options

shark-submit supports many options, which are roughly grouped into the following categories:

  • Queueing: these include which queue to submit to, how many resources are needed (memory, CPUs and/or nodes), and more.
  • Plotting: these control whether to produce the standard plots.
  • Shark: these are shark-specific options, like which particular shark binary to use, which sub-volumes to process, and which configuration file to use.
  • Other: modules to load, output directory to use, etc.

For full help on all available options run:

$> hpc/shark-submit -h

Environment variables

Some of the options of shark-submit will probably remain the same across most (if not all) executions. Because of this, a handful of environment variables are inspected by shark-submit and interpreted as default values for some of these options (run shark-submit -h for a full list). You can thus define these variables once (e.g., in your ~/.bashrc or ~/.bash_profile files) to avoid having to type them every time.
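
For example, you could add a line like the following to your shell start-up file. Note that the variable name below is only a placeholder; the actual variable names recognised by shark-submit are listed in its -h output:

# in ~/.bashrc or ~/.bash_profile (placeholder variable name)
export SHARK_SOME_DEFAULT="value-you-always-use"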