HPC systems
shark can run not only on desktop computers and laptops but also on High Performance Computing (HPC) clusters. Some special considerations apply when using shark in these environments; they are detailed in this section.
In this section we first briefly explain some basic HPC concepts for people who are not familiar with HPC systems (feel free to skip otherwise), and then describe how to build and run shark in these environments.
Brief introduction to HPC
Note
Feel free to skip this section if you already know your way around HPC systems.
Modules
One of the main differences between personal computers and HPC systems is that, on HPC systems, software requirements (sometimes including the compiler itself) are usually provided in the form of modules. Modules are software installations that can be loaded and unloaded interactively, making the software within them available or unavailable at will. For a full description of how modules work, see the modules documentation.
Please refer to your HPC centre’s documentation if you need more details on installed modules, or if you are missing a required dependency.
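To make "available and unavailable at will" concrete, the POSIX sh snippet below is a toy model, not how the module tools are implemented: for a typical modulefile, loading mostly prepends the installation's directories to search-path variables such as PATH and LD_LIBRARY_PATH, and unloading removes them again. The function names and the /opt/fake-boost prefix are hypothetical.

```shell
#!/bin/sh
# Toy model of "module load" / "module unload": a typical modulefile
# prepends an installation's directories to search-path variables, and
# unloading removes them again. Real module tools do much more than this.

# Print the colon-separated list $1 with every entry equal to $2 removed.
path_remove() {
    _result=""
    _old_ifs=$IFS
    IFS=:
    set -f                          # no globbing while splitting on ':'
    for _entry in $1; do
        if [ "$_entry" != "$2" ]; then
            _result="${_result:+$_result:}$_entry"
        fi
    done
    set +f
    IFS=$_old_ifs
    printf '%s\n' "$_result"
}

# Make the (hypothetical) installation under prefix $1 visible.
toy_module_load() {
    PATH="$1/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="$1/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Hide it again by dropping its directories from the same variables.
toy_module_unload() {
    PATH=$(path_remove "$PATH" "$1/bin")
    LD_LIBRARY_PATH=$(path_remove "$LD_LIBRARY_PATH" "$1/lib")
    export PATH LD_LIBRARY_PATH
}
```

On a real system the corresponding commands are module load xyz, module unload xyz, and module list to see which modules are currently loaded.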
Queues
Note
Feel free to skip this section if you know what queues are and how they operate.
HPC centres usually do not allow programs to be executed interactively on their compute nodes. Instead, users need to submit jobs to a queue. A job is simply a script containing all the steps needed to run your work automatically. When submitting a job, users need to specify how many resources it needs (e.g., how many CPUs or nodes, or how much memory). Jobs are then eventually taken out of the queue and executed on the cluster nodes. When a job is executed depends on a few factors, such as how many resources it requests, the amount of currently available resources, and your priority, among others.
Several queueing systems exist, with SLURM and PBS/TORQUE being the two most popular ones. Although from a high-level perspective they both offer similar functionality, the particulars of how to use them are quite different. For example, jobs are submitted to the queue with different commands on each system, and resource requirements are specified differently.
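To illustrate how differently the two systems express the same request, the sketch below prints a minimal single-node job script header for either SLURM (#SBATCH directives) or PBS/TORQUE (#PBS directives). The function name, its argument order and ./my_program are hypothetical, and the exact directives accepted vary between sites (PBS Pro installations, for instance, often use -l select=... instead of -l nodes=...).

```shell
#!/bin/sh
# Print a minimal single-node job script for SLURM or PBS/TORQUE.
# Usage: make_job_script slurm|pbs NAME CPUS MEM_GB WALLTIME COMMAND
make_job_script() {
    case "$1" in
        slurm|pbs) ;;
        *) echo "unknown queueing system: $1" >&2; return 1 ;;
    esac
    echo '#!/bin/bash'
    if [ "$1" = slurm ]; then
        echo "#SBATCH --job-name=$2"
        echo "#SBATCH --nodes=1"
        echo "#SBATCH --cpus-per-task=$3"
        echo "#SBATCH --mem=${4}G"
        echo "#SBATCH --time=$5"
    else
        echo "#PBS -N $2"
        echo "#PBS -l nodes=1:ppn=$3"
        echo "#PBS -l mem=${4}gb"
        echo "#PBS -l walltime=$5"
    fi
    printf '%s\n' "$6"              # the actual work to run
}
```

For example, make_job_script slurm demo 4 8 01:00:00 ./my_program > job.sh followed by sbatch job.sh submits the job under SLURM, while the pbs variant would be submitted with qsub job.sh.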
Building
Building shark on HPC systems follows the same procedure used to build it on personal computers and laptops, with the additional task of loading the modules that contain the requirements for building shark. To do this, use module avail xyz to look for them (e.g., module avail boost), then load them with module load xyz (e.g., module load boost). If all goes well, cmake will be able to find the requirements.
If you are missing a module, you have a few alternatives:
- Contact your HPC cluster support and ask them to install the missing module
- Build the software yourself (and optionally install it as a personal module)
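After loading modules, a quick sanity check that the expected tools are actually on PATH can save a confusing cmake failure. The helper below is a hypothetical convenience, not part of shark; note that it only checks commands, while libraries such as Boost are still located by cmake itself.

```shell
#!/bin/sh
# Report any of the given commands that are not on PATH.
# Returns non-zero if at least one is missing.
require_commands() {
    _missing=0
    for _cmd in "$@"; do
        if ! command -v "$_cmd" >/dev/null 2>&1; then
            echo "missing: $_cmd (look for it with 'module avail $_cmd')" >&2
            _missing=1
        fi
    done
    return "$_missing"
}
```

For example, after module load cmake (module names vary between sites), running require_commands cmake make && mkdir -p build && cd build && cmake .. && make would perform the usual out-of-source cmake build only if both tools were found.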
Intel compiler
Depending on the version, OpenMP support in the Intel compiler can be difficult to detect. cmake versions older than 3.9 are not able to detect OpenMP support in newer Intel compilers; simply using cmake 3.9 or later solves the issue. If you find yourself in this situation, a big warning message will appear when running cmake to alert you and guide you on what to do.
Running with shark-submit
As stated in Scalability, shark scales thanks to its input data being split into independent sub-volumes, and thanks to its multithreading support via OpenMP. When running on an HPC cluster, one should take advantage of both aspects to efficiently execute shark across the available resources.
To this end, shark ships with a shark-submit script that can be used to easily submit jobs that run shark over a list of sub-volumes in an HPC cluster. Internally, the script spawns an independent shark instance for each of the requested sub-volumes, and instructs each instance to use multiple CPUs.
shark-submit not only eases this submission process; it also abstracts away the differences between queueing systems, making it easy for users to specify exactly what they want without having to remember particular commands and formats. It also creates the submissions in such a way that all artifacts resulting from a submission (e.g., log files, environment information, even plots) end up in separate, well-defined locations, making it easy to inspect them and share them if needed.
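The per-sub-volume parallelism described above can be pictured with the conceptual sketch below: one independent process per sub-volume, each allowed to use a given number of threads through OMP_NUM_THREADS, the standard OpenMP environment variable. This is not the code of shark-submit, which also handles the queue submission; run_one is a hypothetical placeholder for "run shark on this sub-volume", and on a real cluster the instances would run on the allocated resources rather than as local background processes.

```shell
#!/bin/sh
# Conceptual sketch: spawn one independent process per sub-volume, each
# using $1 OpenMP threads, then wait for all of them to finish.
# run_one is a hypothetical placeholder supplied by the caller.
run_subvolumes() {
    _threads=$1
    shift
    for _sv in "$@"; do
        (
            OMP_NUM_THREADS=$_threads
            export OMP_NUM_THREADS
            run_one "$_sv"
        ) &
    done
    wait    # block until every instance has finished
}
```

The design mirrors the two scaling axes: sub-volumes give independent processes, and OpenMP threads give parallelism within each of them.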
Note
Only SLURM is currently supported. Support for TORQUE is planned, and more systems might be added with time.
The shark-submit script lives under the hpc subdirectory of the shark repository. Its most basic usage looks like this:
$> hpc/shark-submit -V 0 <config_file>
This submits a job that runs shark only on sub-volume 0, using the config_file configuration file.
Running with different parameter sets
As explained above, the default mode of execution of shark-submit parallelizes shark executions by sub-volume, using the same configuration. However, a second mode is supported, where users can specify different parameter sets to be evaluated against the same inputs. This is useful, for instance, when sampling a parameter search space to optimize shark against certain constraints, or during other exploratory exercises.
This mode is triggered by the -E file flag, where file must contain the command-line flags to be given to each shark instance, one instance per line. For example:
-o "reincorporation.tau_reinc=9.789522070051014" -o "reincorporation.mhalo_norm=53167281575.647736" -o "reincorporation.halo_mass_power=-1.662049864221243"
-o "reincorporation.tau_reinc=4.433571656151598" -o "reincorporation.mhalo_norm=344951728442.8235" -o "reincorporation.halo_mass_power=-2.197944428980997"
-o "reincorporation.tau_reinc=8.744659838237162" -o "reincorporation.mhalo_norm=106081569566.5114" -o "reincorporation.halo_mass_power=-2.2146743876637798"
-o "reincorporation.tau_reinc=5.568250109183069" -o "reincorporation.mhalo_norm=36854502778.199234" -o "reincorporation.halo_mass_power=-1.909464612909543"
-o "reincorporation.tau_reinc=2.9521943986079364" -o "reincorporation.mhalo_norm=1916185243.0129645" -o "reincorporation.halo_mass_power=-2.797548035509215"
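A file like the one above can be generated with a few lines of shell and awk. The sketch below samples the same three reincorporation parameters uniformly (mhalo_norm uniformly in log10) and prints one line of -o flags per parameter set; the ranges, the default seed and the function name are hypothetical illustrations, not recommended values.

```shell
#!/bin/sh
# Print $1 lines of -o flags, one parameter set per line, sampling three
# reincorporation parameters uniformly within illustrative ranges.
# The ranges and the default seed (42) are hypothetical.
make_param_file() {
    awk -v n="$1" -v seed="${2:-42}" 'BEGIN {
        srand(seed)
        for (i = 0; i < n + 0; i++) {
            tau = 1 + 9 * rand()          # tau_reinc in [1, 10)
            lognorm = 9 + 3 * rand()      # log10(mhalo_norm) in [9, 12)
            power = -3 + 2 * rand()       # halo_mass_power in [-3, -1)
            printf "-o \"reincorporation.tau_reinc=%.6f\" ", tau
            printf "-o \"reincorporation.mhalo_norm=%.1f\" ", 10 ^ lognorm
            printf "-o \"reincorporation.halo_mass_power=%.6f\"\n", power
        }
    }'
}
```

For example, make_param_file 20 > params.txt followed by something like hpc/shark-submit -V 0 -E params.txt <config_file> would evaluate 20 parameter sets; passing the same seed as second argument reproduces the same file.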
When this mode is used, the -V flag that indicates sub-volumes applies equally to all shark instances. For example, if the user specifies -V 0-3 -E simple.txt, and simple.txt contains three lines, then only three shark instances will be spawned, each processing sub-volumes 0-3.
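The rule above (one instance per line of the parameter file, with the same sub-volume selection for all of them) can be checked with a dry run like the following. It only prints a description and treats the -V value as an opaque label; it is a hypothetical helper, not how shark-submit is written.

```shell
#!/bin/sh
# Dry run of the -E mode: every line of the parameter file ($2) becomes one
# instance, and the same sub-volume selection ($1) applies to all of them.
describe_instances() {
    _i=0
    while IFS= read -r _flags || [ -n "$_flags" ]; do
        printf 'instance %s: sub-volumes %s, flags: %s\n' "$_i" "$1" "$_flags"
        _i=$((_i + 1))
    done < "$2"
}
```

Running describe_instances 0-3 simple.txt on a three-line file prints three lines, matching the three instances described above.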
Options
shark-submit supports many options, which are roughly grouped into the following categories:
- Queueing: which queue to submit to, how many resources are needed (memory, CPUs and/or nodes), and more.
- Plotting: whether to produce the standard plots.
- Shark: shark-specific options, like which shark binary to use, which sub-volumes to process, and which configuration file to use.
- Other: modules to load, output directory to use, etc.
For full help on all available options, run:
$> hpc/shark-submit -h
Environment variables
Some of the options of shark-submit will probably remain the same across most (if not all) executions. Because of this, a handful of environment variables are inspected by shark-submit and interpreted as the default values for some of these options (run shark-submit -h for a full list). You can thus define these variables once (e.g., in your ~/.bashrc or ~/.bash_profile files) to avoid typing them each time.
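The general pattern is sketched below: an environment variable supplies the default, and an explicit command-line option still takes precedence. The actual variable names are listed by shark-submit -h; MY_QUEUE, the -q option and pick_queue are made up purely for illustration.

```shell
#!/bin/sh
# Hypothetical example of the pattern: an environment variable supplies the
# default, and an explicit command-line option still wins. MY_QUEUE and -q
# are made-up names, not shark-submit's.
pick_queue() {
    _queue=${MY_QUEUE:-normal}        # environment default, else built-in
    while [ $# -gt 0 ]; do
        case "$1" in
            -q) [ $# -ge 2 ] || return 1
                _queue=$2
                shift 2 ;;
            *) shift ;;
        esac
    done
    printf '%s\n' "$_queue"
}
```

With export MY_QUEUE=long in ~/.bashrc, every new shell would default to the long queue, while pick_queue -q short would still select short for a single call.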