Labtasker¶
Introduction¶
Labtasker is an easy-to-use task queue system designed for dispatching lab experiment tasks to user-defined workers.
It enables users to submit various experiment arguments to a server-based task queue. Worker nodes can then retrieve and execute these tasks from the queue.
Unlike traditional HPC resource management systems like SLURM, Labtasker is tailored for users rather than system administrators.
Motivation¶
Why not simple bash scripts?¶
Imagine you have several lab experiments to run on a single GPU, each with multiple parameters to configure.
The simplest approach is to create a script for each experiment and execute them sequentially.
for my_param_1 in 1 2 3 4; do
for my_param_2 in 1 2 3 4; do
for my_param_3 in 1 2 3 4; do
python run_my_experiment.py --param1 $my_param_1 --param2 $my_param_2 --param3 $my_param_3
done
done
done
This method works, but what if you have more than one GPU?
Let's say you have 4 GPUs. You can split the experiments into 4 groups and run them in parallel to make better use of the GPU resources.
# Use my_param_1 to divide the experiments into 4 groups for 4 GPUs
# my_experiment_1.sh
my_param_1=1
for my_param_2 in 1 2 3 4; do
for my_param_3 in 1 2 3 4; do
python run_my_experiment.py --param1 $my_param_1 --param2 $my_param_2 --param3 $my_param_3
done
done
# my_experiment_2.sh
my_param_1=2
for my_param_2 in 1 2 3 4; do
for my_param_3 in 1 2 3 4; do
python run_my_experiment.py --param1 $my_param_1 --param2 $my_param_2 --param3 $my_param_3
done
done
# my_experiment_3.sh
my_param_1=3
for my_param_2 in 1 2 3 4; do
for my_param_3 in 1 2 3 4; do
python run_my_experiment.py --param1 $my_param_1 --param2 $my_param_2 --param3 $my_param_3
done
done
# my_experiment_4.sh
my_param_1=4
for my_param_2 in 1 2 3 4; do
for my_param_3 in 1 2 3 4; do
python run_my_experiment.py --param1 $my_param_1 --param2 $my_param_2 --param3 $my_param_3
done
done
However, this method can quickly become unwieldy and offers limited control over the experiments once the wrapper scripts are running.
- What if the parameters are difficult to divide, making it challenging to split the loop into multiple scripts?
- What if you realize some experiments are unnecessary while monitoring them live? You'd have to stop the script and modify it.
- What if you want to prioritize certain experiments after reviewing initial results? You'd have to stop the script and modify it.
- What if you want to add more experiments to the queue? You'd have to stop the script and modify it.
- What if some experiments fail? You'd need to create new scripts to restart them.
Labtasker is designed to overcome these challenges.
Why not SLURM?¶
Labtasker is designed to be a simple and easy-to-use.
It disentangles task queue from resource management. It offers a versatile task queue system that can be used by anyone (not just system administrators), without the need for extensive configuration or knowledge of HPC systems.
Here's are key conceptual differences between Labtasker and SLURM:
Aspects | SLURM | Labtasker |
---|---|---|
Purpose | HPC resource management system | Task queue system for lab experiments |
Who is it for | Designed for system administrators | Designed for users |
Configuration | Requires extensive configuration | Minimal configuration needed |
Task Submission | Jobs submitted as scripts with resource requirements | Tasks submitted as argument groups (JSON dictionaries) |
Resource Handling | Allocates resources and runs the job | Does not explicitly handle resource allocation |
Flexibility | Assumes specific resource and task types | No assumptions about task nature, experiment type, or computation resources |
Execution | Runs jobs on allocated resources | User-defined worker scripts run on various machines/GPUs/CPUs and decide how to handle the arguments |
Reporting | Handled by the framework | Reports results back to the server via API |