Lifecycle - Slurm mode
The lifecycle of a Ponos agent in Slurm mode is similar to that of the generic Ponos agent. Only the actions specific to task management differ.
Warning
Slurm mode is intended for supercomputers. Most users should use Docker mode instead.
Setup
A Ponos agent in Slurm mode can be launched:
- with a Python script, like the generic Ponos agent.
- with an auto-requeued Slurm job (using the `slurm.base` and `slurm.daemon` parameters of its configuration).
In addition to the generic Ponos agent setup, the Ponos agent in Slurm mode will:
- Check that the PySlurm library is available. This library will be used by the agent to interact with Slurm.
- Check that there is no other Ponos agent on the same host. Like the generic Ponos agent, it uses a unique file containing the PID of the Ponos agent currently running. It also checks that no Ponos agent is running in a Slurm job. If one is found in a Slurm job other than the current agent’s own, the agent stops.
- Mark its presence on the host. Like the generic Ponos agent, it writes its PID to the unique file. However, if the agent is not running in a Slurm job and the configuration has the `slurm.daemon` parameter, it launches a Ponos agent in Slurm mode in a Slurm job with the same configuration (using the `slurm.base` and `slurm.daemon` parameters of its configuration). That job automatically requeues itself at the end of its execution time, and the current agent stops.
- List the tasks running on the host. The agent lists all Slurm jobs. For each Slurm job, the agent will:
    - Check whether the Slurm job has already been fully processed. If so, the Slurm job is ignored.
    - Retrieve the ID of the task related to the Slurm job from the job’s comment (which is of the form `PONOS_TASK=<task_id>`).
    - Retrieve the task’s details (using the `RetrieveTaskDefinition` endpoint). If the task no longer exists or is assigned to another agent, the Slurm job is ignored.
    - Add the task to the list of running tasks.
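The job-listing step can be sketched as follows. This is a minimal illustration rather than the agent's actual code: `parse_task_id` and `list_running_task_ids` are hypothetical helpers, the jobs are passed in as a plain mapping of job ID to comment instead of being read through PySlurm, and the call to the `RetrieveTaskDefinition` endpoint is left out.

```python
import re
from typing import Dict, List, Optional, Set

# Comment format taken from the text above: PONOS_TASK=<task_id>
TASK_COMMENT = re.compile(r"^PONOS_TASK=(?P<task_id>\S+)$")


def parse_task_id(comment: Optional[str]) -> Optional[str]:
    """Return the task ID stored in a Slurm job comment, or None when the
    comment does not follow the PONOS_TASK=<task_id> form."""
    if not comment:
        return None
    match = TASK_COMMENT.match(comment.strip())
    return match.group("task_id") if match else None


def list_running_task_ids(
    jobs: Dict[int, Optional[str]], processed_job_ids: Set[int]
) -> List[str]:
    """Hypothetical helper: `jobs` maps a Slurm job ID to its comment."""
    task_ids = []
    for job_id, comment in jobs.items():
        if job_id in processed_job_ids:
            continue  # already fully processed: ignore the job
        task_id = parse_task_id(comment)
        if task_id is None:
            continue  # the job was not created for a Ponos task
        # The real agent would retrieve the task's details here and skip
        # tasks that no longer exist or belong to another agent.
        task_ids.append(task_id)
    return task_ids
```

For example, `list_running_task_ids({1: "PONOS_TASK=a", 2: "PONOS_TASK=b"}, {2})` keeps only the task attached to job 1.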
Loop
For its loop, the Ponos agent in Slurm mode uses the same lifecycle as the generic Ponos agent. However, there are some specific points to note when the agent checks the running tasks:
- To retrieve the task’s logs, the agent uses the output file related to the Slurm job.
- To find out the exit code, the agent aggregates the exit codes from all the steps of the Slurm job.
- The agent updates the task’s state to `Running` only if the Slurm job is in the `Running` or `Completing` state.
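The last two checks can be illustrated with a short sketch. The text does not specify how the step exit codes are aggregated, so the rule used here (the first non-zero exit code wins) is an assumption, as are the function names and the upper-case spelling of the Slurm state names.

```python
from typing import Iterable

# Slurm job states for which the task is reported as Running (from the text above).
# Assumption: states are compared using Slurm's upper-case names.
RUNNING_JOB_STATES = {"RUNNING", "COMPLETING"}


def should_mark_running(job_state: str) -> bool:
    """Return True when the task's state may be updated to Running."""
    return job_state.upper() in RUNNING_JOB_STATES


def aggregate_exit_codes(step_exit_codes: Iterable[int]) -> int:
    """Assumed aggregation rule: return the first non-zero exit code,
    or 0 when every step succeeded (or there are no steps)."""
    for code in step_exit_codes:
        if code != 0:
            return code
    return 0
```

With this rule, a job whose steps exited with `0`, `2` and `1` is reported with exit code `2`.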
Start task
For a Ponos agent in Slurm mode, starting a task means:
- Clone the task’s Git repository.
- Check out the correct commit of the task’s Git repository.
- Create a virtual environment.
- Install the Git repository in the virtual environment according to the `install_ponos_slurm.sh` file at the root of the Git repository. The agent defines the `PIP_FLAGS` variable to set up the virtual environment. By default, the agent uses the `pip install "${PIP_FLAGS}" .` command.
- Start the task in a Slurm job (using the `slurm.base`, `slurm.cpu` and `slurm.gpu` parameters of its configuration) according to the `docker.command` attribute defined in the `.arkindex.yml` file. If the task requires a GPU, the agent assumes that the task will not have access to the Internet. So that the task can still be processed, the agent creates a configuration file aggregating the task’s various configurations (worker version configuration, model configuration, user configuration), using the `RetrieveWorkerRun` endpoint.
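The configuration file created for GPU tasks can be pictured with the sketch below. Several points are assumptions made only for illustration: the function names, the JSON file format, and the precedence order (user configuration over model configuration over worker version configuration). The three input dictionaries stand in for values that the agent would obtain from the `RetrieveWorkerRun` endpoint.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

Config = Dict[str, Any]


def aggregate_configuration(
    worker_version_config: Optional[Config],
    model_config: Optional[Config],
    user_config: Optional[Config],
) -> Config:
    """Merge the three configurations into a single dictionary.

    Assumption: later sources override earlier ones
    (worker version < model < user); the real precedence may differ.
    """
    merged: Config = {}
    for source in (worker_version_config, model_config, user_config):
        merged.update(source or {})
    return merged


def write_configuration(path: str, configuration: Config) -> Path:
    """Write the merged configuration to a file (JSON is an assumed format)
    that a task without Internet access could read locally."""
    target = Path(path)
    target.write_text(json.dumps(configuration, indent=2))
    return target
```

Writing everything to a local file before submission means the task itself never needs to call the API once it is running.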
Note
Once the task has been launched, its state is not updated to `Running` (as it would be with the generic Ponos agent) but remains `Pending`. This matches the state of the Slurm job, which is placed in the queue and does not necessarily start immediately.
Stop task
For a Ponos agent in Slurm mode, stopping a task means:
- Cancel the Slurm job related to the task.