Lifecycle - Slurm mode

The lifecycle of a Ponos agent in Slurm mode is similar to that of the generic Ponos agent. Only the actions specific to task management differ.

Warning

The Slurm mode is reserved for supercomputers. Most users should use the Docker mode instead.

Setup

A Ponos agent in Slurm mode can be launched:

  • with a Python script, like the generic Ponos agent.
  • with an auto-requeued Slurm job (using the slurm.base and the slurm.daemon parameters of its configuration).

In addition to the generic Ponos agent setup, the Ponos agent in Slurm mode will:

  • Check that the PySlurm library is available. This library will be used by the agent to interact with Slurm.
  • Check that there is no other Ponos agent on the same host. Like the generic Ponos agent, it uses a unique file containing the PID of the Ponos agent currently running. It also checks that no other Ponos agent is running in a Slurm job: if it finds an agent in a Slurm job other than its own, the agent stops.
  • Mark its presence on the host. Like the generic Ponos agent, it writes its PID to the unique file. In addition, if the agent is not running in a Slurm job and the configuration has the slurm.daemon parameter, it launches a Ponos agent in Slurm mode in a Slurm job, with the same configuration and using the slurm.base and slurm.daemon parameters. That job automatically requeues itself at the end of its execution time, and the current agent stops.
  • List the tasks running on the host. The agent will list all Slurm jobs. For each Slurm job, the agent will:
    • Check whether the Slurm job has already been fully processed. If so, the Slurm job is ignored.
    • Retrieve the ID of the task related to the Slurm job from the job’s comment (which is of the form PONOS_TASK=<task_id>).
    • Retrieve the task’s details (using the RetrieveTaskDefinition endpoint). If the task no longer exists or is assigned to another agent, the Slurm job is ignored.
    • Add the task to the list of running tasks.
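
The listing steps above can be sketched in Python as follows. This is only an illustration, not the agent's actual code: the job dictionaries with job_id and comment keys, the processed_job_ids set and the retrieve_task callable (standing in for the RetrieveTaskDefinition endpoint, and returning None when the task no longer exists or belongs to another agent) are all hypothetical. Ignoring jobs whose comment does not match the PONOS_TASK=<task_id> form is also an assumption.

```python
import re
from typing import Any, Callable, Dict, Iterable, List, Optional, Set

# Job comment format described above: PONOS_TASK=<task_id>
TASK_COMMENT = re.compile(r"^PONOS_TASK=(?P<task_id>\S+)$")


def task_id_from_comment(comment: Optional[str]) -> Optional[str]:
    """Return the task ID stored in a Slurm job comment, or None if absent."""
    if not comment:
        return None
    match = TASK_COMMENT.match(comment.strip())
    return match.group("task_id") if match else None


def list_running_tasks(
    jobs: Iterable[Dict[str, Any]],
    processed_job_ids: Set[int],
    retrieve_task: Callable[[str], Optional[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Build the list of running tasks from the Slurm jobs (hypothetical shapes)."""
    running = []
    for job in jobs:
        # Jobs that were already fully processed are ignored.
        if job["job_id"] in processed_job_ids:
            continue
        # Assumption: jobs without a PONOS_TASK comment are not Ponos tasks.
        task_id = task_id_from_comment(job.get("comment"))
        if task_id is None:
            continue
        # retrieve_task stands in for the RetrieveTaskDefinition endpoint.
        task = retrieve_task(task_id)
        if task is None:
            continue
        running.append(task)
    return running
```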

Loop

For its loop, the Ponos agent in Slurm mode uses the same lifecycle as the generic Ponos agent. However, there are some specific points to note when the agent checks the running tasks:

  • To retrieve the task’s logs, the agent uses the output file related to the Slurm job.
  • To find out the exit code, the agent aggregates the exit codes from all the steps of the Slurm job.
  • The agent updates the task’s state to Running only if the Slurm job is in the Running or Completing state.
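
As a rough sketch of the last two points, the helpers below show one possible way to decide when a task may be marked as Running and how step exit codes could be combined. RUNNING and COMPLETING are standard Slurm job states, but the function names are hypothetical, and the aggregation rule (return the first non-zero step exit code, otherwise 0) is an assumption: the text above does not specify how the agent aggregates exit codes.

```python
from typing import Iterable

# Slurm job states in which the task may be reported as Running.
RUNNING_JOB_STATES = {"RUNNING", "COMPLETING"}


def should_mark_running(job_state: str) -> bool:
    """Return True only if the Slurm job is in the Running or Completing state."""
    return job_state.upper() in RUNNING_JOB_STATES


def aggregate_exit_codes(step_exit_codes: Iterable[int]) -> int:
    """Combine the exit codes of all the steps of a Slurm job.

    Assumption: the first non-zero step exit code wins, and a job whose
    steps all succeeded (or that has no step) yields 0.
    """
    for code in step_exit_codes:
        if code != 0:
            return code
    return 0
```

With this rule, a job in the PENDING state is never reported as Running, and a job with step exit codes 0, 2 and 1 is reported with exit code 2.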

Start task

For a Ponos agent in Slurm mode, starting a task means:

  • Clone the task’s Git repository.
  • Check out the correct commit of the task’s Git repository.
  • Create a virtual environment.
  • Install the Git repository in the virtual environment according to the install_ponos_slurm.sh file at the root of the Git repository. The agent defines the PIP_FLAGS variable to set up the virtual environment. By default, the agent uses the pip install "${PIP_FLAGS}" . command.
  • Start the task in a Slurm job (using the slurm.base, slurm.cpu and slurm.gpu parameters of its configuration) according to the docker.command attribute defined in the .arkindex.yml file. If the task requires a GPU, the agent assumes that the task will not have Internet access. So that the task can still be processed, the agent uses the RetrieveWorkerRun endpoint to create a configuration file aggregating the task’s various configurations (worker version configuration, model configuration and user configuration).
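
Two of these steps can be illustrated with a minimal Python sketch. It is not the agent's implementation: the function names, the repo and venv directory names, and the dictionary shapes are hypothetical. The merge order, in which the model configuration overrides the worker version configuration and the user configuration overrides both, is also an assumption, since the text above only says that the configurations are aggregated.

```python
import os
from typing import Any, Dict, List, Optional


def preparation_commands(repo_url: str, commit: str, workdir: str) -> List[List[str]]:
    """Commands for the clone, checkout and virtual environment steps.

    The commands are only built here, not executed; each list could be
    passed to subprocess.run(command, check=True).
    """
    repo_dir = os.path.join(workdir, "repo")  # hypothetical layout
    venv_dir = os.path.join(workdir, "venv")  # hypothetical layout
    return [
        ["git", "clone", repo_url, repo_dir],
        ["git", "-C", repo_dir, "checkout", commit],
        ["python3", "-m", "venv", venv_dir],
    ]


def merge_configurations(
    worker_version: Optional[Dict[str, Any]],
    model: Optional[Dict[str, Any]],
    user: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Aggregate the task's configurations into a single dictionary.

    Assumption: later layers override earlier ones, in the order
    worker version, model, user.
    """
    merged: Dict[str, Any] = {}
    for layer in (worker_version, model, user):
        merged.update(layer or {})
    return merged
```

The merged dictionary could then be written to a file that the task reads instead of querying the API, which is the purpose of the configuration file described above for tasks without Internet access.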

Note

Unlike with the generic Ponos agent, the task’s state is not updated to Running once the task has been launched. It remains Pending, to match the state of the Slurm job, which is placed in the queue and does not necessarily start immediately.

Stop task

For a Ponos agent in Slurm mode, stopping a task means:

  • Cancel the Slurm job related to the task.