flambe.cluster.instance.instance

This module includes base Instance classes to represent machines.

All Instance objects will be managed by Cluster objects (flambe.cluster.cluster.Cluster).

This base implementation is independent of the type of instance used.

Any new instance type that flambe supports should inherit from the classes defined in this module.

Module Contents

flambe.cluster.instance.instance.logger[source]
flambe.cluster.instance.instance.InsT[source]
class flambe.cluster.instance.instance.Instance(host: str, private_host: str, username: str, key: str, config: ConfigParser, debug: bool, use_public: bool = True)[source]

Bases: object

Encapsulates remote instances.

In this context, the instance is a running computer.

All instances used by flambe remote mode will inherit from Instance. This class provides high-level methods to deal with remote instances (for example, sending a shell command over SSH).

Important: Instance objects should be picklable. Make sure that all child classes can be pickled.
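
The pickling requirement matters most for child classes that hold live resources, such as a cached SSH client. The following is a minimal sketch, not flambe code: CachingInstance and its _cli attribute are hypothetical names, and a threading lock stands in for the unpicklable resource. Dropping the attribute in __getstate__ keeps the object picklable.

```python
import pickle
import threading


class CachingInstance:
    """Hypothetical stand-in for an Instance subclass; not flambe code."""

    def __init__(self, host: str) -> None:
        self.host = host
        # A lock stands in for an unpicklable live resource (e.g. an SSH client).
        self._cli = threading.Lock()

    def __getstate__(self) -> dict:
        # Copy the attributes and drop the live resource before pickling.
        state = self.__dict__.copy()
        state["_cli"] = None
        return state


restored = pickle.loads(pickle.dumps(CachingInstance("10.0.0.1")))
print(restored.host, restored._cli)
```

After unpickling, such a class would need to recreate the resource lazily on first use.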

The local flambe process communicates with the remote instances over SSH, authenticating with private keys.

Parameters:
  • host (str) – The public DNS host of the remote machine.
  • private_host (str) – The private DNS host of the remote machine.
  • username (str) – The machine’s username.
  • key (str) – The path to the ssh key used to communicate with the instance.
  • config (ConfigParser) – The config object that contains useful information for the instance. For example, config['SSH']['SSH_KEY'] should contain the path of the ssh key used to log in to the remote instance.
  • debug (bool) – True in case flambe was installed in dev mode, False otherwise.
  • use_public (bool) – Whether this instance should use the public or private IP. By default, the public IP is used. The private host is used when inside a private LAN.
fix_relpaths_in_config(self)[source]

Updates all paths to be absolute. For example, “~/a/b/c” is changed to /home/user/a/b/c (using the appropriate $HOME value).
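
The idea can be sketched with the standard library alone. This is an illustration under assumptions, not the flambe implementation: absolutize_paths is a hypothetical helper, and it treats every value that starts with “~” as a path.

```python
import os
from configparser import ConfigParser


def absolutize_paths(config: ConfigParser) -> None:
    """Expand a leading '~' in every config value, in place."""
    for section in config.sections():
        # items() returns a list, so setting values while looping is safe.
        for option, value in config.items(section, raw=True):
            if value.startswith("~"):
                config.set(section, option, os.path.expanduser(value))


config = ConfigParser()
config.read_dict({"SSH": {"SSH_KEY": "~/a/b/c"}})
absolutize_paths(config)
print(config["SSH"]["SSH_KEY"])
```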

__enter__(self)[source]

Allows Instance objects to be used as context managers.

Returns:The current instance
Return type:Instance
__exit__(self, exc_type: Optional[Type[BaseException]], exc_value: Optional[BaseException], traceback: Optional[TracebackType])[source]

Exit method for the context manager.

This method will catch any exception raised inside the context and re-raise it.
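
A minimal sketch of the context manager protocol described by __enter__ and __exit__. ManagedResource is a hypothetical class used only for illustration; how Instance itself handles cleanup is not shown in this page.

```python
from types import TracebackType
from typing import Optional, Type


class ManagedResource:
    """Hypothetical class illustrating the protocol; not flambe code."""

    def __init__(self) -> None:
        self.closed = False

    def __enter__(self) -> "ManagedResource":
        # Returning self lets callers write `with ManagedResource() as r:`.
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_value: Optional[BaseException],
        traceback: Optional[TracebackType],
    ) -> bool:
        self.closed = True
        # Returning False tells Python to re-raise any pending exception.
        return False


resource = ManagedResource()
try:
    with resource as r:
        raise ValueError("boom")
except ValueError as err:
    caught = str(err)
```

Here the exception still reaches the caller, and the cleanup in __exit__ has already run by then.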

prepare(self)[source]

Runs all necessary processes to prepare the instances.

The child classes should implement this method according to the type of instance.

wait_until_accessible(self)[source]

Waits until the instance is accessible through SSHClient

It makes up to const.RETRIES attempts to reach the SSH port to see if it’s listening for incoming connections. In each attempt, it waits const.RETRY_DELAY.

Raises:ConnectionError – If the instance is inaccessible through SSH
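
The retry behavior can be sketched as a small loop. This is an assumption-laden illustration: wait_until and its probe argument are hypothetical, and the retries and delay parameters stand in for const.RETRIES and const.RETRY_DELAY.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], retries: int, delay: float) -> None:
    """Call `probe` up to `retries` times, sleeping `delay` seconds between tries."""
    for attempt in range(retries):
        if probe():
            return
        if attempt < retries - 1:
            time.sleep(delay)
    raise ConnectionError(f"not accessible after {retries} attempts")


# A fake probe that only succeeds on its third call.
calls = []


def fake_probe() -> bool:
    calls.append(1)
    return len(calls) >= 3


wait_until(fake_probe, retries=5, delay=0.01)
print(len(calls))
```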
is_up(self)[source]

Tests whether port 22 is open to incoming SSH connections

Returns:True if the instance is listening on port 22. False otherwise.
Return type:bool
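
One common way to test whether a port accepts connections is to attempt a TCP connection with a timeout. The sketch below assumes that approach; port_is_open is a hypothetical helper, and it is demonstrated against a throwaway local listener rather than a real SSH server.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Throwaway local listener; port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
free_port = server.getsockname()[1]

open_result = port_is_open("127.0.0.1", free_port)
server.close()
closed_result = port_is_open("127.0.0.1", free_port)
```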
_get_cli(self)[source]

Get an SSHClient in order to execute commands.

This will cache an existing SSHClient to optimize resource usage. This is a private method and should only be used in this module.

Returns:The client for later use.
Return type:paramiko.SSHClient
Raises:SSHConnectingError – In case opening an SSH connection fails.
_run_cmd(self, cmd: str, retries: int = 1, wd: str = None)[source]

Runs a single shell command in the instance through SSH.

The command will be executed in one ssh connection. Don’t expect several calls to _run_cmd to keep state between commands. To run multiple commands, use _run_script.

Important: when running docker containers, don’t use -it flag!

This is a private method and should only be used in this module.

Parameters:
  • cmd (str) – The command to execute.
  • retries (int) – The number of attempts to run the command if it fails. Defaults to 1.
  • wd (str) – The working directory to ‘cd’ before running the command
Returns:

A RemoteCommand instance with success boolean and message.

Return type:

RemoteCommand

Examples

To get $HOME env

>>> instance._run_cmd("echo $HOME")
RemoteCommand(True, "/home/ubuntu")

This will not work

>>> instance._run_cmd("export var=10")
>>> instance._run_cmd("echo $var")
RemoteCommand(False, "")

This will work

>>> instance._run_cmd("export var=10; echo $var")
RemoteCommand(True, "10")
Raises:RemoteCommandError – In case the cmd fails after retries attempts.
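
The same loss of state can be reproduced locally, without SSH. In this sketch each call starts a fresh sh process, which is only an analogy for one command per SSH connection; run_in_fresh_shell and the variable name FLAMBE_DOC_DEMO_VAR are made up for illustration, and the variable is assumed not to be set in the calling environment.

```python
import subprocess


def run_in_fresh_shell(cmd: str) -> str:
    """Run `cmd` in a new `sh` process and return its stripped stdout."""
    result = subprocess.run(
        ["sh", "-c", cmd], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


# State set in one shell is gone by the time the next one starts.
run_in_fresh_shell("export FLAMBE_DOC_DEMO_VAR=10")
separate = run_in_fresh_shell("echo $FLAMBE_DOC_DEMO_VAR")

# Chaining with ';' keeps both commands in the same shell.
chained = run_in_fresh_shell("export FLAMBE_DOC_DEMO_VAR=10; echo $FLAMBE_DOC_DEMO_VAR")
```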
_run_script(self, fname: str, desc: str)[source]

Runs a script by copying the script to the instance and executing it.

This is a private method and should only be used in this module.

Parameters:
  • fname (str) – The script filename
  • desc (str) – A description for the script purpose. This will be used for the copied filename
Returns:

A RemoteCommand instance with success boolean and message.

Return type:

RemoteCommand

Raises:

RemoteCommandError – In case the script fails.

_remote_script(self, host_fname: str, desc: str)[source]

Sends a local file containing a script to the instance using Paramiko SFTP.

It should be used as a context manager for later execution of the script. See _run_script on how to use it.

After the context manager exits, the file is removed from the instance.

This is a private method and should only be used in this module.

Parameters:
  • host_fname (str) – The local script filename
  • desc (str) – A description for the script purpose. This will be used for the copied filename
Yields:

str – The remote filename of the copied local file.

Raises:

RemoteCommandError – In case sending the script fails.
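
The copy, yield, then remove pattern can be sketched locally. This is not the flambe implementation: a temporary directory stands in for the remote instance, shutil.copy stands in for the SFTP transfer, and staged_script and the “.sh” suffix are assumptions.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def staged_script(host_fname: str, desc: str) -> Iterator[str]:
    """Copy a script to a staging directory, yield the copy, then delete it."""
    staging_dir = tempfile.mkdtemp()
    staged_fname = os.path.join(staging_dir, f"{desc}.sh")
    shutil.copy(host_fname, staged_fname)
    try:
        yield staged_fname
    finally:
        # Runs even if the body raises, so the copy never lingers.
        shutil.rmtree(staging_dir)


with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as src:
    src.write("echo hello\n")

with staged_script(src.name, "say_hello") as staged:
    existed_inside = os.path.exists(staged)
existed_after = os.path.exists(staged)
os.remove(src.name)
```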

run_cmds(self, setup_cmds: List[str])[source]

Execute a list of sequential commands

Parameters:setup_cmds (List[str]) – The list of commands
Raises:RemoteCommandError – In case at least one command is not successful
send_rsync(self, host_path: str, remote_path: str, params: List[str] = None)[source]

Send a local file or folder to a remote instance with rsync.

Parameters:
  • host_path (str) – The local filename or folder
  • remote_path (str) – The remote filename or folder to use
  • params (List[str], optional) – Extra parameters to be passed to rsync. For example, ["--filter=':- .gitignore'"]
Raises:

RemoteFileTransferError – In case sending the file fails.
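
How the extra parameters might be combined with the source and target can be sketched as follows. The helper build_rsync_cmd and the base flags it picks (-a and -e) are assumptions; the command flambe actually builds is not shown in this page. Nothing is executed here.

```python
from typing import List, Optional


def build_rsync_cmd(
    host_path: str,
    remote_path: str,
    username: str,
    host: str,
    key: str,
    params: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an rsync argv list; nothing is executed here."""
    cmd = ["rsync", "-a", "-e", f"ssh -i {key}"]
    cmd.extend(params or [])  # extra flags go before the source and target
    cmd.extend([host_path, f"{username}@{host}:{remote_path}"])
    return cmd


cmd = build_rsync_cmd(
    "./data", "/home/ubuntu/data", "ubuntu", "10.0.0.1", "~/.ssh/id_rsa",
    params=["--filter=:- .gitignore"],
)
print(cmd)
```

Because the command is a list of arguments rather than a shell string, the filter rule needs no inner quotes in this sketch.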

get_home_path(self)[source]

Return the $HOME value of the instance.

Returns:The $HOME env value.
Return type:str
Raises:RemoteCommandError – If after 3 retries it is not able to get $HOME.
clean_containers(self)[source]

Stop and remove all containers running

Raises:RemoteCommandError – If command fails
clean_container_by_image(self, image_name: str)[source]

Stop and remove all containers given an image name.

Parameters:image_name (str) – The name of the image for which all containers should be stopped and removed.
Raises:RemoteCommandError – If command fails
clean_container_by_command(self, command: str)[source]

Stop and remove all containers with the given command.

Parameters:command (str) – The command used to stop and remove the containers
Raises:RemoteCommandError – If command fails
install_docker(self)[source]

Install docker on an Ubuntu 18.04 distribution.

Raises:RemoteCommandError – If it’s not able to install docker, i.e. when the installation script fails
install_extensions(self, extensions: Dict[str, str])[source]

Install local + pypi extensions.

Parameters:extensions (Dict[str, str]) – The extensions, as a dict from module_name to location
Raises:errors.RemoteCommandError – If an extension could not be installed
install_flambe(self)[source]

Pip install Flambe.

If dev mode is activated, then it rsyncs the local flambe folder and installs that version. If not, downloads from pypi.

Raises:RemoteCommandError – If it’s not able to install flambe.
is_docker_installed(self)[source]

Check if docker is installed in the instance.

Executes the command “docker --version” and expects it not to fail.

Returns:True if docker is installed. False otherwise.
Return type:bool
is_flambe_installed(self, version: bool = True)[source]

Check if flambe is installed and if it matches version.

Parameters:version (bool) – If True, the version is also checked: the method returns True only if the remote flambe version matches the local flambe version, and False otherwise. If False, the method returns True if ANY flambe version is installed in the host.
Returns:True if flambe is installed (and, when version is True, matches the local version). False otherwise.
Return type:bool
is_docker_running(self)[source]

Check if docker is running in the instance.

Executes the command “docker ps” and expects it not to fail.

Returns:True if docker is running. False otherwise.
Return type:bool
start_docker(self)[source]

Restart docker.

Raises:RemoteCommandError – If it’s not able to restart docker.
is_node_running(self)[source]

Return whether the host is running a ray node

Returns:True if a ray node is running in the host. False otherwise.
Return type:bool
is_flambe_running(self)[source]

Return whether the host is running flambe

Returns:True if flambe is running in the host. False otherwise.
Return type:bool
existing_dir(self, _dir: str)[source]

Return whether a directory exists in the host

Parameters:_dir (str) – The name of the directory. It needs to be relative to $HOME
Returns:True if exists. Otherwise, False.
Return type:bool
shutdown_node(self)[source]

Shut down the ray node in the host.

If the node is also the main node, then the entire cluster will shut down

shutdown_flambe(self)[source]

Shut down flambe in the host

create_dirs(self, relative_dirs: List[str])[source]

Create the necessary folders in the host.

Parameters:relative_dirs (List[str]) – The directories to create. They should be relative paths and $HOME of each host will be used to add the prefix.
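
Prefixing each relative directory with $HOME could look like the sketch below. The helper mkdir_commands and the choice of mkdir -p are assumptions made for illustration; it only builds the command strings and runs nothing.

```python
import posixpath
import shlex
from typing import List


def mkdir_commands(home: str, relative_dirs: List[str]) -> List[str]:
    """Build one `mkdir -p` command per directory, prefixed with `home`."""
    commands = []
    for rel in relative_dirs:
        full_path = posixpath.join(home, rel)
        # Quote the path so spaces survive the remote shell.
        commands.append(f"mkdir -p {shlex.quote(full_path)}")
    return commands


commands = mkdir_commands("/home/ubuntu", ["experiments", "logs/run 1"])
print(commands)
```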
remove_dir(self, _dir: str, content_only: bool = True)[source]

Delete the specified directory.

Parameters:
  • _dir (str) – The directory. It needs to be relative to the $HOME path as it will be prepended as a prefix.
  • content_only (bool) – If True, the folder itself will not be erased.
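
The effect of content_only can be illustrated on a local temporary directory. This is a local analogy under assumptions, not the remote implementation; remove_dir_local is a hypothetical helper.

```python
import os
import shutil
import tempfile


def remove_dir_local(path: str, content_only: bool = True) -> None:
    """Delete a directory's contents, and the directory too if requested."""
    if not content_only:
        shutil.rmtree(path)
        return
    # Materialize the listing first so deletions do not affect iteration.
    for entry in list(os.scandir(path)):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)


root = tempfile.mkdtemp()
open(os.path.join(root, "result.txt"), "w").close()
os.mkdir(os.path.join(root, "sub"))

remove_dir_local(root, content_only=True)
kept_empty = os.path.isdir(root) and os.listdir(root) == []
remove_dir_local(root, content_only=False)
gone = not os.path.exists(root)
```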
contains_gpu(self)[source]

Return whether this machine contains a GPU.

This method will be used to possibly upgrade this factory to a GPUFactoryInstance.

class flambe.cluster.instance.instance.CPUFactoryInstance[source]

Bases: flambe.cluster.instance.instance.Instance

This class represents a CPU Instance in the Ray cluster.

CPU Factories are instances that can run only one worker (no GPUs available). This class is mostly useful for debugging.

Factory instances will not keep any important information. All information is going to be sent to an orchestrator machine.

prepare(self)[source]

Prepare a CPU machine to be a worker node.

Checks if flambe is installed, and if not, installs it.

Raises:RemoteCommandError – In case any step of the preparing process fails.
launch_node(self, redis_address: str)[source]

Launch the ray worker node.

Parameters:redis_address (str) – The URL of the main node. Must be IP:port
Raises:RemoteCommandError – If not able to run node.
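
Since the address must have the form IP:port, a caller may want to validate it before launching the node. The following is a hypothetical validation helper (parse_address is not part of flambe), shown for IPv4 addresses.

```python
import ipaddress
from typing import Tuple


def parse_address(address: str) -> Tuple[str, int]:
    """Split 'IP:port' into its parts, raising ValueError if malformed."""
    ip, sep, port = address.rpartition(":")
    if not sep:
        raise ValueError(f"expected IP:port, got {address!r}")
    ipaddress.ip_address(ip)  # raises ValueError for a non-IP host
    port_number = int(port)   # raises ValueError for a non-numeric port
    if not 0 < port_number < 65536:
        raise ValueError(f"port out of range: {port_number}")
    return ip, port_number


parsed = parse_address("10.0.0.5:6379")
print(parsed)
```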
num_cpus(self)[source]

Return the number of CPUs this host contains.

num_gpus(self)[source]

Get the number of GPUs this host contains

Returns:The number of GPUs
Return type:int
Raises:RemoteCommandError – If command to get the number of GPUs fails.
class flambe.cluster.instance.instance.GPUFactoryInstance[source]

Bases: flambe.cluster.instance.instance.CPUFactoryInstance

This class represents an Nvidia GPU Factory Instance.

Factory instances will not keep any important information. All information is going to be sent to an Orchestrator machine.

prepare(self)[source]

Prepare a GPU instance to run a ray worker node. For this, it installs CUDA and flambe if not installed.

Raises:RemoteCommandError – In case any step of the preparing process fails.
install_cuda(self)[source]

Install CUDA 10.0 drivers on an Ubuntu 18.04 distribution.

Raises:RemoteCommandError – If it’s not able to install the drivers, i.e. if the script fails
is_cuda_installed(self)[source]

Check if CUDA is installed by trying to execute nvidia-smi

Returns:True if CUDA is installed. False otherwise.
Return type:bool
class flambe.cluster.instance.instance.OrchestratorInstance[source]

Bases: flambe.cluster.instance.instance.Instance

The orchestrator instance will be the main machine in a cluster.

It is going to be the main node in the ray cluster and it will also host other services. TODO: complete

All services besides ray will run in docker containers.

This instance does not need to be a GPU machine.

prepare(self)[source]

Install docker and flambe

Raises:RemoteCommandError – In case any step of the preparing process fails.
launch_report_site(self, progress_file: str, port: int, output_log: str, output_dir: str, tensorboard_port: int)[source]

Launch the report site.

The report site is a Flask web app.

Raises:RemoteCommandError – In case the launch process fails
is_tensorboard_running(self)[source]

Return whether tensorboard is running in the host as a docker container.

Returns:True if Tensorboard is running, False otherwise.
Return type:bool
is_report_site_running(self)[source]

Return whether the report site is running in the host

Returns:True if the report site is running. False otherwise.
Return type:bool
remove_tensorboard(self)[source]

Removes tensorboard from the orchestrator.

remove_report_site(self)[source]

Remove report site from the orchestrator.

launch_tensorboard(self, logs_dir: str, tensorboard_port: int)[source]

Launch tensorboard.

Parameters:
  • logs_dir (str) – Tensorboard logs directory
  • tensorboard_port (int) – The port where tensorboard will be available
Raises:

RemoteCommandError – In case the launch process fails

existing_tmux_session(self, session_name: str)[source]

Return whether there is an existing tmux session with the given name

Parameters:session_name (str) – The exact name of the searched tmux session
Returns:True if a session with that exact name exists. False otherwise.
Return type:bool
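
Exact-name matching matters because a prefix match would confuse sessions such as “flambe” and “flambe-debug”. The sketch below assumes the session names are available one per line, as printed by tmux list-sessions -F '#{session_name}'; how flambe actually queries tmux is not shown in this page, and session_exists is a hypothetical helper.

```python
from typing import List


def session_exists(list_output: str, session_name: str) -> bool:
    """Return True if `session_name` exactly matches a listed session."""
    names: List[str] = [line.strip() for line in list_output.splitlines()]
    return session_name in names


# Sample text shaped like one session name per line.
sample = "flambe\nflambe-debug\n"
exact = session_exists(sample, "flambe")
prefix_only = session_exists(sample, "flam")
```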
kill_tmux_session(self, session_name: str)[source]

Kill an existing tmux session

Parameters:session_name (str) – The exact name of the tmux session to be removed
launch_flambe(self, config_file: str, secrets_file: str, force: bool)[source]

Launch flambe execution in the remote host

Parameters:
  • config_file (str) – The config filename relative to the orchestrator
  • secrets_file (str) – The filepath containing the secrets for the orchestrator
  • force (bool) – The force parameter that was originally passed to flambe
launch_node(self, port: int)[source]

Launch the main ray node in given sftp server in port 49559.

Parameters:port (int) – Available port to launch the redis DB of the main ray node
Raises:RemoteCommandError – In case the launch process fails
worker_nodes(self)[source]

Returns the list of worker nodes

Returns:The list of worker nodes identified by their hostname
Return type:List[str]
rsync_folder(self, _from, _to, exclude=None)[source]

Rsyncs folders or files.

One of the folders NEEDS to be local. The remaining one can be remote if needed.