Flambé supports running remote
Runnables where jobs can be distributed across a cluster of workers.
Overall remote architecture¶
Flambé will create the following cluster when running a
The Orchestrator is the main machine in the cluster. The Orchestrator might host websites, run docker containers, etc. It can also collect artifacts like checkpoints or logs.
This machine doesn’t need to contain a GPU as it does not perform heavy computations.
The factories are instances that are capable of doing heavy computational work and likely need to have GPU resources
(for example, if you’re running an
Experiment with PyTorch and CUDA).
Orchestrator and Factories have private SSH connection with a pair of keys that are create and
distributed specially for the specific
Cluster. More information about this
Launching a cluster¶
Cluster is a special type of
implementation that handles clusters of machines (e.g. AWS instances) that are capable
of running distributed jobs. As with any
Runnable, you can run a cluster by executing
flambé with the YAML config as an argument:
Cluster is an abstract class because it depends on the cloud service provider, so users
will need to use one of the provided implementations or create a custom one by overriding the abstract methods.
We currently provide a full cluster implementation for AWS; see Using AWS
Setting the cluster up¶
All implementations of
Cluster support setting
which are a list of
bash commands that will run on all instances after creating the cluster:
!XXXCluster name: my_cluster ... setup_cmds: - sshfs [email protected]:/path/to/remote /path/to/local/mount/point # Mount a remote filesystem - pip config set index_url https://my-custom-pypi.com # Configure PyPI
Note that all commands will run sequentially in all the hosts of the cluster.
This could be useful for mounting volumes, configuring tools or install binaries.
Submitting Jobs to a Cluster¶
Given an experiment.yaml config file, running it remotely is as easy as:
flambe experiment.yaml --cluster cluster.yaml [--force]
Flambé will take care of preparing the cluster to run the
ClusterRunnable (in this case an
--force option is necessary when an existing execution is taking place in the
same cluster and the user wants to override it.
There is no need to run
flambe cluster.yaml before running a
ClusterRunnable in it.
If it’s the first time using the cluster, flambé will create it for you!
We provide full AWS integration using the
AWSCluster implementation. When using this cluster,
flambé will take care of:
- Building the cluster
- Preparing all instances (e.g. installing the version of flambé that matches what the user has locally)
- Automatically shutting the cluster down (if specified)
How to use AWSCluster?
!AWSCluster name: name-of-the-cluster # Pick a unique identifier for the cluster factories_num: 1 # The amount of factories factories_type: g3.4xlarge # The type of factories. GPU instances are recommended. orchestrator_type: t3.large # The type of the orchestrator (GPU is not necessary). orchestrator_timeout: -1 # # -1 means the orchestrator will have to be killed manually (recommended) factories_timeout: -1 # Factories timeout after being unused for these many hours creator: [email protected] key_name: aws-key-name tags: # Extra tags to add to all instances company: my-company key: /path/to/ssh/key subnet_id: subnet-abcdef volume_size: 100. # GBs of disk space for all instances security_group: sg-0987654321
For a full description, see
AWSCluster implementation provides a way of automatically shutting down all instances that have been created:
!AWSCluster # rest of manager config orchestrator_timeout: 5 factories_timeout: 0
These parameters specify how many hours the resources will persist with low CPU consumption.
In the above example, the Orchestrator will be terminated after 5 hours of low CPU usage. The Factories will be terminated as soon as CPU usage goes down.
-1 to keep the resources alive permanently, or until you manually stop them.
For a full example of a configuration file for a Cluster, go here.
ClusterRunnables remotely, the correct version of Flambé will be installed automatically,
i.e. the version being used locally. For example, if the user has
flambe==1.2 installed locally,
then all instances (orchestrator and factories) will be using version
This is also valid in developer mode. More on developer mode in Debugging.
ClusterRunnable is a special implementation of a
that is able to execute on a flambé cluster.
Experiment object, for example, is a
Users are able to create custom
ClusterRunnables by implementing its interface
(which extends from the
Runnable interface as well).
This new interface requires an additional implementation for the
1 2 3 4 5 6 7 8
from flambe.runnable import ClusterRunnable class MyClusterRunnable(ClusterRunnable): def setup(self, cluster: Cluster, extensions: Dict[str, str], force: bool, **kwargs) -> None: # code to setup the cluster
setup() method should prepare the cluster (which is received as a parameter)
to run the
Runnable remotely. This usually involves creating folders, downloading resources,
running docker containers, etc.
Cluster implementations provides basic functionality that allow directory creation, running bash commands,
rsyncing folders, running docker containers and much more. See its documentation for more information about this.
It’s highly likely that you will need to change some instance attributes in the object in the
For doing this, you should use
set_serializable_attr() to ensure that the attribute change is serializable.
How to run a ClusterRunnable
For running a
ClusterRunnable remotely, you will need to provide a cluster configuration:
flambe cluster_runnable.yaml --cluster cluster.yaml
Because of being
Runnable, it can still be executed locally:
Users can get the most of performance by running
Experiments in a
Experiments, a ray cluster will be created connecting all instances
in the cluster. The Orchestrator will host Tensorboard and the Report Site (the URL will
be provided in the console) and the Factories will do the heavy work executing the
Additionally, when running remote
Experiments, flambé will take care of uploading the local resources
that were specified, making them available to all instances.