flambe.cluster.aws

Implementation of a Cluster with AWS EC2 as the cloud provider

Module Contents

flambe.cluster.aws.logger[source]
flambe.cluster.aws.T[source]
class flambe.cluster.aws.AWSCluster(name: str, factories_num: int, factories_type: str, orchestrator_type: str, key_name: str, security_group: str, subnet_id: str, creator: str, key: str, username: str = 'ubuntu', tags: Dict[str, str] = None, orchestrator_ami: str = None, factory_ami: str = None, dedicated: bool = False, orchestrator_timeout: int = -1, factories_timeout: int = 1, volume_size: int = 100, setup_cmds: Optional[List[str]] = None)[source]

Bases: flambe.cluster.cluster.Cluster

This Cluster implementation uses AWS EC2 as the cloud provider.

This cluster works with AWS Instances that are defined in: flambe.remote.instance.aws

Parameters:
  • name (str) – The unique name for the cluster
  • factories_num (int) – The number of factories to use. This is not the number of workers, as each factory can contain multiple GPUs and therefore multiple workers.
  • factories_type (str) – The type of instance to use for the Factory instances. GPU instances are required for the AWSCluster. “p2” and “p3” instances are recommended.
  • factory_ami (str) – The AMI to be used for the Factory instances. Custom Flambe AMIs are provided, based on the Ubuntu 18.04 distribution.
  • orchestrator_type (str) – The type of instance to use for the Orchestrator instance. This does not need to be a GPU instance. At least a “t2.small” instance is recommended.
  • key_name (str) – The key name that will be used to connect to the instances.
  • creator (str) – The creator should be a user identifier for the instances. This information will create a tag called ‘creator’ and it will also be used to retrieve existing hosts owned by the user.
  • key (str) – The path to the SSH key used to communicate with all instances. IMPORTANT: all instances must be accessible with the same key.
  • username (str) – The username of the instances the cluster will handle. Defaults to ‘ubuntu’. IMPORTANT: for now all instances need to have the same username.
  • tags (Dict[str, str]) – A dictionary with tags that will be added to all created hosts.
  • security_group (str) – The security group to use to create the instances.
  • subnet_id (str) – The subnet ID to use.
  • orchestrator_ami (str) – The AMI to be used for the Orchestrator instance. Custom Flambe AMIs are provided, based on the Ubuntu 18.04 distribution.
  • dedicated (bool) – Whether all created instances are dedicated instances or shared.
  • orchestrator_timeout (int) – Number of consecutive hours before terminating the orchestrator once the experiment is over (either success or failure). Specify -1 to disable automatic shutdown (the orchestrator will stay on until manually terminated) and 0 to shut down as soon as the experiment is over. For example, if specifying 24, the orchestrator will be shut down one day after the experiment is over. ATTENTION: This also applies when the experiment ends with an error. Default is -1.
  • factories_timeout (int) – Number of consecutive hours before automatically terminating the factories once the experiment is over (either success or failure). Specify -1 to disable automatic shutdown (the factories will stay on until manually terminated) and 0 to shut down as soon as the experiment is over. For example, if specifying 10, the factories will be shut down 10 hours after the experiment is over. ATTENTION: This also applies when the experiment ends with an error. Default is 1.
  • volume_size (int) – The disk size in GB that all hosts will contain. Defaults to 100 GB.
  • setup_cmds (Optional[List[str]]) – A list of commands to be run on all hosts for setup purposes. These commands can be used to mount volumes, install software, etc. Defaults to None. IMPORTANT: the commands need to be idempotent and they shouldn’t expect user input.
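The special values accepted by orchestrator_timeout and factories_timeout can be summarized in a small helper. This is only an illustrative sketch: the function name timeout_to_minutes is hypothetical and not part of flambe, and the conversion from hours to minutes is an assumption about how the value might be consumed later.

```python
from typing import Optional


def timeout_to_minutes(timeout_hours: int) -> Optional[int]:
    """Interpret a timeout value as documented for the two timeout parameters.

    Hypothetical helper, not part of flambe.
    Returns None when automatic shutdown is disabled (-1); otherwise returns
    the number of minutes to wait after the experiment ends (0 means
    shut down right away).
    """
    if timeout_hours < -1:
        raise ValueError("timeout must be -1 or a non-negative number of hours")
    if timeout_hours == -1:
        return None  # stays on until manually terminated
    return timeout_hours * 60


print(timeout_to_minutes(-1))  # None
print(timeout_to_minutes(24))  # 1440
```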
_load_boto_apis(self)[source]

Load the ec2 and cloudwatch apis.

This method is called by the constructor.

load_all_instances(self)[source]

Launch all instances for the experiment.

This method launches both the orchestrator and the factories.

_existing_cluster(self)[source]

Whether there is an existing cluster that matches name.

The cluster should also match all other tags, including Creator.

Returns:The (boto_orchestrator, [boto_factories]) that match the experiment’s name.
Return type:Tuple[Any, List[Any]]
_get_tags(self, boto_instance: boto3.resources.factory.ec2.Instance)[source]

Gets the tags of an EC2 instance.

Parameters:boto_instance (BotoIns) – The EC2 instance to access the tags.
Returns:Key, Value for the specified tags.
Return type:Dict[str, str]
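boto3 exposes the tags of an EC2 instance as a list of {'Key': ..., 'Value': ...} dictionaries (or None when the instance has no tags). A minimal sketch of flattening that list into the Dict[str, str] described above follows; the helper name tags_to_dict is hypothetical, and the actual implementation may differ.

```python
from typing import Dict, List, Optional


def tags_to_dict(raw_tags: Optional[List[Dict[str, str]]]) -> Dict[str, str]:
    """Flatten boto3-style tags into a plain key/value dictionary.

    Hypothetical helper; `raw_tags` mirrors the shape of `Instance.tags`.
    """
    if not raw_tags:
        # boto3 reports None for instances without tags
        return {}
    return {tag["Key"]: tag["Value"] for tag in raw_tags}


example = [
    {"Key": "Name", "Value": "my-orchestrator"},
    {"Key": "creator", "Value": "alice"},
]
print(tags_to_dict(example))
```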
flambe_own_running_instances(self)[source]

Get running instances with matching tags.

Yields:Tuple[‘boto3.resources.factory.ec2.Instance’, str] – A tuple with the instance and the name of the EC2 instance.
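One way to select running instances with matching tags is through EC2 filters. The sketch below only builds the filter list and makes no AWS call; the function name build_running_filters is hypothetical, and which tags the cluster actually matches on is an assumption.

```python
from typing import Any, Dict, List


def build_running_filters(tags: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build EC2 filters for running instances that carry the given tags.

    Hypothetical helper. The result has the shape accepted by
    `ec2.instances.filter(Filters=...)` in boto3.
    """
    filters: List[Dict[str, Any]] = [
        {"Name": "instance-state-name", "Values": ["running"]}
    ]
    for key, value in sorted(tags.items()):
        # "tag:<key>" matches instances whose tag <key> has the given value
        filters.append({"Name": f"tag:{key}", "Values": [value]})
    return filters


print(build_running_filters({"creator": "alice"}))
```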
name_hosts(self)[source]

Name the orchestrator and factories.

update_tags(self)[source]

Apply the user-provided tags to all hosts.

If an existing cluster does not contain all the tags, executing this method ensures that all hosts have the user-specified tags.

This won’t remove existing tags on the hosts.
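The "add or overwrite, never remove" behaviour described here can be illustrated with plain dictionaries. This is a sketch under that assumption only; merge_tags is a hypothetical name and does not touch AWS.

```python
from typing import Dict


def merge_tags(existing: Dict[str, str], user_tags: Dict[str, str]) -> Dict[str, str]:
    """Return the tags a host would carry after applying the user's tags.

    Hypothetical helper: existing tags are kept, user tags are added,
    and a user tag wins when both define the same key.
    """
    merged = dict(existing)   # copy so the input is not mutated
    merged.update(user_tags)  # create or overwrite, never delete
    return merged


current = {"Name": "demo_factory_0", "team": "nlp"}
wanted = {"team": "research", "project": "seq2seq"}
print(merge_tags(current, wanted))
```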

_update_tags(self, boto_instance: boto3.resources.factory.ec2.Instance, tags: Dict[str, str])[source]

Create/Overwrite tags on an EC2 instance

Parameters:
  • boto_instance ('boto3.resources.factory.ec2.Instance') – The EC2 instance
  • tags (Dict[str, str]) – The tags to create/overwrite
name_instance(self, boto_instance: boto3.resources.factory.ec2.Instance, name: str)[source]

Renames an EC2 instance.

Parameters:
  • boto_instance ('boto3.resources.factory.ec2.Instance') – The EC2 instance
  • name (str) – The new name
_create_orchestrator(self)[source]

Create a new EC2 instance to be the Orchestrator instance.

This new machine receives all tags defined in the *.ini file.

Returns:The new orchestrator instance.
Return type:instance.AWSOrchestratorInstance
_create_factories(self, number: int = 1)[source]

Creates new AWS EC2 instances to be the Factory instances.

These new machines receive all tags defined in the *.ini file. Factory instances will be named using the factory basename plus an index. For example, “seq2seq_factory_0”, “seq2seq_factory_1”.

Parameters:number (int) – The number of factories to be created.
Returns:The new factory instances.
Return type:List[instance.AWSGPUFactoryInstance]
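The naming scheme mentioned above (factory basename plus an index) can be sketched as follows. The helper name factory_names is hypothetical, and how the basename itself is derived is not specified here, so it is passed in as an argument.

```python
from typing import List


def factory_names(basename: str, number: int) -> List[str]:
    """Generate factory names as '<basename>_<index>', starting at 0.

    Hypothetical helper illustrating the documented naming pattern.
    """
    return [f"{basename}_{index}" for index in range(number)]


print(factory_names("seq2seq_factory", 2))
# ['seq2seq_factory_0', 'seq2seq_factory_1']
```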
_generic_launch_instances(self, instance_class: Type[T], number: int, instance_type: str, instance_ami: str, role: str)[source]

Generic method to launch instances in AWS EC2 using boto3.

This method should not be used outside this module.

Parameters:
  • instance_class (Type[T]) – The instance class. It can be AWSOrchestratorInstance or AWSGPUFactoryInstance.
  • number (int) – The number of instances to create
  • instance_type (str) – The instance type
  • instance_ami (str) – The AMI to be used. Should be an Ubuntu 18.04 based AMI.
  • role (str) – Whether the role is ‘Orchestrator’ or ‘Factory’
Returns:The new instances.
Return type:List[Union[AWSOrchestratorInstance, AWSGPUFactoryInstance]]

terminate_instances(self)[source]

Terminates all instances.

rollback_env(self)[source]

Rollback the environment.

This occurs when an error is caught during the local stage of the remote experiment (i.e. creating the cluster, sending the data and submitting jobs). This method handles the cleanup.

parse(self)[source]

Checks if the AWSCluster configuration is valid.

This checks that the factories are never terminated after the orchestrator is. Avoids the scenario where the cluster has only factories and no orchestrator, which is useless.

Raises:errors.ClusterConfigurationError – If configuration is not valid.
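A check of this kind could look like the sketch below. It is an interpretation of the rule stated above, not the library's code: the function name check_timeouts is hypothetical, the exception class is a local stand-in for errors.ClusterConfigurationError, and treating equal timeouts as valid is an assumption.

```python
class ClusterConfigurationError(Exception):
    """Local stand-in for flambe's errors.ClusterConfigurationError."""


def check_timeouts(orchestrator_timeout: int, factories_timeout: int) -> None:
    """Raise if the factories could outlive the orchestrator.

    Hypothetical helper. A value of -1 means "never terminated automatically".
    """
    if orchestrator_timeout == -1:
        # The orchestrator is never auto-terminated, so any factory setting is fine.
        return
    if factories_timeout == -1 or factories_timeout > orchestrator_timeout:
        raise ClusterConfigurationError(
            "Factories would still be running after the orchestrator is terminated."
        )


check_timeouts(orchestrator_timeout=-1, factories_timeout=1)  # defaults: valid
try:
    check_timeouts(orchestrator_timeout=2, factories_timeout=5)
except ClusterConfigurationError as err:
    print(f"Invalid configuration: {err}")
```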
_get_boto_instance_by_host(self, public_host: str)[source]

Returns the boto instance given the public host

Parameters:public_host (str) – The host in IP format or DNS format
Returns:The instance if found else None
Return type:Optional[boto3.resources.factory.ec2.Instance]
_get_instance_id_by_host(self, public_host: str)[source]

Returns the instance id given the public host

Parameters:public_host (str) – The host in IP format or DNS format
Returns:The id if found else None
Return type:Optional[str]
_get_alarm_name(self, instance_id: str)[source]

Get the alarm name to be used for the given instance.

Parameters:instance_id (str) – The id of the instance
Returns:The name of the corresponding alarm
Return type:str
has_alarm(self, instance_id: str)[source]

Whether the instance has an alarm set.

Parameters:instance_id (str) – The id of the instance
Returns:True if an alarm is set. False otherwise.
Return type:bool
remove_existing_events(self)[source]

Remove the current alarm.

In case the orchestrator or factories had an alarm, we remove it to reset the new policies.

create_cloudwatch_events(self)[source]

Creates cloudwatch events for orchestrator and factories.

_delete_cloudwatch_event(self, instance_id: str)[source]

Deletes the alarm related to the instance.

_create_cloudwatch_event(self, instance_id: str, mins: int = 60, cpu_thresh: float = 0.1)[source]

Create CloudWatch alarm.

The alarm is used to terminate an instance based on CPU usage.

Parameters:
  • instance_id (str) – The ID of the EC2 instance
  • mins (int) – Number of minutes to trigger the termination event. The evaluation period will always be one minute.
  • cpu_thresh (float) – Percentage specifying the upper bound for triggering the event. If mins is 60 and cpu_thresh is 0.1, the instance will be terminated after 1 hour of average CPU usage below 0.1.
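To make the relationship between mins, cpu_thresh and the one-minute evaluation period concrete, the sketch below builds the keyword arguments one might pass to the CloudWatch put_metric_alarm call in boto3. It makes no AWS call. The function name build_alarm_kwargs, the alarm name, and the region argument are assumptions for illustration; the actual implementation may configure the alarm differently.

```python
from typing import Any, Dict


def build_alarm_kwargs(
    instance_id: str,
    alarm_name: str,
    region: str,
    mins: int = 60,
    cpu_thresh: float = 0.1,
) -> Dict[str, Any]:
    """Build arguments for a low-CPU alarm that terminates an instance.

    Hypothetical helper: the dictionary uses parameter names accepted by
    CloudWatch's `put_metric_alarm`.
    """
    return {
        "AlarmName": alarm_name,
        "ActionsEnabled": True,
        # Built-in EC2 action that terminates the instance when the alarm fires
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:terminate"],
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Statistic": "Average",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Period": 60,               # each evaluation period is one minute
        "EvaluationPeriods": mins,  # so `mins` consecutive periods must breach
        "Threshold": cpu_thresh,
        "ComparisonOperator": "LessThanThreshold",
    }


kwargs = build_alarm_kwargs("i-0123456789abcdef0", "example-alarm", "us-east-1")
print(kwargs["EvaluationPeriods"], kwargs["Threshold"])  # 60 0.1
```

With a boto3 CloudWatch client available, such a dictionary could be passed as client.put_metric_alarm(**kwargs).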