flambe.dataset

Package Contents

class flambe.dataset.Dataset[source]

Bases: flambe.Component

Base Dataset interface.

Dataset objects offer the main interface to loading data into the experiment pipeline. Dataset objects have three attributes: train, val, and test, each pointing to a list of examples.

Note that Datasets should also be “immutable”, and as such, __setitem__ and __delitem__ will raise an error. This does not prevent the object from being mutated through other means, but it helps avoid common accidental modifications.

train :Sequence[Sequence]

Returns the training data as a sequence of examples.

val :Sequence[Sequence]

Returns the validation data as a sequence of examples.

test :Sequence[Sequence]

Returns the test data as a sequence of examples.

__setitem__(self)

Raise an error.

__delitem__(self)

Raise an error.
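The immutability contract can be sketched with a minimal stand-in class (an illustration of the interface above, not flambe’s actual implementation; the class name and the choice of ValueError are assumptions):

```python
from typing import Sequence


class ImmutableDataset:
    """Minimal sketch of the Dataset immutability contract (not flambe's code)."""

    def __init__(self, train: Sequence, val: Sequence, test: Sequence):
        self._train, self._val, self._test = train, val, test

    @property
    def train(self) -> Sequence[Sequence]:
        return self._train

    def __setitem__(self, key, value):
        # Datasets are read-only: item assignment is rejected outright.
        raise ValueError("Dataset objects are immutable")

    def __delitem__(self, key):
        # Likewise, deleting examples is rejected.
        raise ValueError("Dataset objects are immutable")
```

Any attempt at `ds[0] = example` or `del ds[0]` then fails loudly instead of silently corrupting a split shared across the pipeline.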

class flambe.dataset.TabularDataset(train: Iterable[Iterable], val: Optional[Iterable[Iterable]] = None, test: Optional[Iterable[Iterable]] = None, cache: bool = True, named_columns: Optional[List[str]] = None, transform: Dict[str, Union[Field, Dict]] = None)[source]

Bases: flambe.dataset.Dataset

Loader for tabular data, usually in csv or tsv format.

A TabularDataset can represent any data that can be organized in a table. Internally, we store all information in a 2D numpy generic array. This object also behaves as a sequence over the whole dataset, chaining the training, validation and test data, in that order. This is useful in creating vocabularies or loading embeddings over the full dataset.
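The chaining behavior described above can be sketched as follows (a simplified illustration using plain numpy, not flambe’s internals; the example data is made up):

```python
import numpy as np

# Hypothetical splits; each row is one example.
train = np.array([["a", 1], ["b", 2]], dtype=object)
val = np.array([["c", 3]], dtype=object)
test = np.array([["d", 4]], dtype=object)

# Indexing the full dataset walks train, then val, then test, which is
# convenient when building a vocabulary over every split at once.
full = np.concatenate([train, val, test])
vocab = {row[0] for row in full}
```

Index 2 of `full` falls in the validation split, since the two training rows come first.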

train

The list of training examples

Type:np.ndarray
val

The list of validation examples

Type:np.ndarray
test

The list of test examples

Type:np.ndarray
train :np.ndarray

Returns the training data as a numpy ndarray.

val :np.ndarray

Returns the validation data as a numpy ndarray.

test :np.ndarray

Returns the test data as a numpy ndarray.

raw :np.ndarray

Returns all partitions of the data as a numpy ndarray.

cols :int

Returns the number of columns in the tabular dataset.

_set_transforms(self, transform: Dict[str, Union[Field, Dict]])

Set transformations attributes and hooks to the data splits.

This method adds attributes for each field in the transform dict. It also adds hooks for the ‘process’ call in each field.

ATTENTION: This method operates on the hidden _train, _val and _test attributes, since it runs in the constructor and creates the hooks that are later used by the public properties.

classmethod from_path(cls, train_path: str, val_path: Optional[str] = None, test_path: Optional[str] = None, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)

Load a TabularDataset from the given file paths.

Parameters:
  • train_path (str) – The path to the train data
  • val_path (str, optional) – The path to the optional validation data
  • test_path (str, optional) – The path to the optional test data
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (List[str]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
  • transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified with a name for easy linking.
classmethod autogen(cls, data_path: str, test_path: Optional[str] = None, seed: Optional[int] = None, test_ratio: Optional[float] = 0.2, val_ratio: Optional[float] = 0.2, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)

Generate a test and validation set from the given file paths, then load a TabularDataset.

Parameters:
  • data_path (str) – The path to the data
  • test_path (Optional[str]) – The path to the test data
  • seed (Optional[int]) – Random seed to be used in test/val generation
  • test_ratio (Optional[float]) – The ratio of the test dataset in relation to the whole dataset. If test_path is specified, this field has no effect.
  • val_ratio (Optional[float]) – The ratio of the validation dataset in relation to the training dataset (whole - test)
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (List[str]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
  • transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified with a name for easy linking.
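The ratio semantics above can be sketched in plain Python (an illustration of how test_ratio and val_ratio interact, not flambe’s implementation; the function name and shuffle strategy are assumptions):

```python
import random


def split(data, test_ratio=0.2, val_ratio=0.2, seed=None):
    """Carve a test set from the whole, then a val set from the remainder."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    n_test = int(len(data) * test_ratio)
    test, rest = data[:n_test], data[n_test:]
    # val_ratio applies to the remaining training portion (whole - test).
    n_val = int(len(rest) * val_ratio)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

With 100 examples and the default ratios, this yields 20 test examples, then 16 validation examples out of the remaining 80, leaving 64 for training.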
classmethod _load_file(cls, path: str, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8')

Load data from the given path.

The path may be either a single file or a directory. If it is a directory, each file is loaded according to the specified options and all the data is concatenated into a single list. The files will be processed in order based on file name.

Parameters:
  • path (str) – Path to data, could be a directory, a file, or a smart_open link
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (Optional[Union[List[str], List[int]]]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
Returns:

A tuple containing the list of examples (where each example is itself also a list or tuple of entries in the dataset) and an optional list of named columns (one string for each column in the dataset)

Return type:

Tuple[List[Tuple], Optional[List[str]]]
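The directory behavior and return shape can be sketched with the standard csv module (a simplified stand-in for the pandas-based loader; the function name is hypothetical, and it assumes every file carries a header row):

```python
import csv
import os


def load_path(path, sep="\t"):
    """Load one file, or every file in a directory in sorted filename order."""
    files = [path]
    if os.path.isdir(path):
        files = sorted(os.path.join(path, f) for f in os.listdir(path))
    examples, columns = [], None
    for name in files:
        with open(name, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f, delimiter=sep))
        columns = rows[0]  # assumption: a header row on every file
        examples.extend(tuple(r) for r in rows[1:])
    # Matches the documented return type:
    # Tuple[List[Tuple], Optional[List[str]]]
    return examples, columns
```

Because files are visited in sorted filename order, the concatenated example list is deterministic across runs.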

__len__(self)

Get the length of the dataset.

__iter__(self)

Iterate through the dataset.

__getitem__(self, index)

Get the item at the given index.
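Together, these three dunder methods make the dataset behave as one sequence over the chained splits, which can be sketched as (a minimal illustration, not flambe’s code; the class name is an assumption):

```python
from itertools import chain


class ChainedSplits:
    """Sketch of the sequence protocol over train, val and test, in order."""

    def __init__(self, train, val, test):
        self.splits = (train, val, test)

    def __len__(self):
        # Total number of examples across all three splits.
        return sum(len(s) for s in self.splits)

    def __iter__(self):
        # Iterate train, then val, then test.
        return chain(*self.splits)

    def __getitem__(self, index):
        # Walk the splits in order to resolve a global index.
        for split in self.splits:
            if index < len(split):
                return split[index]
            index -= len(split)
        raise IndexError(index)
```

Global index 2 on a dataset with two training examples therefore resolves to the first validation example.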