flambe.dataset

Package Contents

class flambe.dataset.Dataset[source]

Bases: flambe.Component

Base Dataset interface.

Dataset objects offer the main interface to loading data into the experiment pipeline. Dataset objects have three attributes: train, val, and test, each pointing to a list of examples.

Note that Datasets should also be “immutable”, and as such, __setitem__ and __delitem__ will raise an error. This does not prevent the object from being mutated through other means, but it helps avoid common accidental modifications.

train :Sequence[Sequence]

Returns the training data as a sequence of examples.

val :Sequence[Sequence]

Returns the validation data as a sequence of examples.

test :Sequence[Sequence]

Returns the test data as a sequence of examples.

__setitem__(self)

Raise an error.

__delitem__(self)

Raise an error.
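The immutability contract can be sketched with a minimal stand-in class (an illustration of the interface above, not flambe’s actual implementation; the class name and the choice of ValueError are assumptions):

```python
from typing import Sequence


class ImmutableDataset:
    """Minimal sketch of the Dataset immutability contract (not flambe's code)."""

    def __init__(self, train: Sequence, val: Sequence, test: Sequence):
        self._train, self._val, self._test = train, val, test

    @property
    def train(self) -> Sequence[Sequence]:
        return self._train

    def __setitem__(self, key, value):
        # Datasets are read-only: item assignment is rejected outright.
        raise ValueError("Dataset objects are immutable")

    def __delitem__(self, key):
        # Likewise, deleting examples is rejected.
        raise ValueError("Dataset objects are immutable")
```

Any attempt at `ds[0] = example` or `del ds[0]` then fails loudly instead of silently corrupting a split shared across the pipeline.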

class flambe.dataset.TabularDataset(train: Iterable[Iterable], val: Optional[Iterable[Iterable]] = None, test: Optional[Iterable[Iterable]] = None, cache: bool = True, named_columns: Optional[List[str]] = None, transform: Dict[str, Union[Field, Dict]] = None)[source]

Bases: flambe.dataset.Dataset

Loader for tabular data, usually in csv or tsv format.

A TabularDataset can represent any data that can be organized in a table. Internally, we store all information in a 2D numpy generic array. This object also behaves as a sequence over the whole dataset, chaining the training, validation and test data, in that order. This is useful in creating vocabularies or loading embeddings over the full dataset.
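The chaining behavior described above can be sketched as follows (a simplified illustration using plain numpy, not flambe’s internals; the example data is made up):

```python
import numpy as np

# Hypothetical splits; each row is one example.
train = np.array([["a", 1], ["b", 2]], dtype=object)
val = np.array([["c", 3]], dtype=object)
test = np.array([["d", 4]], dtype=object)

# Indexing the full dataset walks train, then val, then test, which is
# convenient when building a vocabulary over every split at once.
full = np.concatenate([train, val, test])
vocab = {row[0] for row in full}
```

Index 2 of `full` falls in the validation split, since the two training rows come first.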

train

The list of training examples

Type:np.ndarray
val

The list of validation examples

Type:np.ndarray
test

The list of test examples

Type:np.ndarray
train :np.ndarray

Returns the training data as a numpy ndarray.

val :np.ndarray

Returns the validation data as a numpy ndarray.

test :np.ndarray

Returns the test data as a numpy ndarray.

raw :np.ndarray

Returns all partitions of the data as a numpy ndarray.

cols :int

Returns the number of columns in the tabular dataset.

_set_transforms(self, transform: Dict[str, Union[Field, Dict]])

Set transformations attributes and hooks to the data splits.

This method adds attributes for each field in the transform dict. It also adds hooks for the ‘process’ call in each field.

ATTENTION: This method operates on the hidden _train, _val and _test attributes, since it runs in the constructor and creates the hooks that are later used by the public properties.

classmethod from_path(cls, train_path: str, val_path: Optional[str] = None, test_path: Optional[str] = None, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)

Load a TabularDataset from the given file paths.

Parameters:
  • train_path (str) – The path to the train data
  • val_path (str, optional) – The path to the optional validation data
  • test_path (str, optional) – The path to the optional test data
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (List[str]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
  • transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified with a name for easy linking.
classmethod autogen(cls, data_path: str, test_path: Optional[str] = None, seed: Optional[int] = None, test_ratio: Optional[float] = 0.2, val_ratio: Optional[float] = 0.2, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)

Generate a test and validation set from the given file paths, then load a TabularDataset.

Parameters:
  • data_path (str) – The path to the data
  • test_path (Optional[str]) – The path to the test data
  • seed (Optional[int]) – Random seed to be used in test/val generation
  • test_ratio (Optional[float]) – The ratio of the test dataset in relation to the whole dataset. If test_path is specified, this field has no effect.
  • val_ratio (Optional[float]) – The ratio of the validation dataset in relation to the training dataset (whole - test)
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (List[str]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
  • transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified with a name for easy linking.
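The ratio semantics above can be sketched in plain Python (an illustration of how test_ratio and val_ratio interact, not flambe’s implementation; the function name and shuffle strategy are assumptions):

```python
import random


def split(data, test_ratio=0.2, val_ratio=0.2, seed=None):
    """Carve a test set from the whole, then a val set from the remainder."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    n_test = int(len(data) * test_ratio)
    test, rest = data[:n_test], data[n_test:]
    # val_ratio applies to the remaining training portion (whole - test).
    n_val = int(len(rest) * val_ratio)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

With 100 examples and the default ratios, this yields 20 test examples, then 16 validation examples out of the remaining 80, leaving 64 for training.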
classmethod _load_file(cls, path: str, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8')

Load data from the given path.

The path may be either a single file or a directory. If it is a directory, each file is loaded according to the specified options and all the data is concatenated into a single list. The files will be processed in order based on file name.

Parameters:
  • path (str) – Path to data, could be a directory, a file, or a smart_open link
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (Optional[Union[List[str], List[int]]]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
Returns:

A tuple containing the list of examples (where each example is itself also a list or tuple of entries in the dataset) and an optional list of named columns (one string for each column in the dataset)

Return type:

Tuple[List[Tuple], Optional[List[str]]]
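The directory behavior and return shape can be sketched with the standard csv module (a simplified stand-in for the pandas-based loader; the function name is hypothetical, and it assumes every file carries a header row):

```python
import csv
import os


def load_path(path, sep="\t"):
    """Load one file, or every file in a directory in sorted filename order."""
    files = [path]
    if os.path.isdir(path):
        files = sorted(os.path.join(path, f) for f in os.listdir(path))
    examples, columns = [], None
    for name in files:
        with open(name, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f, delimiter=sep))
        columns = rows[0]  # assumption: a header row on every file
        examples.extend(tuple(r) for r in rows[1:])
    # Matches the documented return type:
    # Tuple[List[Tuple], Optional[List[str]]]
    return examples, columns
```

Because files are visited in sorted filename order, the concatenated example list is deterministic across runs.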

__len__(self)

Get the length of the dataset.

__iter__(self)

Iterate through the dataset.

__getitem__(self, index)

Get the item at the given index.
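Together, these three dunder methods make the dataset behave as one sequence over the chained splits, which can be sketched as (a minimal illustration, not flambe’s code; the class name is an assumption):

```python
from itertools import chain


class ChainedSplits:
    """Sketch of the sequence protocol over train, val and test, in order."""

    def __init__(self, train, val, test):
        self.splits = (train, val, test)

    def __len__(self):
        # Total number of examples across all three splits.
        return sum(len(s) for s in self.splits)

    def __iter__(self):
        # Iterate train, then val, then test.
        return chain(*self.splits)

    def __getitem__(self, index):
        # Walk the splits in order to resolve a global index.
        for split in self.splits:
            if index < len(split):
                return split[index]
            index -= len(split)
        raise IndexError(index)
```

Global index 2 on a dataset with two training examples therefore resolves to the first validation example.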