flambe.dataset.tabular

Module Contents

class flambe.dataset.tabular.DataView(data: np.ndarray, transform_hooks: List[Tuple[Field, Union[int, List[int]]]], cache: bool)[source]

TabularDataset view for the train, val or test split. This class should only be used internally by the TabularDataset class.

A DataView is a lazy Iterable that receives its operations from the TabularDataset object. When __getitem__ is called, all the fields defined in the transform are applied.

This object can cache already-transformed examples. To enable this, make sure this view is used as a singleton (there must be only one DataView per split in the TabularDataset).

raw[source]

Returns a subscriptable version of the data

__getitem__(self, index)[source]

Get the item at the given index and apply the transformations dynamically.

is_empty(self)[source]

Return whether the DataView contains data

cols(self)[source]

Return the number of columns the DataView has.

__len__(self)[source]

Return the length of the DataView, i.e. the number of examples it contains.

__setitem__(self)[source]

Raise an error as DataViews are immutable.

__delitem__(self)[source]

Raise an error as DataViews are immutable.

class flambe.dataset.tabular.TabularDataset(train: Iterable[Iterable], val: Optional[Iterable[Iterable]] = None, test: Optional[Iterable[Iterable]] = None, cache: bool = True, named_columns: Optional[List[str]] = None, transform: Dict[str, Union[Field, Dict]] = None)[source]

Bases: flambe.dataset.Dataset

Loader for tabular data, usually in CSV or TSV format.

A TabularDataset can represent any data that can be organized in a table. Internally, we store all information in a 2D numpy generic array. This object also behaves as a sequence over the whole dataset, chaining the training, validation and test data, in that order. This is useful when creating vocabularies or loading embeddings over the full dataset.
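Because the dataset chains its splits into a single sequence, anything built over the full data (a vocabulary, for instance) can simply iterate the dataset itself. A minimal sketch of that chaining behavior in plain Python (the example rows are hypothetical, not flambe API calls):

```python
from itertools import chain

# Hypothetical splits: each example is a row (a tuple of column values).
train = [("hello", 0), ("world", 1)]
val = [("foo", 0)]
test = [("bar", 1)]

# A TabularDataset behaves as a sequence over train + val + test, in that
# order, which is what makes building a vocabulary over all partitions easy.
full = list(chain(train, val, test))
vocab = {text for text, _ in full}
```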

train[source]

The list of training examples

Type:np.ndarray
val[source]

The list of validation examples

Type:np.ndarray
test[source]

The list of test examples

Type:np.ndarray
train :np.ndarray[source]

Returns the training data as a numpy ndarray

val :np.ndarray[source]

Returns the validation data as a numpy ndarray

test :np.ndarray[source]

Returns the test data as a numpy ndarray

raw :np.ndarray[source]

Returns all partitions of the data as a numpy ndarray

cols :int[source]

Returns the number of columns in the tabular dataset

_set_transforms(self, transform: Dict[str, Union[Field, Dict]])[source]

Set transformations attributes and hooks to the data splits.

This method adds attributes for each field in the transform dict. It also adds hooks for the ‘process’ call in each field.

ATTENTION: this method operates on the hidden _train, _val and _test attributes, since it runs in the constructor and creates the hooks that are used later when building the properties.

classmethod from_path(cls, train_path: str, val_path: Optional[str] = None, test_path: Optional[str] = None, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)[source]

Load a TabularDataset from the given file paths.

Parameters:
  • train_path (str) – The path to the train data
  • val_path (str, optional) – The path to the optional validation data
  • test_path (str, optional) – The path to the optional test data
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (List[str]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
  • transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified with a name for easy linking.
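The loading semantics can be sketched without flambe: the snippet below approximates what from_path does for a single TSV file, using the standard csv module instead of pandas.read_csv (the real implementation delegates to pandas; load_tsv and its sample data are illustrative, not part of the flambe API).

```python
import csv
import io

def load_tsv(text, sep="\t", header=0):
    """Rough sketch of loading one delimited file into examples + columns."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    if header is not None:
        # First row holds the column names; the rest are examples.
        named_columns, examples = rows[0], rows[1:]
    else:
        named_columns, examples = None, rows
    return examples, named_columns

data = "text\tlabel\nhello world\t0\nfoo bar\t1\n"
examples, columns = load_tsv(data)
```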
classmethod autogen(cls, data_path: str, test_path: Optional[str] = None, seed: Optional[int] = None, test_ratio: Optional[float] = 0.2, val_ratio: Optional[float] = 0.2, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)[source]

Generate a test and validation set from the given file paths, then load a TabularDataset.

Parameters:
  • data_path (str) – The path to the data
  • test_path (Optional[str]) – The path to the test data
  • seed (Optional[int]) – Random seed to be used in test/val generation
  • test_ratio (Optional[float]) – The ratio of the test dataset in relation to the whole dataset. If test_path is specified, this field has no effect.
  • val_ratio (Optional[float]) – The ratio of the validation dataset in relation to the training dataset (whole - test)
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (List[str]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
  • transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified with a name for easy linking.
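The split arithmetic described above can be sketched in plain Python: carve a test set out of the whole data, then a validation set out of what remains. This is an illustration of the documented semantics, not the actual flambe implementation (the split function below is hypothetical).

```python
import random

def split(data, seed=None, test_ratio=0.2, val_ratio=0.2):
    """Shuffle, take test_ratio of the whole, then val_ratio of the rest."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    test, rest = rows[:n_test], rows[n_test:]
    # val_ratio applies to the remainder (whole - test), as documented.
    n_val = int(len(rest) * val_ratio)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split(range(100), seed=42)
```

With 100 examples and the default ratios this yields 20 test examples, then 16 validation examples out of the remaining 80.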
classmethod _load_file(cls, path: str, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8')[source]

Load data from the given path.

The path may be either a single file or a directory. If it is a directory, each file is loaded according to the specified options and all the data is concatenated into a single list. The files will be processed in order based on file name.

Parameters:
  • path (str) – Path to data, could be a directory, a file, or a smart_open link
  • sep (str) – Separator to pass to the read_csv method
  • header (Optional[Union[str, int]]) – Use 0 for first line, None for no headers, and ‘infer’ to detect it automatically, defaults to ‘infer’
  • columns (Optional[Union[List[str], List[int]]]) – List of columns to load, can be used to select a subset of columns, or change their order at loading time
  • encoding (str) – The encoding format passed to the pandas reader
Returns:

A tuple containing the list of examples (where each example is itself also a list or tuple of entries in the dataset) and an optional list of named columns (one string for each column in the dataset)

Return type:

Tuple[List[Tuple], Optional[List[str]]]
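The directory case can be sketched as follows: each file is loaded in file-name order and the rows are concatenated into one list. This is an illustrative approximation (the real method delegates to pandas; load_dir is hypothetical).

```python
import csv
import tempfile
from pathlib import Path

def load_dir(path, sep="\t"):
    """Load every file in a directory, sorted by file name, and concatenate."""
    examples = []
    for file in sorted(Path(path).iterdir()):
        with open(file, newline="") as f:
            examples.extend(csv.reader(f, delimiter=sep))
    return examples

# Quick demonstration with two throwaway files:
d = tempfile.mkdtemp()
Path(d, "b.tsv").write_text("x\t1\n")
Path(d, "a.tsv").write_text("y\t0\n")
rows = load_dir(d)  # a.tsv's rows come first, since files load in name order
```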

__len__(self)[source]

Get the length of the dataset.

__iter__(self)[source]

Iterate through the dataset.

__getitem__(self, index)[source]

Get the item at the given index.