flambe.field

Package Contents

class flambe.field.Field

Bases: flambe.Component

Base Field interface.

A field processes raw examples and produces Tensors.

setup(self, *data: np.ndarray)

Setup the field.

This method will be called with all the data in the dataset and it can be used to compute aggregated information (for example, vocabulary in Fields that process text).

ATTENTION: this method may be called multiple times when the same field is used in different datasets. Take this into account and build a stateful implementation.

Parameters:*data (np.ndarray) – Multiple 2D arrays (e.g. train_data, dev_data, test_data). The first dimension indexes the examples; the second indexes the columns specified for this field.
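
The note about repeated calls can be illustrated with a minimal sketch. It assumes nothing about flambe's internals: `CountingVocabSketch` is a hypothetical stand-in class, and plain lists of strings replace the 2D arrays so the example runs without numpy or flambe.

```python
from collections import Counter
from typing import Dict, Iterable


class CountingVocabSketch:
    """Hypothetical stand-in for a Field; not part of flambe."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.vocab: Dict[str, int] = {}

    def setup(self, *data: Iterable[str]) -> None:
        # Each call adds to the existing state instead of resetting it,
        # so calling setup once per dataset is safe.
        for dataset in data:
            for text in dataset:
                self.counts.update(text.split())
        for token in self.counts:
            # setdefault keeps indices assigned by earlier calls stable.
            self.vocab.setdefault(token, len(self.vocab))


field = CountingVocabSketch()
field.setup(["a b"], ["b c"])  # several datasets in one call
field.setup(["c d"])           # a later call from another dataset
```

Because the second call only appends new tokens, indices handed out by the first call remain valid.
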
process(self, *example: Any)

Process an example into a Tensor or tuple of Tensor.

This method allows N to M mappings from example columns (N) to tensors (M).

Parameters:*example (Any) – Column values of the example
Returns:The processed example, as a tensor or tuple of tensors
Return type:Union[torch.Tensor, Tuple[torch.Tensor, ...]]
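
As a rough illustration of an N to M mapping, the sketch below takes two example columns and returns two outputs. The function body, the column meanings (a question and its context), and the use of plain lists in place of tensors are all assumptions made for this example; N and M need not be equal in general.

```python
from typing import List, Tuple


def process(*example: str) -> Tuple[List[str], List[int]]:
    """Hypothetical 2-to-2 mapping; lists stand in for tensors."""
    # N = 2 input columns.
    question, context = example
    question_tokens = question.split()
    context_tokens = context.split()
    tokens = question_tokens + context_tokens
    # Segment ids mark which column each token came from.
    segments = [0] * len(question_tokens) + [1] * len(context_tokens)
    # M = 2 outputs.
    return tokens, segments


tokens, segments = process("who wrote it", "she wrote it in 1990")
```
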
class flambe.field.TextField(tokenizer: Optional[Tokenizer] = None, lower: bool = False, pad_token: Optional[str] = '<pad>', unk_token: Optional[str] = '<unk>', sos_token: Optional[str] = None, eos_token: Optional[str] = None, embeddings: Optional[str] = None, embeddings_format: str = 'glove', embeddings_binary: bool = False, unk_init_all: bool = False)

Bases: flambe.field.Field

Featurize raw text inputs

This class performs tokenization and numericalization, as well as decorating the input sequences with optional start and end tokens.

When a vocabulary is passed during initialization, it is used to map the words to indices. However, the vocabulary can also be generated from input data, through the setup method. Once a vocabulary has been built, this object can also be used to load external pretrained embeddings.

The pad, unk, sos and eos tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always uses index 0.
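
The index ordering described above can be sketched as follows. `build_vocab` is a hypothetical helper written for this illustration, not a flambe function; it only mirrors the documented rule that special tokens come first, in pad, unk, sos, eos order, and that tokens set to None are skipped.

```python
from typing import Dict, Iterable, Optional


def build_vocab(tokens: Iterable[str],
                pad_token: Optional[str] = "<pad>",
                unk_token: Optional[str] = "<unk>",
                sos_token: Optional[str] = None,
                eos_token: Optional[str] = None) -> Dict[str, int]:
    """Hypothetical helper: special tokens take the first indices."""
    vocab: Dict[str, int] = {}
    # Special tokens first, in the documented order; None entries are skipped.
    for special in (pad_token, unk_token, sos_token, eos_token):
        if special is not None:
            vocab.setdefault(special, len(vocab))
    # Regular tokens follow in order of first appearance.
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


vocab = build_vocab("the cat sat".split(), eos_token="<eos>")
```

With the defaults plus an eos token, the pad token sits at index 0 and the first regular token starts at index 3.
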

vocab_size :int

Get the vocabulary length.

Returns:The length of the vocabulary
Return type:int
setup(self, *data: np.ndarray)

Build the vocabulary and set the embeddings.

Parameters:data (Iterable[str]) – List of input strings.
process(self, example: str)

Process an example, and create a Tensor.

Parameters:example (str) – The example to process, as a single string
Returns:The processed example, tokenized and numericalized
Return type:torch.Tensor
class flambe.field.BoWField(tokenizer: Optional[Tokenizer] = None, lower: bool = False, unk_token: str = '<unk>', min_freq: int = 5, normalize: bool = False, scale_factor: float = None)

Bases: flambe.field.Field

Featurize raw text inputs using bag of words (BoW)

This class performs tokenization and numericalization.

The pad and unk tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always uses index 0.

Examples

>>> f = BoWField(min_freq=2, normalize=True)
>>> f.setup(['thank you', 'thank you very much', 'thanks a lot'])
>>> f._vocab.keys()
['thank', 'you']

Note that ‘thank’ and ‘you’ are the only ones that appear twice.

>>> f.process("thank you really. You help was awesome")
tensor([1, 2])
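
For readers unfamiliar with frequency-filtered bag-of-words features, here is a generic sketch of the idea. It does not attempt to reproduce BoWField's exact output format: `build_bow_vocab` and `bow_vector` are hypothetical helpers, whitespace splitting replaces the tokenizer, and a plain list of counts stands in for the tensor.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_bow_vocab(texts: Iterable[str], min_freq: int = 2,
                    unk_token: str = "<unk>") -> Dict[str, int]:
    """Hypothetical helper: keep tokens seen at least min_freq times."""
    counts = Counter(tok for text in texts for tok in text.split())
    vocab = {unk_token: 0}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def bow_vector(text: str, vocab: Dict[str, int],
               unk_token: str = "<unk>") -> List[int]:
    """Count occurrences per vocabulary entry; unknowns share one slot."""
    vector = [0] * len(vocab)
    for tok in text.split():
        vector[vocab.get(tok, vocab[unk_token])] += 1
    return vector


vocab = build_bow_vocab(['thank you', 'thank you very much', 'thanks a lot'])
vector = bow_vector("thank you really", vocab)
```

In this sketch the rare tokens fall out of the vocabulary and are counted under the unknown slot at processing time.
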
vocab_size :int

Get the vocabulary length.

Returns:The length of the vocabulary
Return type:int
process(self, example)
setup(self, *data)
class flambe.field.LabelField(one_hot: bool = False, multilabel_sep: Optional[str] = None)

Bases: flambe.field.text.TextField

Featurizes input labels.

The class also handles multilabel inputs and one hot encoding.

process(self, example)

Featurize a single example.

Parameters:example (str) – The input label
Returns:A list of integer tokens
Return type:torch.Tensor
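
A minimal sketch of what one hot encoding and multilabel splitting could look like is given below. How flambe actually lays out these tensors is not specified in this section, so `encode_label` and the `label_to_index` mapping are hypothetical, and lists stand in for tensors.

```python
from typing import Dict, List, Optional


def encode_label(example: str, label_to_index: Dict[str, int],
                 one_hot: bool = False,
                 multilabel_sep: Optional[str] = None) -> List[int]:
    """Hypothetical helper: map a label string to indices or a 0/1 vector."""
    # Split into several labels only when a separator is configured.
    labels = example.split(multilabel_sep) if multilabel_sep else [example]
    indices = [label_to_index[label] for label in labels]
    if not one_hot:
        return indices
    # One slot per known label; every label present is set to 1.
    vector = [0] * len(label_to_index)
    for index in indices:
        vector[index] = 1
    return vector


mapping = {"sports": 0, "tech": 1, "politics": 2}
single = encode_label("tech", mapping)
multi = encode_label("sports,politics", mapping, one_hot=True, multilabel_sep=",")
```
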