flambe.field.text

Module Contents

class flambe.field.text.TextField(tokenizer: Optional[Tokenizer] = None, lower: bool = False, pad_token: Optional[str] = '<pad>', unk_token: Optional[str] = '<unk>', sos_token: Optional[str] = None, eos_token: Optional[str] = None, embeddings: Optional[str] = None, embeddings_format: str = 'glove', embeddings_binary: bool = False, unk_init_all: bool = False, drop_unknown: bool = False, max_seq_len: Optional[int] = None, truncate_end: bool = False)

Bases: flambe.field.Field

Featurize raw text inputs.

This class performs tokenization and numericalization, as well as decorating the input sequences with optional start and end tokens.

When a vocabulary is passed during initialization, it is used to map the words to indices. However, the vocabulary can also be generated from input data through the setup method. Once a vocabulary has been built, this object can also be used to load external pretrained embeddings.

The pad, unk, sos and eos tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always takes index 0.
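The index ordering described above can be illustrated with a minimal pure-Python sketch. It is not the TextField implementation: build_vocab is a hypothetical helper, whitespace splitting stands in for a real tokenizer, and assigning the remaining indices in order of first appearance is an assumption made for the example.

```python
from typing import Dict, Iterable, Optional


def build_vocab(
    data: Iterable[str],
    pad_token: Optional[str] = "<pad>",
    unk_token: Optional[str] = "<unk>",
    sos_token: Optional[str] = None,
    eos_token: Optional[str] = None,
    lower: bool = False,
) -> Dict[str, int]:
    """Hypothetical helper: map tokens to indices, special tokens first."""
    vocab: Dict[str, int] = {}
    # Special tokens that are given take the first indices,
    # in pad, unk, sos, eos order.
    for token in (pad_token, unk_token, sos_token, eos_token):
        if token is not None and token not in vocab:
            vocab[token] = len(vocab)
    # Assumption: remaining words are indexed in order of first appearance.
    for text in data:
        if lower:
            text = text.lower()
        # Assumption: whitespace split stands in for the tokenizer.
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


vocab = build_vocab(
    ["the cat sat", "the dog sat"],
    sos_token="<sos>",
    eos_token="<eos>",
)
# Without a pad token, the unk token moves up to index 0.
vocab_no_pad = build_vocab(["a b"], pad_token=None)
```

With all four special tokens given, they occupy indices 0 to 3 and the first data word receives index 4. When the pad token is omitted, the remaining special tokens shift down accordingly.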

vocab_size: int

Get the vocabulary length.

Returns: The length of the vocabulary
Return type: int

setup(self, *data: np.ndarray)

Build the vocabulary and set the embeddings.

Parameters: data (Iterable[str]) – List of input strings.

process(self, example: str)

Process an example, and create a Tensor.

Parameters: example (str) – The example to process, as a single string
Returns: The processed example, tokenized and numericalized
Return type: torch.Tensor
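
As a rough illustration of what tokenizing and numericalizing an example involves, the following sketch walks through the steps in plain Python. It is a simplified stand-in rather than the library's code: the function name numericalize and the hand-written vocab are hypothetical, whitespace splitting replaces the tokenizer, unknown words are assumed to fall back to the unk token, and a plain list of integers is returned where process returns a torch.Tensor.

```python
from typing import Dict, List, Optional


def numericalize(
    example: str,
    vocab: Dict[str, int],
    lower: bool = False,
    unk_token: Optional[str] = "<unk>",
    sos_token: Optional[str] = None,
    eos_token: Optional[str] = None,
) -> List[int]:
    """Hypothetical helper: turn one string into a list of vocabulary indices."""
    if lower:
        example = example.lower()
    # Assumption: whitespace split stands in for the tokenizer.
    tokens = example.split()
    # Decorate the sequence with the optional start and end tokens.
    if sos_token is not None:
        tokens = [sos_token] + tokens
    if eos_token is not None:
        tokens = tokens + [eos_token]
    indices: List[int] = []
    for token in tokens:
        if token in vocab:
            indices.append(vocab[token])
        elif unk_token is not None:
            # Assumption: out-of-vocabulary words map to the unk index.
            indices.append(vocab[unk_token])
        else:
            raise ValueError(f"Unknown token {token!r} and no unk_token set")
    return indices


# Hand-written vocabulary with the special tokens at the first indices.
vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3, "the": 4, "cat": 5, "sat": 6}
indices = numericalize(
    "The cat barked", vocab, lower=True, sos_token="<sos>", eos_token="<eos>"
)
# indices == [2, 4, 5, 1, 3]
```

Here "barked" is not in the vocabulary, so it is mapped to the unk index 1, and the sequence is wrapped in the sos and eos indices.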