TextField(tokenizer: Optional[Tokenizer] = None, lower: bool = False, pad_token: Optional[str] = '<pad>', unk_token: Optional[str] = '<unk>', sos_token: Optional[str] = None, eos_token: Optional[str] = None, embeddings: Optional[str] = None, embeddings_format: str = 'glove', embeddings_binary: bool = False, unk_init_all: bool = False, drop_unknown: bool = False)
Featurize raw text inputs.
This class performs tokenization and numericalization, as well as decorating the input sequences with optional start and end tokens.
When a vocabulary is passed during initialization, it is used to map the words to indices. However, the vocabulary can also be generated from input data through the setup method. Once a vocabulary has been built, this object can also be used to load external pretrained embeddings.
The pad, unk, sos and eos tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always takes index 0.
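The index ordering described above can be illustrated with a minimal sketch. It does not use this library's code: `build_vocab` is a hypothetical helper, and whitespace splitting stands in for whatever tokenizer the field is configured with.

```python
from typing import Dict, Iterable, Optional


def build_vocab(
    texts: Iterable[str],
    pad_token: Optional[str] = "<pad>",
    unk_token: Optional[str] = "<unk>",
    sos_token: Optional[str] = None,
    eos_token: Optional[str] = None,
    lower: bool = False,
) -> Dict[str, int]:
    """Hypothetical helper: map tokens to indices, special tokens first."""
    vocab: Dict[str, int] = {}

    # Special tokens take the first indices, in the order pad, unk, sos, eos.
    # Tokens set to None are skipped, so the following ones shift down.
    for special in (pad_token, unk_token, sos_token, eos_token):
        if special is not None and special not in vocab:
            vocab[special] = len(vocab)

    # Regular tokens follow, in order of first appearance.
    for text in texts:
        if lower:
            text = text.lower()
        for token in text.split():  # assumption: whitespace tokenization
            if token not in vocab:
                vocab[token] = len(vocab)

    return vocab


vocab = build_vocab(["the cat sat", "the dog sat"], eos_token="<eos>")
print(vocab)
```

In this sketch no sos token is given, so the eos token takes index 2 directly after the pad and unk tokens, and the pad token keeps index 0.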
Get the vocabulary length.
Returns: The length of the vocabulary.
Return type: int
setup(self, *data: np.ndarray)
Build the vocabulary and set the embeddings.
Parameters: data (Iterable[str]) – List of input strings.
process(self, example: str)
Process an example, and create a Tensor.
Parameters: example (str) – The example to process, as a single string.
Returns: The processed example, tokenized and numericalized.
Return type: torch.Tensor
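As a rough illustration of what tokenizing and numericalizing an example involves, the sketch below maps a string to a list of vocabulary indices. It is not the library's implementation: `process_example` is a hypothetical function, the vocabulary is hard-coded, whitespace splitting is an assumed tokenizer, and a plain Python list is returned where the real method returns a torch.Tensor.

```python
from typing import Dict, List, Optional


def process_example(
    example: str,
    vocab: Dict[str, int],
    unk_token: str = "<unk>",
    sos_token: Optional[str] = None,
    eos_token: Optional[str] = None,
    lower: bool = False,
) -> List[int]:
    """Hypothetical helper: tokenize a string and map tokens to indices."""
    if lower:
        example = example.lower()

    tokens = example.split()  # assumption: whitespace tokenization

    # Decorate the sequence with the optional start and end tokens.
    if sos_token is not None:
        tokens = [sos_token] + tokens
    if eos_token is not None:
        tokens = tokens + [eos_token]

    # Out-of-vocabulary tokens fall back to the unk index.
    unk_index = vocab[unk_token]
    return [vocab.get(token, unk_index) for token in tokens]


# Hard-coded vocabulary for illustration only.
vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3, "the": 4, "cat": 5, "sat": 6}

ids = process_example(
    "The cat slept", vocab, sos_token="<sos>", eos_token="<eos>", lower=True
)
print(ids)  # [2, 4, 5, 1, 3]
```

Here "slept" is not in the vocabulary, so it maps to the unk index 1. Wrapping the resulting list with `torch.tensor(ids)` would give a tensor of the kind the documented return type describes.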