BoWField(tokenizer: Optional[Tokenizer] = None, lower: bool = False, unk_token: str = '<unk>', min_freq: int = 5, normalize: bool = False, scale_factor: float = None)
Featurize raw text inputs using bag of words (BoW).
This class performs tokenization and numericalization.
The pad and unk tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always takes index 0.
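The index ordering described above can be illustrated with a minimal sketch. This is not the library's implementation: `build_vocab` and its `pad_token` parameter are hypothetical names introduced here, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def build_vocab(texts: Iterable[str], min_freq: int = 1,
                pad_token: Optional[str] = None,
                unk_token: Optional[str] = '<unk>') -> Dict[str, int]:
    """Hypothetical helper: map tokens to indices, special tokens first."""
    vocab: Dict[str, int] = {}
    # Special tokens take the lowest indices, pad before unk.
    for special in (pad_token, unk_token):
        if special is not None and special not in vocab:
            vocab[special] = len(vocab)
    # Count whitespace-separated tokens across all examples
    # (assumption: a plain split stands in for the tokenizer).
    counts = Counter(tok for text in texts for tok in text.split())
    # Keep tokens seen at least min_freq times, in first-seen order.
    for tok, freq in counts.items():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


texts = ['thank you', 'thank you very much', 'thanks a lot']
with_pad = build_vocab(texts, min_freq=2, pad_token='<pad>')
# with_pad == {'<pad>': 0, '<unk>': 1, 'thank': 2, 'you': 3}
without_pad = build_vocab(texts, min_freq=2)
# without_pad == {'<unk>': 0, 'thank': 1, 'you': 2}
```

In this sketch, when no pad token is given, the unk token moves up to index 0.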
>>> f = BoWField(min_freq=2, normalize=True)
>>> f.setup(['thank you', 'thank you very much', 'thanks a lot'])
>>> f._vocab.keys()
['thank', 'you']
Note that ‘thank’ and ‘you’ are the only tokens that appear twice, so they are the only ones that meet min_freq=2.
>>> f.process("thank you really. You help was awesome")
tensor([1, 2])
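For readers unfamiliar with bag-of-words featurization, the general idea can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of BoWField.process: `bow_vector` is a hypothetical helper, tokenization is a plain whitespace split, unknown tokens are counted under the unk entry, and normalization is assumed to mean dividing counts by their sum. It does not attempt to reproduce the exact tensor shown above.

```python
from typing import Dict, List


def bow_vector(text: str, vocab: Dict[str, int], unk_token: str = '<unk>',
               normalize: bool = False) -> List[float]:
    """Hypothetical helper: count vocabulary tokens into a fixed-length vector."""
    vec = [0.0] * len(vocab)
    for tok in text.split():
        # Out-of-vocabulary tokens fall back to the unk index, if there is one.
        idx = vocab.get(tok, vocab.get(unk_token))
        if idx is not None:
            vec[idx] += 1.0
    if normalize:
        # Assumption: normalization divides each count by the total count.
        total = sum(vec)
        if total > 0:
            vec = [v / total for v in vec]
    return vec


example_vocab = {'<unk>': 0, 'thank': 1, 'you': 2}
counts = bow_vector('thank you thank', example_vocab)
# counts == [0.0, 2.0, 1.0]
shares = bow_vector('thank you thank you', example_vocab, normalize=True)
# shares == [0.0, 0.5, 0.5]
```

The vector length equals the vocabulary size, so every input maps to the same shape regardless of its length.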
Get the vocabulary length.
Returns: The length of the vocabulary
Return type: int