flambe.field.bow

Module Contents

class flambe.field.bow.BoWField(tokenizer: Optional[Tokenizer] = None, lower: bool = False, unk_token: str = '<unk>', min_freq: int = 5, normalize: bool = False, scale_factor: float = None)

Bases: flambe.field.Field

Featurize raw text inputs using a bag-of-words (BoW) representation.

This class performs tokenization and numericalization.

The pad and unk tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always uses index 0.
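
The index ordering described above can be illustrated with a minimal, library-independent sketch. The helper `build_vocab` and its parameters are hypothetical and are not part of flambe's API. The sketch assumes whitespace tokenization and that special tokens are inserted before any corpus tokens.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_vocab(texts: Iterable[str],
                specials: Sequence[str] = ('<pad>', '<unk>'),
                min_freq: int = 1) -> Dict[str, int]:
    """Hypothetical helper: map tokens to indices, special tokens first."""
    vocab: Dict[str, int] = {}
    # Special tokens take the first indices, in the order given.
    for token in specials:
        vocab[token] = len(vocab)
    # Count whitespace-separated tokens over the whole corpus.
    counts = Counter(tok for text in texts for tok in text.split())
    # Keep tokens that occur at least min_freq times, in first-seen order.
    for token, freq in counts.items():
        if freq >= min_freq and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


vocab = build_vocab(['thank you', 'thank you very much', 'thanks a lot'],
                    min_freq=2)
```

With this corpus, `vocab` maps '<pad>' to 0, '<unk>' to 1, 'thank' to 2 and 'you' to 3. Every other token occurs only once and is dropped.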

Examples

>>> f = BoWField(min_freq=2, normalize=True)
>>> f.setup(['thank you', 'thank you very much', 'thanks a lot'])
>>> f._vocab.keys()
['thank', 'you']

Note that ‘thank’ and ‘you’ are the only tokens that appear at least twice, so they are the only ones that satisfy min_freq=2.

>>> f.process("thank you really. Your help was awesome")
tensor([1, 2])
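
For intuition about what a bag-of-words featurization computes, here is a minimal sketch in plain Python. It is not flambe's implementation. The function `bow_vector`, the fixed vocabulary, and the choice to normalize by dividing counts by their sum are all assumptions made for illustration.

```python
from typing import Dict, List


def bow_vector(text: str, vocab: Dict[str, int],
               unk_token: str = '<unk>',
               normalize: bool = False) -> List[float]:
    """Hypothetical helper: count occurrences of each vocabulary entry."""
    counts = [0.0] * len(vocab)
    for token in text.split():
        # Tokens outside the vocabulary are counted under the unknown token.
        index = vocab.get(token, vocab[unk_token])
        counts[index] += 1.0
    if normalize:
        total = sum(counts)
        # Assumed normalization: divide by the total count (skip empty input).
        if total > 0:
            counts = [c / total for c in counts]
    return counts


vocab = {'<unk>': 0, 'thank': 1, 'you': 2}
vec = bow_vector('thank you very much', vocab)
```

Here `vec` has one entry per vocabulary item: two unknown tokens ('very', 'much'), one 'thank' and one 'you'. Word order is discarded, which is the defining property of a bag-of-words representation.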

vocab_size: int

Get the vocabulary length.

Returns: The length of the vocabulary.
Return type: int
process(self, example)
setup(self, *data)