flambe.field.text

Module Contents

flambe.field.text.get_embeddings(embeddings: str, embeddings_format: str = 'glove', embeddings_binary: bool = False) → KeyedVectors[source]

Get the embeddings model used in the setup function

Parameters:
  • embeddings (str) – Path to pretrained embeddings, or the embedding name when the format is ‘gensim’
  • embeddings_format (str, optional) – The format of the input embeddings, should be one of: ‘glove’, ‘word2vec’, ‘fasttext’ or ‘gensim’. The latter can be used to download embeddings hosted on gensim on the fly. See https://github.com/RaRe-Technologies/gensim-data for the list of available embedding aliases.
  • embeddings_binary (bool, optional) – Whether the input embeddings are provided in binary format, by default False
Returns: The embeddings object specified by the parameters.
Return type: KeyedVectors
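
Example (a minimal sketch, assuming the gensim-data downloader can fetch the ‘glove-wiki-gigaword-50’ alias; any alias from the list linked above should work the same way):

>>> from flambe.field.text import get_embeddings
>>> # Download 50-dimensional GloVe vectors through gensim-data
>>> model = get_embeddings('glove-wiki-gigaword-50', embeddings_format='gensim')
>>> model.vector_size
50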

class flambe.field.text.EmbeddingsInformation[source]

Bases: typing.NamedTuple

Information about an embedding model.

Parameters:
  • embeddings (str) – Path to pretrained embeddings, or the embedding name when the format is ‘gensim’.
  • embeddings_format (str, optional) – The format of the input embeddings, should be one of: ‘glove’, ‘word2vec’, ‘fasttext’ or ‘gensim’. The latter can be used to download embeddings hosted on gensim on the fly. See https://github.com/RaRe-Technologies/gensim-data for the list of available embedding aliases.
  • embeddings_binary (bool, optional) – Whether the input embeddings are provided in binary format, by default False.
  • build_vocab_from_embeddings (bool) – Controls whether all words from the optionally provided embeddings will be added to the vocabulary and to the embedding matrix. Defaults to False.
  • unk_init_all (bool, optional) – If True, every token not provided in the input embeddings is given a random embedding from a normal distribution. Otherwise, all of them map to the ‘<unk>’ token.
  • drop_unknown (bool) – Whether to drop tokens that don’t have embeddings associated. Defaults to False. Important: this flag will only work when using embeddings.
embeddings :str[source]
embeddings_format :str = 'gensim'[source]
embeddings_binary :bool = False[source]
build_vocab_from_embeddings :bool = False[source]
unk_init_all :bool = False[source]
drop_unknown :bool = False[source]
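
Example (a sketch constructing the tuple for a local word2vec file; the path is hypothetical):

>>> from flambe.field.text import EmbeddingsInformation
>>> info = EmbeddingsInformation(
...     embeddings='vectors/word2vec.bin',  # hypothetical local path
...     embeddings_format='word2vec',
...     embeddings_binary=True,  # the file above is in binary format
...     unk_init_all=True,  # give OOV tokens random normal embeddings
... )
>>> info.drop_unknown
False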
class flambe.field.text.TextField(tokenizer: Optional[Tokenizer] = None, lower: bool = False, pad_token: Optional[str] = '<pad>', unk_token: str = '<unk>', sos_token: Optional[str] = None, eos_token: Optional[str] = None, embeddings_info: Optional[EmbeddingsInformation] = None, embeddings: Optional[str] = None, embeddings_format: str = 'glove', embeddings_binary: bool = False, unk_init_all: bool = False, drop_unknown: bool = False, max_seq_len: Optional[int] = None, truncate_end: bool = False, setup_all_embeddings: bool = False, additional_special_tokens: Optional[List[str]] = None, vocabulary: Optional[Union[Iterable[str], str]] = None)[source]

Bases: flambe.field.Field

Featurize raw text inputs.

This class performs tokenization and numericalization, as well as decorating the input sequences with optional start and end tokens.

When a vocabulary is passed during initialization, it is used to map words to indices. However, the vocabulary can also be generated from input data through the setup method. Once a vocabulary has been built, this object can also be used to load external pretrained embeddings.

The pad, unk, sos and eos tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it will always be assigned index 0.
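
Example (a minimal construction sketch illustrating the special token ordering; the sos and eos values are arbitrary choices, not defaults):

>>> from flambe.field.text import TextField
>>> field = TextField(
...     lower=True,
...     pad_token='<pad>',  # always index 0 when given
...     unk_token='<unk>',  # index 1
...     sos_token='<sos>',  # index 2
...     eos_token='<eos>',  # index 3
... )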

vocab_list :List[str][source]

Get the list of tokens in the vocabulary.

Returns: The list of tokens in the vocabulary, ordered.
Return type: List[str]
vocab_size :int[source]

Get the vocabulary length.

Returns: The length of the vocabulary.
Return type: int
_flatten_to_str(self, data_sample: Union[List, Tuple, Dict])[source]

Convert any nested data sample to a str.

Used to build vocabularies from complex file structures.

Parameters: data_sample (Union[List, Tuple, Dict]) – The nested data sample to flatten.
Returns: The flattened version, for vocab building.
Return type: str
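
For illustration only (this is not the actual implementation), a flattening helper along these lines reduces arbitrary nesting to a single string:

>>> def flatten(sample):
...     # Recursively join nested containers into one space-separated string
...     if isinstance(sample, (list, tuple)):
...         return ' '.join(flatten(s) for s in sample)
...     if isinstance(sample, dict):
...         return ' '.join(flatten(v) for v in sample.values())
...     return str(sample)
>>> flatten({'question': ['who?', 'what?'], 'answer': 'nobody'})
'who? what? nobody'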
_build_vocab(self, *data: np.ndarray)[source]

Build the vocabulary for this object based on the special tokens and the data provided.

This method can safely be called multiple times.

Parameters: *data (np.ndarray) – The data from which to build the vocabulary.
_build_embeddings(self, model: KeyedVectors, setup_vocab_from_embeddings: bool, initialize_unknowns: bool)[source]

Create the embeddings matrix and the new vocabulary in case this object needs to use an embeddings model.

A new vocabulary needs to be built because some of the parameters can change its contents, for example by collapsing out-of-vocabulary tokens.

Parameters:
  • model (KeyedVectors) – The embeddings
  • setup_vocab_from_embeddings (bool) – Controls whether all words from the optionally provided embeddings will be added to the vocabulary and to the embedding matrix. Defaults to False.
  • initialize_unknowns (bool) – If True, every unknown token will be assigned a random embedding from a normal distribution. Otherwise, all of them map to the ‘<unk>’ token.
Returns: A tuple with the new vocabulary and the embedding matrix.
Return type: Tuple[OrderedDict, torch.Tensor]

setup(self, *data: np.ndarray)[source]

Build the vocabulary and set up the embeddings.

Parameters: *data (np.ndarray) – The input data, as arrays of strings.
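
Example (a sketch building a vocabulary from a numpy array of strings, assuming the default word-level tokenizer):

>>> import numpy as np
>>> from flambe.field.text import TextField
>>> field = TextField(lower=True)
>>> field.setup(np.array(['The quick brown fox', 'jumps over the lazy dog']))
>>> field.vocab_list[:2]  # pad and unk take the first indices
['<pad>', '<unk>']
>>> field.vocab_size  # 2 special tokens + 8 unique lowercased words
10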
process(self, example: Union[str, Tuple[Any], List[Any], Dict[Any, Any]])[source]

Process an example and create a Tensor.

Parameters: example (Union[str, Tuple[Any], List[Any], Dict[Any, Any]]) – The example to process: a single string, or a nested structure of strings.
Returns: The processed example, tokenized and numericalized.
Return type: torch.Tensor
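
Continuing the sketch above, processing a raw string returns a tensor of vocabulary indices (the exact indices depend on the built vocabulary):

>>> out = field.process('the quick fox')
>>> out.shape  # one index per token; no sos/eos tokens were configured
torch.Size([3])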
classmethod from_embeddings(cls, embeddings: str, embeddings_format: str = 'glove', embeddings_binary: bool = False, build_vocab_from_embeddings: bool = False, unk_init_all: bool = False, drop_unknown: bool = False, additional_special_tokens: Optional[List[str]] = None, **kwargs)[source]

Alternate constructor to create a TextField from embeddings parameters.

Parameters:
  • embeddings (str) – Path to pretrained embeddings, or the embedding name when the format is ‘gensim’
  • embeddings_format (str, optional) – The format of the input embeddings, should be one of: ‘glove’, ‘word2vec’, ‘fasttext’ or ‘gensim’. The latter can be used to download embeddings hosted on gensim on the fly. See https://github.com/RaRe-Technologies/gensim-data for the list of available embedding aliases.
  • embeddings_binary (bool, optional) – Whether the input embeddings are provided in binary format, by default False
  • build_vocab_from_embeddings (bool) – Controls whether all words from the optionally provided embeddings will be added to the vocabulary and to the embedding matrix. Defaults to False.
  • unk_init_all (bool, optional) – If True, every token not provided in the input embeddings is given a random embedding from a normal distribution. Otherwise, all of them map to the ‘<unk>’ token.
  • drop_unknown (bool) – Whether to drop tokens that don’t have embeddings associated. Defaults to False. Important: this flag will only work when using embeddings.
  • additional_special_tokens (Optional[List[str]]) – Additional tokens that have a reserved interpretation in the context of the current experiment and should therefore never be treated as “unknown”. Passing them here ensures that they get their own trainable embedding.
Returns: The constructed text field with the requested model.
Return type: TextField
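
Example (a sketch mirroring the get_embeddings example above; the ‘glove-wiki-gigaword-50’ gensim-data alias is again an assumption, and any local path plus the matching format works too):

>>> from flambe.field.text import TextField
>>> field = TextField.from_embeddings(
...     embeddings='glove-wiki-gigaword-50',  # gensim-data alias
...     embeddings_format='gensim',
...     unk_init_all=True,  # random embeddings for OOV tokens
... )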