flambe.nlp.language_modeling

Package Contents

class flambe.nlp.language_modeling.PTBDataset(split_by_line: bool = False, end_of_line_token: Optional[str] = '<eol>', cache: bool = False, transform: Dict[str, Union[Field, Dict]] = None)

Bases: flambe.dataset.TabularDataset

The official PTB dataset.

PTB_URL = https://raw.githubusercontent.com/yoonkim/lstm-char-cnn/master/data/ptb/
_process(self, file: bytes)

Process the input file.

Parameters:file (bytes) – The input file, as a byte string
Returns:List of examples, where each example is a single element tuple containing the text.
Return type:List[Tuple[str]]
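
To illustrate the return shape described above, the sketch below decodes a byte string and wraps the text in single-element tuples. The helper name `process_text` is hypothetical, and its handling of `split_by_line` and `end_of_line_token` is an assumption inferred from the parameter names, not flambe's actual implementation.

```python
from typing import List, Optional, Tuple


def process_text(file: bytes,
                 split_by_line: bool = False,
                 end_of_line_token: Optional[str] = '<eol>') -> List[Tuple[str]]:
    """Hypothetical stand-in for a dataset `_process` method."""
    # Decode the raw bytes and drop empty lines
    lines = [line.strip() for line in file.decode('utf-8').splitlines()]
    lines = [line for line in lines if line]

    # Assumption: the end-of-line token, when given, is appended to each line
    if end_of_line_token is not None:
        lines = [f'{line} {end_of_line_token}' for line in lines]

    if split_by_line:
        # One example per line
        return [(line,) for line in lines]
    # Otherwise a single example holding the whole corpus
    return [(' '.join(lines),)]


raw = b"the cat sat\non the mat\n"
whole = process_text(raw)
per_line = process_text(raw, split_by_line=True, end_of_line_token=None)
```

Here `whole` holds one example with the full text, while `per_line` holds one example per input line.
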
class flambe.nlp.language_modeling.Wiki103(split_by_line: bool = False, end_of_line_token: Optional[str] = '<eol>', remove_headers: bool = False, cache: bool = False, transform: Dict[str, Union[Field, Dict]] = None)

Bases: flambe.dataset.TabularDataset

The official WikiText103 dataset.

WIKI_URL = https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
_process(self, file: bytes)

Process the input file.

Parameters:file (bytes) – The input file, as a byte string
Returns:List of examples, where each example is a single element tuple containing the text.
Return type:List[Tuple[str]]
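
The `remove_headers` flag suggests that section titles can be filtered out of the raw text. The sketch below shows one way this might look, assuming headers are lines wrapped in `=` markers (as in the WikiText files). The helper `strip_headers` and the sample lines are hypothetical and are not taken from flambe.

```python
from typing import List


def strip_headers(lines: List[str]) -> List[str]:
    """Drop lines that look like WikiText section headers (hypothetical helper)."""
    kept = []
    for line in lines:
        text = line.strip()
        # Assumption: headers are wrapped in '=' markers, e.g. ' = Title = '
        if text.startswith('= ') and text.endswith(' ='):
            continue
        kept.append(line)
    return kept


sample = [
    ' = Some Title = ',
    ' First paragraph of the article . ',
    ' = = A Subsection = = ',
    ' A sentence under the subsection . ',
]
body = strip_headers(sample)
```
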
class flambe.nlp.language_modeling.Enwiki8(num_eval_symbols: int = 5000000, remove_end_of_line: bool = False, cache: bool = False, transform: Dict[str, Union[Field, Dict]] = None)

Bases: flambe.dataset.TabularDataset

The official enwik8 dataset.

ENWIKI_URL = http://mattmahoney.net/dc/enwik8.zip
_process(self, file: bytes)

Process the input file.

Parameters:file (bytes) – The input file, as a byte string
Returns:List of examples, where each example is a single element tuple containing the text.
Return type:List[Tuple[str]]
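
The `num_eval_symbols` parameter suggests that a fixed number of trailing symbols is held out for evaluation. The sketch below assumes the common enwik8 convention: the final block is the test set, the block before it is the validation set, and the rest is training data. The helper `split_corpus` is hypothetical, and this split is an assumption rather than a confirmed description of flambe's behavior.

```python
from typing import Tuple


def split_corpus(data: bytes, num_eval_symbols: int) -> Tuple[bytes, bytes, bytes]:
    """Split a byte corpus into train, validation and test parts (hypothetical helper)."""
    n = len(data)
    if 2 * num_eval_symbols >= n:
        raise ValueError('num_eval_symbols is too large for this corpus')
    # Assumption: the last two blocks of num_eval_symbols symbols are held out
    train = data[:n - 2 * num_eval_symbols]
    val = data[n - 2 * num_eval_symbols:n - num_eval_symbols]
    test = data[n - num_eval_symbols:]
    return train, val, test


corpus = bytes(range(20))  # toy stand-in for the real corpus
train, val, test = split_corpus(corpus, num_eval_symbols=5)
```

With the default of 5,000,000 symbols, the same slicing would hold out the last 10,000,000 symbols of the file.
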
class flambe.nlp.language_modeling.LMField(**kwargs)

Bases: flambe.field.TextField

Language Model field.

Generates the original tensor alongside its shifted version.

process(self, example: str)

Process an example and create 2 Tensors.

Parameters:example (str) – The example to process, as a single string
Returns:The processed example, tokenized and numericalized
Return type:Tuple[torch.Tensor, ..]
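
The "original plus shifted" pairing can be sketched without any dependencies. The example below uses plain Python lists in place of tensors, whitespace tokenization, and a hypothetical vocabulary. It assumes the source drops the last token and the target drops the first, which is the usual next-token setup but is not confirmed here as flambe's exact behavior.

```python
from typing import Dict, List, Tuple


def process_example(example: str,
                    vocab: Dict[str, int],
                    unk: int = 0) -> Tuple[List[int], List[int]]:
    """Hypothetical stand-in for a language model field's `process`."""
    # Tokenize on whitespace and map each token to its index
    ids = [vocab.get(token, unk) for token in example.split()]
    # The source drops the last token; the target drops the first
    return ids[:-1], ids[1:]


vocab = {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'down': 4}
source, target = process_example('the cat sat down', vocab)
```

At every position `i`, `target[i]` is the token that follows `source[i]` in the original text.
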
class flambe.nlp.language_modeling.LanguageModel(embedder: Embedder, output_layer: Module, dropout: float = 0, pad_index: int = 0, tie_weights: bool = False, tie_weight_attr: str = 'embedding')

Bases: flambe.nn.Module

Implement a LanguageModel for sequential classification.

This model can be used for language modeling, as well as other sequential classification tasks. The model produces predictions for the full sequence, so the effective number of examples is the batch size multiplied by the sequence length.

forward(self, data: Tensor, target: Optional[Tensor] = None)

Run a forward pass through the network.

Parameters:
  • data (Tensor) – The input data
  • target (Tensor, optional) – The target data; defaults to None
Returns:The output predictions of shape seq_len x batch_size x n_out
Return type:Union[Tensor, Tuple[Tensor, Tensor]]
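
To make the "batch size multiplied by sequence length" remark concrete, the sketch below flattens nested lists laid out as seq_len x batch_size x n_out into one prediction per token. Skipping positions whose target equals `pad_index` is an assumption based on the constructor parameter. The helper `flatten` and the toy values are hypothetical.

```python
from typing import List, Tuple


def flatten(pred: List[List[List[float]]],
            target: List[List[int]],
            pad_index: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Flatten per-token predictions and targets, skipping padding (hypothetical helper)."""
    flat_pred, flat_target = [], []
    for step_pred, step_target in zip(pred, target):        # loop over seq_len
        for scores, label in zip(step_pred, step_target):   # loop over batch_size
            if label == pad_index:
                continue  # assumption: padded positions are ignored
            flat_pred.append(scores)
            flat_target.append(label)
    return flat_pred, flat_target


# seq_len = 3, batch_size = 2, n_out = 3; the second sequence is padded after one step
pred = [
    [[0.1, 0.7, 0.2], [0.2, 0.1, 0.7]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]],
    [[0.2, 0.6, 0.2], [0.6, 0.2, 0.2]],
]
target = [[1, 2], [2, 0], [1, 0]]
flat_pred, flat_target = flatten(pred, target)
```

Without padding there would be 3 x 2 = 6 examples; here two padded positions are dropped, leaving four.
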
class flambe.nlp.language_modeling.CorpusSampler(batch_size: int = 128, unroll_size: int = 128, n_workers: int = 0, pin_memory: bool = False, downsample: Optional[float] = None, drop_last: bool = True)

Bases: flambe.sampler.sampler.Sampler

Implement a CorpusSampler object.

This object is useful for iterating over a large corpus of text in order. It takes as input a dataset with a single example containing the sequence of tokens, and yields batches containing source sequences drawn from the corpus text together with the same sequences shifted by one as targets.

static collate_fn(data: Sequence[Tuple[Tensor, Tensor]])

Create a batch from data.

Parameters:data (Sequence[Tuple[Tensor, Tensor]]) – List of (source, target) tuples.
Returns:Source and target Tensors.
Return type:Tuple[Tensor, Tensor]
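
The collation step amounts to regrouping a list of (source, target) pairs into one collection of sources and one of targets. A minimal sketch with plain lists in place of tensors follows. The helper `collate` is hypothetical, and any stacking or transposing the real method performs is not shown.

```python
from typing import List, Sequence, Tuple


def collate(data: Sequence[Tuple[List[int], List[int]]]
            ) -> Tuple[List[List[int]], List[List[int]]]:
    """Regroup (source, target) pairs into batched sources and targets."""
    # zip(*data) transposes the list of pairs into a pair of tuples
    sources, targets = zip(*data)
    return list(sources), list(targets)


pairs = [([1, 2, 3], [2, 3, 4]), ([5, 6, 7], [6, 7, 8])]
batch_source, batch_target = collate(pairs)
```
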
sample(self, data: Sequence[Sequence[Tensor]], n_epochs: int = 1)

Sample from the list of features and yield batches.

Parameters:
  • data (Sequence[Sequence[Tensor, ..]]) – The input data to sample from
  • n_epochs (int, optional) – The number of epochs to run in the output iterator. Use -1 to run infinitely.
Yields:Iterator[Tuple[Tensor]] – A batch of data, as a tuple of Tensors
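
The ordered iteration described for this sampler can be sketched in plain Python. The example assumes the token stream is cut into `batch_size` contiguous rows, which are then read in windows of `unroll_size`, with targets offset by one token. The function `iterate_corpus` is a hypothetical illustration using lists, not flambe's implementation, and it ignores `n_epochs` and `downsample`.

```python
from typing import Iterator, List, Tuple

Batch = Tuple[List[List[int]], List[List[int]]]


def iterate_corpus(tokens: List[int],
                   batch_size: int,
                   unroll_size: int,
                   drop_last: bool = True) -> Iterator[Batch]:
    """Yield (source, target) batches over one pass of the corpus."""
    # Source and target are the same stream, offset by one token
    source, target = tokens[:-1], tokens[1:]
    # Cut the stream into batch_size contiguous rows of equal length
    row_len = len(source) // batch_size
    src_rows = [source[i * row_len:(i + 1) * row_len] for i in range(batch_size)]
    tgt_rows = [target[i * row_len:(i + 1) * row_len] for i in range(batch_size)]
    # Walk over the rows in windows of unroll_size
    for start in range(0, row_len, unroll_size):
        end = start + unroll_size
        if end > row_len and drop_last:
            break  # skip the final, shorter window
        yield ([row[start:end] for row in src_rows],
               [row[start:end] for row in tgt_rows])


batches = list(iterate_corpus(list(range(13)), batch_size=2, unroll_size=3))
```

Because each row is contiguous, the second batch continues exactly where the first one stopped, which is what lets a recurrent model carry state across batches.
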

length(self, data: Sequence[Sequence[torch.Tensor]])

Return the number of batches in the sampler.

Parameters:data (Sequence[Sequence[torch.Tensor, ..]]) – The input data to sample from
Returns:The number of batches that would be created per epoch
Return type:int
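
Under the same layout assumption (one token lost to the source/target shift, `batch_size` contiguous rows, windows of `unroll_size`), the batch count per epoch follows from integer arithmetic. The helper `num_batches` is hypothetical, and the formula is a sketch of that assumption rather than a statement of what flambe computes.

```python
import math


def num_batches(n_tokens: int,
                batch_size: int,
                unroll_size: int,
                drop_last: bool = True) -> int:
    """Estimate batches per epoch for an ordered corpus sampler (hypothetical helper)."""
    # One token is lost because targets are the sources shifted by one
    row_len = (n_tokens - 1) // batch_size
    if drop_last:
        # Only full windows are counted
        return row_len // unroll_size
    # A final, shorter window is counted as well
    return math.ceil(row_len / unroll_size)


full = num_batches(13, batch_size=2, unroll_size=3)
partial = num_batches(13, batch_size=2, unroll_size=4, drop_last=False)
```
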