flambe.tokenizer.subword

Module Contents

class flambe.tokenizer.subword.BPETokenizer(codes_path: str)

Bases: flambe.tokenizer.Tokenizer

Implements a subword-level tokenizer using byte pair encoding (BPE). Tokenization is performed by fastBPE (https://github.com/glample/fastBPE) and requires a fastBPE codes file, whose location is given by codes_path.

tokenize(self, example: str)

Tokenize an input example.

Parameters: example (str) – The input example, as a string
Returns: The output subword tokens, as a list of strings
Return type: List[str]
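
To illustrate what subword tokenization with byte pair encoding produces, the following is a minimal pure-Python sketch of applying an ordered list of BPE merges to whitespace-separated words. It is not the flambe or fastBPE implementation: the names apply_bpe, toy_tokenize, TOY_MERGES and TOY_RANKS are hypothetical, the merge list is made up for the example rather than read from a codes file, and the "@@" continuation marker on non-final pieces is an assumed convention. End-of-word markers and other details of real BPE tooling are omitted.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Hypothetical toy merge list, earliest (highest-priority) merge first.
# A real codes file would supply these; this one is invented for the sketch.
TOY_MERGES: List[Pair] = [("l", "o"), ("lo", "w"), ("e", "r")]
TOY_RANKS: Dict[Pair, int] = {pair: i for i, pair in enumerate(TOY_MERGES)}


def apply_bpe(word: str, ranks: Dict[Pair, int]) -> List[str]:
    """Split a word into characters, then repeatedly merge the
    best-ranked adjacent pair until no known pair remains."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair appears in the merge table
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def toy_tokenize(example: str, ranks: Dict[Pair, int]) -> List[str]:
    """Tokenize a string into subword tokens, as a list of strings.
    Non-final pieces of a word get an assumed "@@" continuation marker."""
    tokens: List[str] = []
    for word in example.split():
        pieces = apply_bpe(word, ranks)
        tokens.extend(piece + "@@" for piece in pieces[:-1])
        tokens.append(pieces[-1])
    return tokens


print(toy_tokenize("low lower", TOY_RANKS))
```

With the class documented above, the corresponding call would be BPETokenizer(codes_path).tokenize(example), which likewise takes a string and returns a list of subword strings; the exact tokens depend on the codes file supplied.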