    tokenize-rt

    The stdlib tokenize module does not properly roundtrip. This wrapper around the stdlib provides two additional tokens ESCAPED_NL and UNIMPORTANT_WS, and a Token data type. Use src_to_tokens and tokens_to_src to roundtrip.

    This library is useful if you're writing a refactoring tool based on the python tokenization.

    Installation

    pip install tokenize-rt

    tokenize-rt requires Python >= 3.9.

    Usage

    datastructures

    tokenize_rt.Offset(line=None, utf8_byte_offset=None)

    A token offset, useful as a key when cross referencing the ast and the tokenized source.
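
    For example (a minimal sketch): an ast node's lineno / col_offset can be converted to an Offset and used to look up the matching token:

    >>> import ast
    >>> from tokenize_rt import Offset, src_to_tokens
    >>> src = 'x = 1\n'
    >>> name = ast.parse(src).body[0].targets[0]
    >>> key = Offset(line=name.lineno, utf8_byte_offset=name.col_offset)
    >>> [t for t in src_to_tokens(src) if t.offset == key]
    [Token(name='NAME', src='x', line=1, utf8_byte_offset=0)]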

    tokenize_rt.Token(name, src, line=None, utf8_byte_offset=None)

    Construct a token

    • name: one of the token names listed in token.tok_name or ESCAPED_NL or UNIMPORTANT_WS
    • src: token's source as text
    • line: the line number that this token appears on.
    • utf8_byte_offset: the utf8 byte offset at which this token appears in the line.

    tokenize_rt.Token.offset

    Retrieves an Offset for this token.
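
    For example:

    >>> from tokenize_rt import Token
    >>> Token('NAME', 'x', line=1, utf8_byte_offset=0).offset
    Offset(line=1, utf8_byte_offset=0)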

    converting to and from Token representations

    tokenize_rt.src_to_tokens(text: str) -> List[Token]

    tokenize_rt.tokens_to_src(Iterable[Token]) -> str
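
    Tokenizing and immediately un-tokenizing gives back the source unchanged, comments and whitespace included:

    >>> from tokenize_rt import src_to_tokens, tokens_to_src
    >>> src = 'x = (  # comment\n    1)\n'
    >>> tokens_to_src(src_to_tokens(src)) == src
    True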

    additional tokens added by tokenize-rt

    tokenize_rt.ESCAPED_NL

    tokenize_rt.UNIMPORTANT_WS
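
    These fill the gaps that the stdlib tokenizer discards. For instance, a backslash continuation and its surrounding whitespace appear as explicit tokens:

    >>> from tokenize_rt import src_to_tokens
    >>> tokens = src_to_tokens('x = \\\n    1\n')
    >>> [t.src for t in tokens if t.name == 'ESCAPED_NL']
    ['\\\n']
    >>> [t.src for t in tokens if t.name == 'UNIMPORTANT_WS']
    [' ', ' ', '    ']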

    helpers

    tokenize_rt.NON_CODING_TOKENS

    A frozenset containing tokens which may appear between others while not affecting control flow or code:

    • COMMENT
    • ESCAPED_NL
    • NL
    • UNIMPORTANT_WS
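
    A typical use is skipping past tokens which don't affect the code (next_coding_token below is a hypothetical helper, not part of tokenize-rt):

    from tokenize_rt import NON_CODING_TOKENS, src_to_tokens

    def next_coding_token(tokens, i):
        # advance past comments, escaped newlines, NL and unimportant
        # whitespace until a coding token is found
        while tokens[i].name in NON_CODING_TOKENS:
            i += 1
        return i

    tokens = src_to_tokens('x = (  # comment\n    1)\n')
    paren = next(i for i, t in enumerate(tokens) if t.src == '(')
    print(tokens[next_coding_token(tokens, paren + 1)].src)  # -> 1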

    tokenize_rt.parse_string_literal(text: str) -> Tuple[str, str]

    parse a string literal into its prefix and string content

    >>> parse_string_literal('f"foo"')
    ('f', '"foo"')
    

    tokenize_rt.reversed_enumerate(Sequence[Token]) -> Iterator[Tuple[int, Token]]

    yields (index, token) pairs. Useful for rewriting source.
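
    For instance, a minimal rewriter sketch (remove_u_prefix is illustrative, not part of the library) which strips u prefixes from string literals; iterating in reverse means a replacement never shifts the indices still to be visited:

    from tokenize_rt import (
        parse_string_literal,
        reversed_enumerate,
        src_to_tokens,
        tokens_to_src,
    )

    def remove_u_prefix(src):
        tokens = src_to_tokens(src)
        for i, token in reversed_enumerate(tokens):
            if token.name == 'STRING':
                prefix, rest = parse_string_literal(token.src)
                if 'u' in prefix.lower():
                    new_prefix = prefix.replace('u', '').replace('U', '')
                    # Token is a NamedTuple, so _replace returns an updated copy
                    tokens[i] = token._replace(src=new_prefix + rest)
        return tokens_to_src(tokens)

    print(remove_u_prefix("x = u'foo' + 'bar'"))  # -> x = 'foo' + 'bar'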

    tokenize_rt.rfind_string_parts(Sequence[Token], i) -> Tuple[int, ...]

    find the indices of the string parts of a (joined) string literal

    • i should start at the end of the string literal
    • returns () (an empty tuple) for things which are not string literals
    >>> tokens = src_to_tokens('"foo" "bar".capitalize()')
    >>> rfind_string_parts(tokens, 2)
    (0, 2)
    >>> tokens = src_to_tokens('("foo" "bar").capitalize()')
    >>> rfind_string_parts(tokens, 4)
    (1, 3)
    

    Differences from tokenize

    • tokenize-rt adds ESCAPED_NL for a backslash-escaped newline "token"
    • tokenize-rt adds UNIMPORTANT_WS for whitespace (discarded in tokenize)
    • tokenize-rt normalizes string prefixes, even if they are not parsed -- for instance, this means you'll see Token('STRING', "f'foo'", ...) even in python 2.
    • tokenize-rt normalizes python 2 long literals (4l / 4L) and octal literals (0755) in python 3 (for easier rewriting of python 2 code while running python 3).

    Sample usage