mutf8 1.0.6


Fast MUTF-8 encoder & decoder

pip install mutf8


    mutf-8

    This package contains simple pure-python as well as C encoders and decoders for the MUTF-8 character encoding. In most cases, you can also parse the even-rarer CESU-8.
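For readers who have not met the encoding before, MUTF-8 differs from standard UTF-8 in two ways: U+0000 is written as the two bytes `C0 80` so the output never contains a raw NUL, and characters above U+FFFF are written as a UTF-16 surrogate pair with each surrogate taking three bytes. CESU-8 shares the second rule but not the first, which is why it can usually be parsed as well. The sketch below illustrates those rules only; `encode_mutf8_sketch` and `_three_bytes` are hypothetical names and this is not the library's implementation.

```python
def _three_bytes(cp: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF as three bytes."""
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_mutf8_sketch(text: str) -> bytes:
    """Illustrative MUTF-8 encoder (hypothetical, not the mutf8 package)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # MUTF-8 never emits a raw NUL byte; U+0000 becomes C0 80.
            out += b"\xc0\x80"
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out += _three_bytes(cp)
        else:
            # Supplementary characters are split into a UTF-16 surrogate
            # pair, and each surrogate is written as three bytes.
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)
```

With these rules, `"\x00"` encodes to `b"\xc0\x80"` and U+20000 encodes to the six bytes `ED A1 80 ED B0 80`, while characters in the Basic Multilingual Plane other than U+0000 encode exactly as they do in UTF-8.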

    These days, you'll most likely encounter MUTF-8 when working on files or protocols related to the JVM. Strings in a Java .class file are encoded using MUTF-8, as are strings passed through the JNI and strings exported by the object serializer.

    This library was extracted from Lawu, a Python library for working with JVM class files.

    🎉 Installation

    Install the package from PyPI:

    pip install mutf8
    

    Binary wheels are available for the following:

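    |                  | py3.6 | py3.7 | py3.8 | py3.9 |
    |------------------|-------|-------|-------|-------|
    | OS X (x86_64)    | y     | y     | y     | y     |
    | Windows (x86_64) | y     | y     | y     | y     |
    | Linux (x86_64)   | y     | y     | y     | y     |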

    If no binary wheel is available for your platform, the installer will attempt to build the C extension from source with any C99 compiler. If the build fails, the package falls back to the pure-Python version.
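The fallback described here follows a common "try the fast module, then the slow one" import pattern. The sketch below shows that pattern in general terms; `first_importable` is a hypothetical helper, and the package's own fallback logic may be written differently.

```python
import importlib


def first_importable(candidates):
    """Return the first module in `candidates` that imports cleanly."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; move on to the next candidate.
            continue
    raise ImportError(f"none of {candidates!r} could be imported")


# Assumed usage with this package installed: prefer the C extension,
# otherwise take whatever the top-level package provides.
# impl = first_importable(["mutf8.cmutf8", "mutf8"])
```

A helper like this makes the choice of implementation explicit at the call site, which can be useful in logs or diagnostics.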

    Usage

    Encoding and decoding is simple:

    from mutf8 import encode_modified_utf8, decode_modified_utf8

    # byte_like_object is any bytes-like input holding MUTF-8 data.
    text = decode_modified_utf8(byte_like_object)  # bytes-like -> str
    data = encode_modified_utf8(text)              # str -> bytes
    

    This module does not register itself globally as a codec, since importing should be side-effect-free.
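If you do want `str.encode` and `bytes.decode` to understand a MUTF-8 codec name, you can register one yourself with the standard `codecs` module. The sketch below is one possible way to do that; `make_search_function` is a hypothetical helper, and it takes the encode and decode callables as parameters so it does not assume anything about the package beyond the two functions shown above.

```python
import codecs


def make_search_function(name, encode, decode):
    """Build a search function for codecs.register().

    `encode` maps str -> bytes and `decode` maps bytes -> str.
    `name` should be lowercase, since Python normalizes codec names
    before calling the search function.
    """
    def _encode(text, errors="strict"):
        # Codec encoders return (output, number of input items consumed).
        return encode(text), len(text)

    def _decode(data, errors="strict"):
        raw = bytes(data)
        return decode(raw), len(raw)

    info = codecs.CodecInfo(encode=_encode, decode=_decode, name=name)

    def search(requested):
        return info if requested == name else None

    return search


# Assumed usage with this package installed:
# from mutf8 import encode_modified_utf8, decode_modified_utf8
# codecs.register(
#     make_search_function("mutf8", encode_modified_utf8, decode_modified_utf8)
# )
# "text".encode("mutf8")
```

Registration is a process-wide side effect, so it is usually best done once, at application start-up, rather than inside a library.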

    📈 Benchmarks

    The C extension is significantly faster than the pure-Python version, often by 20x to 40x.

    MUTF-8 Decoding

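    | Name                         | Min (μs) | Max (μs) | StdDev  | Ops           |
    |------------------------------|----------|----------|---------|---------------|
    | cmutf8-decode_modified_utf8  | 0.00009  | 0.00080  | 0.00000 | 9957678.56358 |
    | pymutf8-decode_modified_utf8 | 0.00190  | 0.06040  | 0.00000 | 450455.96019  |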

    MUTF-8 Encoding

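    | Name                         | Min (μs) | Max (μs) | StdDev  | Ops            |
    |------------------------------|----------|----------|---------|----------------|
    | cmutf8-encode_modified_utf8  | 0.00008  | 0.00151  | 0.00000 | 11897361.05101 |
    | pymutf8-encode_modified_utf8 | 0.00180  | 0.16650  | 0.00000 | 474390.98091   |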

    C Extension

    The C extension is optional. If a binary package is not available, or a C compiler is not present, the pure-python version will be used instead. If you want to ensure you're using the C version, import it directly:

    from mutf8.cmutf8 import decode_modified_utf8
    
    decode_modified_utf8(b'\xED\xA1\x80\xED\xB0\x80')
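The six bytes in that call are a surrogate pair: `ED A1 80` is U+D840 and `ED B0 80` is U+DC00, which together stand for the single character U+20000. As a cross-check that needs only the standard library, the sketch below reproduces the decoding with Python's `surrogatepass` error handler. It is an illustration of the byte layout, not how the package decodes, and it does not handle the `C0 80` form of U+0000.

```python
data = b"\xED\xA1\x80\xED\xB0\x80"

# Step 1: decode each three-byte sequence to a lone UTF-16 surrogate.
# Strict UTF-8 rejects surrogates, so "surrogatepass" is required.
surrogates = data.decode("utf-8", "surrogatepass")

# Step 2: round-trip through UTF-16 so the two surrogates are joined
# into one supplementary code point.
text = surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

print(f"U+{ord(text):04X}")
```

The result is a one-character string, whereas a strict UTF-8 decoder would reject the same input outright.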