Japanese text normalizer for mecab-neologd

pip install neologdn

    neologdn

    neologdn is a Japanese text normalizer for mecab-neologd.

    The normalization is based on neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja

    Some optional features are also added.
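
To give a feel for the kind of rules involved, here is a minimal pure-Python sketch of three of them: folding hyphen-like characters, shortening runs of the long-vowel mark, and unifying character widths. It is not neologdn's implementation. `normalize_sketch` is a hypothetical name, the hyphen set is the one used in the usage examples, and NFKC is a blunt stand-in for the width rules (the real rules keep exceptions that NFKC does not).

```python
import re
import unicodedata

# Hyphen-like characters (U+02D7, U+058A, U+2010..U+2013, U+2043, U+207B, U+208B, U+2212).
# Assumption: this set mirrors the usage examples, not the library's exact table.
HYPHENS = "[\u02D7\u058A\u2010\u2011\u2012\u2013\u2043\u207B\u208B\u2212]+"


def normalize_sketch(text: str) -> str:
    """Rough approximation of a few neologd-style normalization rules."""
    # Collapse any run of hyphen-like characters into one ASCII hyphen.
    text = re.sub(HYPHENS, "-", text)
    # Shorten a run of the long-vowel mark to a single mark.
    text = re.sub("ー+", "ー", text)
    # NFKC maps half-width katakana to full-width and full-width ASCII to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Drop leading and trailing whitespace.
    return text.strip()


print(normalize_sketch("長音短縮ウェーーーーイ"))  # 長音短縮ウェーイ
```

Unlike neologdn, this sketch does not remove spaces between Japanese characters and does not handle tildes or repeated substrings.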

    Contributions are welcome!

    NOTE: Installing this module requires a C++11 compiler.

    Installation

    pip install neologdn
    

    If setuptools is not installed, you must install it:

    pip install setuptools
    

    If you encounter the following error:

    ERROR: Could not find a version that satisfies the requirement setuptools (from versions: none)
    

    The following commands may resolve this error:

    pip install wheel
    pip install --no-build-isolation neologdn
    

    Usage

    import neologdn
    neologdn.normalize("ﾊﾝｶｸｶﾅ")
    # => 'ハンカクカナ'
    neologdn.normalize("全角記号！？＠＃")
    # => '全角記号!?@#'
    neologdn.normalize("全角記号例外「・」")
    # => '全角記号例外「・」'
    neologdn.normalize("長音短縮ウェーーーーイ")
    # => '長音短縮ウェーイ'
    neologdn.normalize("チルダ削除ウェ~∼∾〜〰~イ")
    # => 'チルダ削除ウェイ'
    neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
    # => 'いろんなハイフン-'
    neologdn.normalize("　　　ＰＲＭＬ　　副　読　本　　　")
    # => 'PRML副読本'
    neologdn.normalize(" Natural Language Processing ")
    # => 'Natural Language Processing'
    neologdn.normalize("かわいいいいいいいいい", repeat=6)
    # => 'かわいいいいいい'
    neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
    # => '無駄ァ'
    neologdn.normalize("1995〜2001年", tilde="normalize")
    # => '1995~2001年'
    neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
    # => '1995〜2001年'
    neologdn.normalize("1995〜2001年", tilde="ignore")  # Don't convert tilde
    # => '1995〜2001年'
    neologdn.normalize("1995〜2001年", tilde="remove")
    # => '19952001年'
    neologdn.normalize("1995〜2001年")  # Default parameter
    # => '19952001年'
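
The `repeat` and `tilde` options may be easier to reason about with a rough model in plain Python. The sketch below is an approximation written for illustration, not neologdn's C++ implementation. `shorten_repeat_sketch` and `handle_tilde_sketch` are hypothetical names, and the tilde set is simply the one that appears in the examples above.

```python
import re

# Tilde-like characters from the usage examples (\uff5e is the full-width tilde).
# Assumption: the library's own table may differ.
TILDES = "~∼∾〜〰\uff5e"


def shorten_repeat_sketch(text: str, repeat: int) -> str:
    """Cut any substring repeated more than `repeat` times in a row down to `repeat` copies."""
    if repeat < 1:
        raise ValueError("repeat must be >= 1")
    # (.+?) captures the shortest repeating unit; \1{repeat,} requires at least
    # `repeat` further copies, i.e. repeat + 1 or more copies in total.
    pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
    return pattern.sub(lambda m: m.group(1) * repeat, text)


def handle_tilde_sketch(text: str, mode: str = "remove") -> str:
    """Model the four tilde modes: remove, normalize, normalize_zenkaku, ignore."""
    if mode == "ignore":
        return text
    replacements = {"remove": "", "normalize": "~", "normalize_zenkaku": "〜"}
    if mode not in replacements:
        raise ValueError(f"unknown tilde mode: {mode!r}")
    # Map every tilde-like character to the replacement chosen by the mode.
    table = str.maketrans({ch: replacements[mode] for ch in TILDES})
    return text.translate(table)


print(shorten_repeat_sketch("無駄無駄無駄無駄ァ", repeat=1))  # 無駄ァ
print(handle_tilde_sketch("1995〜2001年", mode="normalize"))  # 1995~2001年
```

Note that, under this model, `repeat=1` also collapses doubled characters inside ordinary words and numbers, so a larger threshold is usually the safer choice for text that contains them.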
    

    Benchmark

    
    # Sample code from
    # https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
    import normalize_neologd
    
    %timeit normalize(normalize_neologd.normalize_neologd)
    # => 9.55 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    import neologdn
    %timeit normalize(neologdn.normalize)
    # => 6.66 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    neologdn is about 1.43x faster than the sample code (9.55 s vs. 6.66 s).

    Details are described in the notebook below: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
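
The snippet above relies on IPython's `%timeit` and on a `normalize` helper that is not shown here. As a rough stand-in, the sketch below times several normalizer functions over the same corpus using only the standard library. `apply_to_corpus` and `compare` are hypothetical names, and the assumption that the helper applies the function to every line of a corpus is a guess; the notebook has the actual definition.

```python
import timeit


def apply_to_corpus(normalizer, corpus):
    """Apply `normalizer` to every line of `corpus` and return the results."""
    return [normalizer(line) for line in corpus]


def compare(normalizers, corpus, runs=3):
    """Return the best wall-clock time in seconds for each named normalizer."""
    results = {}
    for name, fn in normalizers.items():
        # Bind fn as a default argument so each timed call uses the right function.
        timings = timeit.repeat(
            lambda fn=fn: apply_to_corpus(fn, corpus), number=1, repeat=runs
        )
        results[name] = min(timings)
    return results


if __name__ == "__main__":
    # str.strip and str.lower are placeholders so the sketch runs without
    # neologdn installed; swap in neologdn.normalize to measure it.
    corpus = ["  Natural Language Processing  "] * 10000
    for name, seconds in compare({"strip": str.strip, "lower": str.lower}, corpus).items():
        print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over several runs reduces noise from other processes; `%timeit` reports mean and standard deviation instead, so the numbers are not directly comparable.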

    License

    Apache Software License.

    CHANGES

    0.5.4 (2025-03-15)

    • Support Python 3.13
    • Fix tilde loss after latin and whitespace (Many thanks @a-lucky)

    0.5.3 (2024-05-03)

    • Support Python 3.12

    0.5.2 (2023-08-03)

    • Support Python 3.10 and 3.11 (Many thanks @polm)

    0.5.1 (2021-05-02)

    • Improve performance of shorten_repeat function (Many thanks @yskn67)
    • Add tilde option to normalize function

    0.4 (2018-12-06)

    • Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ

    0.3.2 (2018-05-17)

    • Add option to suppress removal of spaces between Japanese characters

    0.2.2 (2018-03-10)

    • Fix bug (daku-ten & handaku-ten)
    • Support macOS 10.13 (Many thanks @r9y9)

    0.2.1 (2017-01-23)

    • Fix bug (check whether the character preceding a daku-ten character is in the maps) (Many thanks @unnonouno)

    0.2 (2016-04-12)

    • Add lengthened expression (repeating character) threshold

    0.1.2 (2016-03-29)

    • Fix installation bug

    0.1.1.1 (2016-03-19)

    • Support Windows
    • Explicitly specify -std=c++11 in build (Many thanks @id774)

    0.1.1 (2015-10-10)

    Initial release.

    Contribution

    Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md

    Cited by

    Book

    • 山本 和英. テキスト処理の要素技術. 近代科学者. P.41. 2021.

    Blog