nlpo3 1.3.0
Requires Python: >=3.6
nlpO3 Python binding
Python binding for nlpO3, a Thai natural language processing library in Rust.
Features
- Thai word tokenizer
  - segment(): uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
    - 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
  - load_dict(): loads a dictionary from a plain text file (one word per line)
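To give a feel for what dictionary-based maximal matching means, here is a minimal pure-Python sketch. It is not nlpO3's Rust implementation and not PyThaiNLP's newmm: the function name `maximal_matching` is hypothetical, the sketch ignores Thai Character Cluster boundaries, and it simply picks the split with the fewest tokens, letting unknown characters stand alone.

```python
from typing import Iterable, List


def maximal_matching(text: str, words: Iterable[str]) -> List[str]:
    """Split text into the fewest tokens using a word list (simplified sketch)."""
    vocab = set(words)
    # Longest dictionary word bounds how far back we need to look.
    max_len = max([len(w) for w in vocab] + [1])
    n = len(text)
    inf = float("inf")
    cost = [0] + [inf] * n  # cost[i]: fewest tokens covering text[:i]
    prev = [0] * (n + 1)    # prev[i]: start index of the last token in text[:i]

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # A single unknown character is accepted as a token,
            # so every input can be split even with a sparse dictionary.
            if piece in vocab or end - start == 1:
                if cost[start] + 1 < cost[end]:
                    cost[end] = cost[start] + 1
                    prev[end] = start

    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[prev[i]:i])
        i = prev[i]
    return tokens[::-1]


print(maximal_matching("สวัสดีครับ", ["สวัสดี", "ครับ", "สวัส", "ดี"]))
```

Preferring the fewest tokens is what makes the longer entry "สวัสดี" win over the shorter entries "สวัส" and "ดี" in this example.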
Dictionary file
- To keep the library small, nlpO3 does not bundle a dictionary and does not assume which dictionary the developer would like to use. The dictionary-based word tokenizer needs one, so you have to supply it.
- For a tokenization dictionary, try one of these:
  - words_th.txt from PyThaiNLP - around 62,000 words (CC0)
  - word break dictionary from libthai - consists of dictionaries in different categories, with a make script (LGPL-2.1)
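If you want to assemble your own word list instead, the file only has to follow the one-word-per-line layout described above. The helper below is a minimal sketch: `write_dict_file` is a hypothetical name, not part of nlpO3, and writing the file as UTF-8 is an assumption.

```python
from pathlib import Path
from typing import Iterable


def write_dict_file(words: Iterable[str], path: str) -> int:
    """Write unique, non-empty words to `path`, one word per line.

    Returns the number of words written. UTF-8 is assumed here.
    """
    # Strip surrounding whitespace, drop blanks, and remove duplicates.
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("".join(w + "\n" for w in unique), encoding="utf-8")
    return len(unique)
```

The resulting path can then be passed to load_dict() as shown in the Usage section.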
Install
pip install nlpo3
Usage
Load the file path/to/dict.file into memory and assign the name dict_name to it. Then tokenize a text with the dict_name dictionary:
from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
It returns a list of strings:
['สวัสดี', 'ครับ']
(the result depends on the words included in the dictionary)
Use multithread mode, again with the dict_name dictionary:
segment("สวัสดีครับ", dict_name="dict_name", parallel=True)
Use safe mode to avoid long waiting times in edge cases where the text has many ambiguous word boundaries:
segment("สวัสดีครับ", dict_name="dict_name", safe=True)
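When tokenizing many lines, the calls above can be wrapped in a small helper. This is only a sketch: `tokenize_lines` and its `segment_fn` parameter are hypothetical names, and the helper relies solely on the dict_name and safe keyword arguments shown in the examples above. The segment function is passed in as an argument so the helper can also be tried with a stand-in function when nlpO3 is not installed.

```python
from typing import Callable, Iterable, List


def tokenize_lines(
    lines: Iterable[str],
    segment_fn: Callable[..., List[str]],
    dict_name: str = "dict_name",
    safe: bool = True,
) -> List[List[str]]:
    """Tokenize each non-empty line with the given segment function."""
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        # segment_fn is expected to behave like nlpo3.segment.
        results.append(segment_fn(text, dict_name=dict_name, safe=safe))
    return results
```

With nlpO3 installed and a dictionary already loaded under dict_name, a call would look like tokenize_lines(open("input.txt", encoding="utf-8"), segment), where input.txt is a placeholder file name.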
Build
Requirements
- Rust 2018 Edition
- Python 3.6 or newer
- Python Development Headers
  - Ubuntu: sudo apt-get install python3-dev
  - macOS: No action needed
- PyO3 - already included in Cargo.toml
- setuptools-rust
Steps
python -m pip install --upgrade build
python -m build
This should generate a wheel file in the dist/ directory, which can be installed with pip.
Issues
Please report issues at https://github.com/PyThaiNLP/nlpo3/issues