SudachiPy 0.6.8
Python version of Sudachi, the Japanese Morphological Analyzer
pip install sudachipy
SudachiPy
SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
This is not a pure Python implementation, but a set of bindings for Sudachi.rs.
Binary wheels
We provide binary builds for macOS (10.14+), Windows, and Linux, for the x86_64 architecture only. The 32-bit x86 architecture is not supported and is not tested. macOS source builds seem to work on ARM-based (AArch64) Macs, but this architecture is also untested and requires installing the Rust toolchain and Cargo.
More information here.
TL;DR
$ pip install sudachipy sudachidict_core
$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅 名詞,固有名詞,一般,*,*,* 高輪ゲートウェイ駅
EOS
$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪 名詞,固有名詞,地名,一般,*,* 高輪
ゲートウェイ 名詞,普通名詞,一般,*,*,* ゲートウェー
駅 名詞,普通名詞,一般,*,*,* 駅
EOS
$ echo "空缶空罐空きカン" | sudachipy -a
空缶 名詞,普通名詞,一般,*,*,* 空き缶 空缶 アキカン 0
空罐 名詞,普通名詞,一般,*,*,* 空き缶 空罐 アキカン 0
空きカン 名詞,普通名詞,一般,*,*,* 空き缶 空きカン アキカン 0
EOS
from sudachipy import Dictionary, SplitMode
tokenizer = Dictionary().create()
morphemes = tokenizer.tokenize("国会議事堂前駅")
print(morphemes[0].surface()) # '国会議事堂前駅'
print(morphemes[0].reading_form()) # 'コッカイギジドウマエエキ'
print(morphemes[0].part_of_speech()) # ['名詞', '固有名詞', '一般', '*', '*', '*']
morphemes = tokenizer.tokenize("国会議事堂前駅", SplitMode.A)
print([m.surface() for m in morphemes]) # ['国会', '議事', '堂', '前', '駅']
Setup
You need SudachiPy and a dictionary.
Step 1. Install SudachiPy
$ pip install sudachipy
Step 2. Get a Dictionary
You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).
$ pip install sudachidict_core
Alternatively, you can choose another dictionary edition. See the Dictionary Edition section for details.
Usage: As a command
There is a CLI command, sudachipy.
$ echo "外国人参政権" | sudachipy
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国 名詞,普通名詞,一般,*,*,* 外国
人 接尾辞,名詞的,一般,*,*,* 人
参政 名詞,普通名詞,一般,*,*,* 参政
権 接尾辞,名詞的,一般,*,*,* 権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
[-a] [-d] [-v]
[file [file ...]]
Tokenize Text
positional arguments:
file text written in utf-8
optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-s string sudachidict type
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
Note: The debug option (-d) is disabled in version 0.6.0.
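The options listed in the help text above can also be assembled from Python. The following is a minimal sketch, not part of SudachiPy: the helper names tokenize_command and run_tokenize are hypothetical, and it assumes the sudachipy command is on PATH when run_tokenize is called. It uses only the -r, -m, -s, and -a flags documented above.

```python
import subprocess


def tokenize_command(mode=None, config=None, dict_type=None, all_fields=False):
    """Build the argument list for the sudachipy CLI (hypothetical helper)."""
    if mode is not None and mode not in ("A", "B", "C"):
        raise ValueError(f"mode must be A, B, or C, got {mode!r}")
    args = ["sudachipy"]
    if config is not None:
        args += ["-r", str(config)]   # setting file in JSON format
    if mode is not None:
        args += ["-m", mode]          # mode of splitting
    if dict_type is not None:
        args += ["-s", dict_type]     # sudachidict type
    if all_fields:
        args.append("-a")             # print all of the fields
    return args


def run_tokenize(text, **options):
    """Send text to the CLI on stdin and return its stdout.

    Assumes sudachipy and a dictionary are installed.
    """
    result = subprocess.run(
        tokenize_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

For example, run_tokenize("外国人参政権", mode="A") would correspond to the second shell command shown above.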
Output
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the -a option, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
  - 0 for the system dictionary
  - 1 and above for the user dictionaries
  - -1 if a word is Out-of-Vocabulary (not in the dictionary)
- Synonym group IDs
- (OOV) if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachipy -a
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権 外国人参政権 ガイコクジンサンセイケン 0 []
EOS
$ echo "阿quei" | sudachipy -a
阿 名詞,普通名詞,一般,*,*,* 阿 阿 -1 [] (OOV)
quei 名詞,普通名詞,一般,*,*,* quei quei -1 [] (OOV)
EOS
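If you post-process this output in Python, the column layout described above can be read back as follows. This is only a sketch: parse_output_line is a hypothetical helper, it assumes the (OOV) marker arrives as its own trailing tab-separated field, and it keeps the synonym group IDs as raw text rather than guessing at their exact format.

```python
def parse_output_line(line):
    """Parse one tab-separated line of sudachipy output.

    Returns None for the EOS sentence terminator, otherwise a dict.
    """
    line = line.rstrip("\r\n")
    if line == "EOS":
        return None
    fields = line.split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}")
    record = {
        "surface": fields[0],
        "pos": fields[1].split(","),   # part-of-speech tags are comma separated
        "normalized_form": fields[2],
    }
    extra = fields[3:]                 # present only with the -a option
    if extra:
        if len(extra) < 3:
            raise ValueError("incomplete -a columns")
        record["dictionary_form"] = extra[0]
        record["reading_form"] = extra[1]
        record["dictionary_id"] = int(extra[2])   # -1 means Out-of-Vocabulary
        # Kept as raw text; some versions may not print this column at all.
        record["synonym_group_ids"] = extra[3] if len(extra) > 3 else None
        record["oov"] = "(OOV)" in extra[3:]
    return record
```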
Usage: As a Python package
API
See API reference page.
Example
from sudachipy import Dictionary, SplitMode
tokenizer_obj = Dictionary().create()
# Multi-granular Tokenization
# SplitMode.C is the default mode
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.C)]
# => ['国家公務員']
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.B)]
# => ['国家', '公務員']
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.A)]
# => ['国家', '公務', '員']
# Morpheme information
m = tokenizer_obj.tokenize("食べ")[0]
m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization
tokenizer_obj.tokenize("附属", SplitMode.C)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", SplitMode.C)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", SplitMode.C)[0].normalized_form()
# => 'シミュレーション'
(With the 20210802 core dictionary. The results may change when you use other versions.)
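To compare the split modes side by side, the three calls in the example can be wrapped in a small function. This is a sketch rather than part of the SudachiPy API: split_in_all_modes is a hypothetical name, and it relies only on the tokenize(text, mode) and surface() calls shown above.

```python
def split_in_all_modes(tokenizer, text, modes):
    """Return {mode_name: [surface, ...]} for each split mode in `modes`."""
    return {
        name: [m.surface() for m in tokenizer.tokenize(text, mode)]
        for name, mode in modes.items()
    }


# With SudachiPy and a dictionary installed, this could be called as:
#   from sudachipy import Dictionary, SplitMode
#   modes = {"A": SplitMode.A, "B": SplitMode.B, "C": SplitMode.C}
#   split_in_all_modes(Dictionary().create(), "国家公務員", modes)
```

Passing the modes in as a dict keeps the helper free of any import of sudachipy itself, so it can be exercised with a stand-in tokenizer in tests.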
Dictionary Edition
There are three editions of Sudachi Dictionary, namely, small, core, and full. See WorksApplications/SudachiDict for details.
SudachiPy uses sudachidict_core by default.
Dictionaries are installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.
The dictionary files are not included in the package itself; they are downloaded upon installation.
Dictionary option: command line
You can specify the dictionary with the tokenize option -s.
$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
Dictionary option: Python package
You can specify the dictionary with the Dictionary() argument config_path or dict_type.
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
- config_path
  - You can specify the file path to the setting file with config_path (see Dictionary in The Setting File for details).
  - If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
- dict_type
  - You can also specify the dictionary type with dict_type.
  - The available arguments are small, core, or full.
  - If different dictionaries are specified with config_path and dict_type, the dictionary defined by dict_type overrides the one defined in the config path.
from sudachipy import Dictionary
# default: sudachidict_core
tokenizer_obj = Dictionary().create()
# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json").create()
# The dictionary specified by `dict_type` will be set.
tokenizer_obj = Dictionary(dict_type="core").create() # sudachidict_core (same as default)
tokenizer_obj = Dictionary(dict_type="small").create() # sudachidict_small
tokenizer_obj = Dictionary(dict_type="full").create() # sudachidict_full
# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
Dictionary in The Setting File
Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.
{
"systemDict" : "relative/path/from/resourceDir/to/system.dic",
...
}
The default setting file is sudachi.json. You can specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
User Dictionary
To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.
{
"userDict" : ["relative/path/to/user.dic"],
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
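The relative path from sudachi.json to each user.dic can be computed rather than typed by hand. The following is a minimal sketch using only the standard library. set_user_dicts is a hypothetical helper, not part of SudachiPy. It preserves any keys already present in an existing setting file and only sets userDict.

```python
import json
import os


def set_user_dicts(config_path, user_dict_paths):
    """Set userDict in a sudachi.json, using paths relative to that file."""
    config_dir = os.path.dirname(os.path.abspath(config_path))
    config = {}
    if os.path.exists(config_path):
        # Keep whatever settings the file already contains.
        with open(config_path, encoding="utf-8") as f:
            config = json.load(f)
    config["userDict"] = [
        os.path.relpath(os.path.abspath(path), start=config_dir)
        for path in user_dict_paths
    ]
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=2)
    return config
```

The resulting file can then be passed to the command with -r, as shown above, or to Dictionary(config_path=...).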
You can build a user dictionary with the subcommand ubuild.
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
Build User Dictionary
positional arguments:
file source files with CSV format (one or more)
optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary path (default: system core dictionary path)
For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
Customized System Dictionary
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
Build Sudachi Dictionary
positional arguments:
file source files with CSV format (one of more)
optional arguments:
-h, --help show this help message and exit
-o file output file (default: system.dic)
-d string description comment to be embedded on dictionary
required named arguments:
-m file connection matrix file with MeCab's matrix.def format
To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
For Developers
Build from source
Install sdist via pip
- Install the Python modules setuptools and setuptools-rust.
- Run ./build-sdist.sh in the python dir.
  - The source distribution will be generated under the python/dist/ dir.
- Install it via pip: pip install ./python/dist/SudachiPy-[version].tar.gz
Install develop build
- Install the Python modules setuptools and setuptools-rust.
- Run python3 setup.py develop.
  - develop will create a debug build, while install will create a release build.
- Now you can import the module with import sudachipy.
ref: setuptools-rust
Test
Run build_and_test.sh to run the tests.
Contact
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!