SEFR-CUT 1.1
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation)
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
SEFR CUT uses a CRF as the stacked model and DeepCut as the baseline model.
Read more:
- Paper: Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
- Blog (in Thai): Domain Adaptation กับตัวตัดคำ มันดีย์จริงๆ ("Domain Adaptation for Word Segmenters: It Really Works")
Install
pip install sefr_cut
How to Use
Requirements
- Python >= 3.6
- python-crfsuite >= 0.9.7
- pyahocorasick == 1.4.0
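The requirements above can be checked before installing or loading a model. The following is a minimal sketch, not part of SEFR CUT: the helper names (`parse_version`, `check_environment`, `installed_versions`) are hypothetical, the version parser only handles plain dotted numbers, and `importlib.metadata` is assumed to be available (Python 3.8 or newer).

```python
import sys
from itertools import takewhile

# Version constraints copied from the requirements list above.
REQUIREMENTS = {
    "python-crfsuite": (">=", "0.9.7"),
    "pyahocorasick": ("==", "1.4.0"),
}
MIN_PYTHON = (3, 6)


def parse_version(text):
    """Turn '0.9.7' into (0, 9, 7); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, op, required):
    """Compare two dotted versions, padding the shorter one with zeros."""
    have, want = parse_version(installed), parse_version(required)
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want if op == ">=" else have == want


def check_environment(installed, python_version=sys.version_info[:2]):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python >= 3.6 is required")
    for name, (op, required) in REQUIREMENTS.items():
        version = installed.get(name)
        if version is None:
            problems.append(f"{name} is not installed")
        elif not satisfies(version, op, required):
            problems.append(f"{name} {version} does not satisfy {op} {required}")
    return problems


def installed_versions():
    """Look up installed versions by distribution name (Python 3.8+)."""
    from importlib import metadata
    found = {}
    for name in REQUIREMENTS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

Calling `check_environment(installed_versions())` returns an empty list when every requirement is met, and one message per unmet requirement otherwise.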
Example
- Example files are available in the SEFR Example notebook
- Try it on Colab
Load Engine & Engine Mode
- ws1000, tnhc, best
  - ws1000: Model trained on Wisesight-1000 and tested on Wisesight-160
  - tnhc: Model trained on TNHC (80:20 train/test split with random seed 42)
  - best: Model trained on the BEST-2010 corpus (NECTEC)
sefr_cut.load_model(engine='ws1000')
# OR sefr_cut.load_model(engine='tnhc')
# OR sefr_cut.load_model(engine='best')
- tl-deepcut-XXXX
  - DeepCut models adapted with transfer learning are also provided: tl-deepcut-ws1000 (trained on Wisesight) and tl-deepcut-tnhc (trained on TNHC)
sefr_cut.load_model(engine='tl-deepcut-ws1000')
# OR sefr_cut.load_model(engine='tl-deepcut-tnhc')
- deepcut
  - The original DeepCut model is also provided
sefr_cut.load_model(engine='deepcut')
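Because the engine is selected by a plain string, a typo may only surface when the model is loaded. The sketch below validates the name first. It is an illustration only: `ENGINES`, `normalize_engine`, and `load_engine` are hypothetical helpers, not part of the sefr_cut API. The engine names and descriptions come from the list above, and the only library call used is `sefr_cut.load_model(engine=...)` as shown there.

```python
# Engine names and descriptions, taken from the list above.
ENGINES = {
    "ws1000": "stacked model trained on Wisesight-1000",
    "tnhc": "stacked model trained on TNHC",
    "best": "stacked model trained on BEST-2010 (NECTEC)",
    "tl-deepcut-ws1000": "DeepCut with transfer learning on Wisesight",
    "tl-deepcut-tnhc": "DeepCut with transfer learning on TNHC",
    "deepcut": "original DeepCut",
}


def normalize_engine(name):
    """Strip whitespace, lowercase, and check the name against ENGINES."""
    key = name.strip().lower()
    if key not in ENGINES:
        known = ", ".join(sorted(ENGINES))
        raise ValueError(f"unknown engine {name!r}; expected one of: {known}")
    return key


def load_engine(name):
    """Validate the engine name, then hand it to sefr_cut.load_model."""
    import sefr_cut  # imported lazily so validation works without the package
    sefr_cut.load_model(engine=normalize_engine(name))
```

Lowercasing the input is a design choice made here so that a name written as `BEST` maps to the `best` value used in the `load_model` call.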
Segment Example
- Segment with default k
import sefr_cut

sefr_cut.load_model(engine='ws1000')
# tokenize accepts a list of strings or a single string
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))
print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))
print(sefr_cut.tokenize('สวัสดีประเทศไทย'))
[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', '