thai-nner 0.3
Thai Nested Named Entity Recognition
pip install thai-nner
Requires Python: >=3.6
Thai-NNER (Thai Nested Named Entity Recognition Corpus)
Code associated with the paper "Thai Nested Named Entity Recognition Corpus", published in Findings of the Association for Computational Linguistics: ACL 2022.
Abstract / Motivation
This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions across 104 classes, with a maximum nesting depth of 8 layers, drawn from 4,894 documents (news articles and restaurant reviews). To the best of our knowledge, it is the largest non-English N-NER dataset and the first non-English one with fine-grained classes.
How to use?
Install
pip install thai_nner
Usage
Download a model checkpoint from "data/[checkpoints]": Download
Example: 0906_214036/checkpoint.pth
Then convert it for inference with the convert_model2use.py script:
python convert_model2use.py -i 0906_214036/checkpoint.pth -o model.pth
Usage Example
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0" # for non-gpu: os.environ['CUDA_VISIBLE_DEVICES'] = ""
from thai_nner import NNER
nner = NNER("model.pth")
nner.get_tag("วันนี้วันที่ 5 เมษายน 2565 เป็นวันที่อากาศดีมาก")
# output: (['<s>', 'วันนี้', 'วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65', '', '', 'เป็น', 'วันที่', '', 'อากาศ', '', 'ดีมาก', '</s>'], [{'text': ['วันนี้'], 'span': [1, 2], 'entity_type': 'rel'}, {'text': ['วันที่', '', '', '5'], 'span': [2, 6], 'entity_type': 'day'}, {'text': ['วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65'], 'span': [2, 13], 'entity_type': 'date'}, {'text': ['', '5'], 'span': [4, 6], 'entity_type': 'cardinal'}, {'text': ['', 'เมษายน'], 'span': [7, 9], 'entity_type': 'month'}, {'text': ['', '25', '65'], 'span': [10, 13], 'entity_type': 'year'}])
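The second element of the result is a list of entity dictionaries, and each `span` reads as a half-open `[start, end)` range over the token list (for example, `[2, 6]` covers tokens 2 through 5). The following is a minimal sketch, assuming that output structure, of how the nesting level of each entity could be derived from the spans alone. The helper `nesting_level` is hypothetical and not part of `thai_nner`; the entity list is copied from the example output above so the snippet runs without a model.

```python
from typing import Dict, List

# Entity list copied from the example output above.
entities: List[Dict] = [
    {'text': ['วันนี้'], 'span': [1, 2], 'entity_type': 'rel'},
    {'text': ['วันที่', '', '', '5'], 'span': [2, 6], 'entity_type': 'day'},
    {'text': ['วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65'],
     'span': [2, 13], 'entity_type': 'date'},
    {'text': ['', '5'], 'span': [4, 6], 'entity_type': 'cardinal'},
    {'text': ['', 'เมษายน'], 'span': [7, 9], 'entity_type': 'month'},
    {'text': ['', '25', '65'], 'span': [10, 13], 'entity_type': 'year'},
]


def nesting_level(entity: Dict, all_entities: List[Dict]) -> int:
    """Hypothetical helper: 1 + number of entities whose span encloses this one."""
    start, end = entity["span"]
    enclosing = 0
    for other in all_entities:
        o_start, o_end = other["span"]
        # An enclosing span covers this span and is not identical to it.
        covers = o_start <= start and end <= o_end
        if covers and (o_start, o_end) != (start, end):
            enclosing += 1
    return enclosing + 1


# Map each entity type in this example to its nesting level.
levels = {e["entity_type"]: nesting_level(e, entities) for e in entities}

# Print outermost entities first, indenting nested ones.
for e in sorted(entities, key=lambda e: (e["span"][0], -e["span"][1])):
    indent = "  " * (levels[e["entity_type"]] - 1)
    print(f"{indent}{e['entity_type']} {e['span']}")
```

In this example `date` is the outermost entity, `day`, `month`, and `year` sit inside it, and `cardinal` sits inside `day`. Keying `levels` by entity type only works here because each type occurs once; for general input, keying by span would be safer.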
Example
Python library
Test
Dataset and Models
Model Checkpoints
Download and save the model checkpoints at the following path "data/[checkpoints]": Download
Dataset
Download and save the dataset at the following path "data/[scb-nner-th-2022]": Download
Pre-trained Language Model
Download and save the pre-trained language model at the following path "data/[lm]": Download
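Before training or testing, it can help to confirm that the three download locations above exist. This is a small sketch, assuming the bracketed directory names are literal (if the brackets are placeholders in your checkout, adjust `EXPECTED_DIRS`). The function `missing_dirs` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path
from typing import List, Sequence

# Paths taken verbatim from the sections above (assumed to be literal names).
EXPECTED_DIRS = (
    "data/[checkpoints]",
    "data/[scb-nner-th-2022]",
    "data/[lm]",
)


def missing_dirs(root: str = ".", expected: Sequence[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected directories that do not exist under `root`."""
    root_path = Path(root)
    return [d for d in expected if not (root_path / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected data directories are present.")
```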
Training/Testing
Train
python train.py --device 0,1 -c config.json
Test
python test_nne.py --resume [PATH]/checkpoint.pth
Tensorboard
tensorboard --logdir [PATH]/save/log/
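The train and test commands above can also be launched from a Python script, for instance to run an evaluation right after training. The sketch below only uses the flags shown above (`--device`, `-c`, `--resume`). The helper names `train_command`, `eval_command`, and `run` are hypothetical, and the checkpoint path is left as a parameter because it depends on your run.

```python
import subprocess
from typing import List, Sequence


def train_command(devices: Sequence[int], config: str) -> List[str]:
    """Build the argv for train.py, e.g. devices [0, 1] -> '--device 0,1'."""
    device_arg = ",".join(str(d) for d in devices)
    return ["python", "train.py", "--device", device_arg, "-c", config]


def eval_command(checkpoint: str) -> List[str]:
    """Build the argv for test_nne.py with a checkpoint to resume from."""
    return ["python", "test_nne.py", "--resume", checkpoint]


def run(cmd: List[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


# Example (commented out so that importing this file launches nothing):
# run(train_command([0, 1], "config.json"))
```

Passing the arguments as a list avoids shell quoting issues when a checkpoint path contains spaces or brackets.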
Results
Citation
@inproceedings{Buaphet-etal-2022-thai-nner,
title = "Thai Nested Named Entity Recognition Corpus",
author = "Buaphet, Weerayut and
Udomcharoenchaikit, Can and
Limkonchotiwat, Peerat and
Rutherford, Attapol and
Nutanong, Sarana",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
year = "2022",
publisher = "Association for Computational Linguistics",
}
License
CC-BY-SA 3.0
Acknowledgements
- Dataset information: The Thai N-NER corpus is supported in part by the Digital Economy Promotion Agency (depa) Digital Infrastructure Fund MP-62-003 and Siam Commercial Bank. This dataset is released as scb-nner-th-2022.
- Training code: Tensorflow-Project-Template by Mahmoud Gemy