November 2021
COMBO
COMBO is a language-independent NLP system for dependency parsing, part-of-speech tagging, lemmatisation, morphological analysis, and more. It is built on top of PyTorch and AllenNLP and provides automatically downloadable pre-trained models.
The system supports end-to-end morphosyntactic analysis and can be used for tasks such as:
- part-of-speech tagging,
- morphological analysis,
- lemmatisation,
- dependency parsing,
- enhanced Universal Dependencies parsing.
Quick Start
Install COMBO from the CLARIN-PL package index:
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.7
For Python 3.9, installing Cython may also be required:
pip install -U pip cython
Run predictions with a pre-trained model:
from combo.predict import COMBO
nlp = COMBO.from_pretrained("polish-herbert-base-ud29")
sentence = nlp("COVID-19 to ostra choroba zakaźna układu oddechowego wywołana zakażeniem wirusem SARS-CoV-2.")
Predictions are accessible as token attributes:
print("{:5} {:15} {:15} {:10} {:10} {:10}".format('ID', 'TOKEN', 'LEMMA', 'UPOS', 'HEAD', 'DEPREL'))
for token in sentence.tokens:
    print("{:5} {:15} {:15} {:10} {:10} {:10}".format(str(token.id), token.token, token.lemma, token.upostag, str(token.head), token.deprel))
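The same six attributes map directly onto columns of the CoNLL-U format, which can be handy for passing predictions on to other tools. The following is a minimal sketch of that mapping, not part of COMBO itself: `StandInToken`, `to_conllu_rows`, and `dependents_of` are hypothetical names introduced here, and the example values are written by hand rather than produced by a model. Any object exposing `id`, `token`, `lemma`, `upostag`, `head`, and `deprel` should work in place of the stand-in class.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StandInToken:
    """Hypothetical stand-in mirroring the attributes printed above; not COMBO's own class."""
    id: int
    token: str
    lemma: str
    upostag: str
    head: int
    deprel: str


def to_conllu_rows(tokens: Iterable[StandInToken]) -> List[str]:
    """Render tokens as 10-column CoNLL-U rows, using '_' for columns not filled here."""
    rows = []
    for t in tokens:
        # Column order: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        columns = [str(t.id), t.token, t.lemma, t.upostag, "_", "_",
                   str(t.head), t.deprel, "_", "_"]
        rows.append("\t".join(columns))
    return rows


def dependents_of(tokens: Iterable[StandInToken], head_id: int) -> List[str]:
    """Return the surface forms of all tokens whose head is `head_id` (0 is the root)."""
    return [t.token for t in tokens if t.head == head_id]


# Hand-written example values, not actual model output.
example = [
    StandInToken(1, "Ala", "Ala", "PROPN", 2, "nsubj"),
    StandInToken(2, "ma", "mieć", "VERB", 0, "root"),
    StandInToken(3, "kota", "kot", "NOUN", 2, "obj"),
]

rows = to_conllu_rows(example)
print("\n".join(rows))
print("Dependents of token 2:", dependents_of(example, 2))
```

Because the helpers only read attributes, `sentence.tokens` from the prediction step could presumably be passed in unchanged, assuming those attributes hold the values printed in the table above.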
Tutorial and Documentation
Citing
If you use COMBO in your research, please cite the following article:
Mateusz Klimaszewski and Alina Wróblewska. COMBO: State-of-the-Art Morphosyntactic Analysis. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2021, pp. 50-62. https://aclanthology.org/2021.emnlp-demo.7
@inproceedings{klimaszewski-wroblewska-2021-combo-state,
title = "{COMBO}: State-of-the-Art Morphosyntactic Analysis",
author = "Klimaszewski, Mateusz and
Wr{\'o}blewska, Alina",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-demo.7",
pages = "50--62",
abstract = "We introduce COMBO {--} a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features whilst also exposes their vector representations, extracted from hidden layers. COMBO is an easy to install Python package with automatically downloadable pre-trained models for over 40 languages. It maintains a balance between efficiency and quality. As it is an end-to-end system and its modules are jointly trained, its training is competitively fast. As its models are optimised for accuracy, they achieve often better prediction quality than SOTA. The COMBO library is available at: https://gitlab.clarin-pl.eu/syntactic-tools/combo.",
}
If you use the EUD module in your research, please cite the following article:
Mateusz Klimaszewski and Alina Wróblewska. COMBO: A New Module for EUD Parsing. Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies, 2021, pp. 158-166. https://aclanthology.org/2021.iwpt-1.16
@inproceedings{klimaszewski-wroblewska-2021-combo,
title = "{COMBO}: A New Module for {EUD} Parsing",
author = "Klimaszewski, Mateusz and
Wr{\'o}blewska, Alina",
booktitle = "Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.iwpt-1.16",
doi = "10.18653/v1/2021.iwpt-1.16",
pages = "158--166",
abstract = "We introduce the COMBO-based approach for EUD parsing and its implementation, which took part in the IWPT 2021 EUD shared task. The goal of this task is to parse raw texts in 17 languages into Enhanced Universal Dependencies (EUD). The proposed approach uses COMBO to predict UD trees and EUD graphs. These structures are then merged into the final EUD graphs. Some EUD edge labels are extended with case information using a single language-independent expansion rule. In the official evaluation, the solution ranked fourth, achieving an average ELAS of 83.79{\%}. The source code is available at https://gitlab.clarin-pl.eu/syntactic-tools/combo.",
}