December 2023
Corpus of Contemporary Polish (KWJP)
The Corpus of Contemporary Polish (KWJP) is a balanced and representative corpus of written Polish covering texts from 2011-2020.
The corpus is divided into three main genres:
- fiction: novels and short stories,
- non-fiction: books and thematic magazines,
- journalism: news media, including national and regional daily newspapers and weekly magazines.
Description
Annotation: lemmatization, morphosyntactic tags, dependency and constituency parses, named entities.
Responsible institution: Institute of Computer Science, Polish Academy of Sciences
Corpus size: 100 million segments in the balanced corpus and 1.43 billion segments in the full corpus.
Creation period: 2021-2023
Publication
When using KWJP, please cite the following article:
W. Kieraś, M. Marciniak, M. Łaziński, M. Woliński, K. Bojałkowska, W. Eźlakowski, Ł. Kobyliński, D. Komosińska, K. Krasnowska-Kieraś, M. Rudolf, A. Tomaszewska, J. Wołoszyn, N. Zawadzka-Paluektau. Korpus Współczesnego Języka Polskiego. Dekada 2011-2020. Język Polski, 2024. https://jezyk-polski.pl/index.php/jp/article/view/1062
@article{kieras:etal:2024:kwjp,
author = "Kieraś, W. and Marciniak, M. and Łaziński, M. and Woliński, M. and Bojałkowska, K. and Eźlakowski, W. and Kobyliński, Ł. and Komosińska, D. and Krasnowska-Kieraś, K. and Rudolf, M. and Tomaszewska, A. and Wołoszyn, J. and Zawadzka-Paluektau, N.",
title = "{K}orpus {W}spółczesnego {J}ęzyka {P}olskiego. {D}ekada 2011-2020",
journal = "Język Polski",
year = "2024",
doi = "10.31286/JP.001055",
url = "https://jezyk-polski.pl/index.php/jp/article/view/1062"
}