Corpora
Preprocessing steps
We made great efforts to collect and preprocess corpora of various sizes and domains. All the text data are preprocessed via the following steps (a sketch of the pipeline follows this list):
- Remove HTML and XML tags from the texts and convert the encoding to UTF-8. Digits and punctuation are retained.
- Convert traditional Chinese characters into simplified characters with Open Chinese Convert (OpenCC).
- Conduct Chinese word segmentation with HanLP (v_1.5.3).
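The three steps can be chained into a single pass over each file. Below is a minimal sketch, assuming the `opencc` and `pyhanlp` Python packages (`pyhanlp` wraps HanLP 1.x and requires a Java runtime); the regex-based tag stripping and the file name `raw.txt` are illustrative, not the exact scripts we used.

```python
import re
from opencc import OpenCC    # pip install opencc
from pyhanlp import HanLP    # pip install pyhanlp (needs a JDK)

converter = OpenCC('t2s')    # traditional -> simplified

def preprocess(raw: str) -> str:
    # Step 1: strip HTML/XML tags; digits and punctuation are kept.
    text = re.sub(r'<[^>]+>', ' ', raw)
    # Step 2: convert traditional Chinese characters to simplified ones.
    text = converter.convert(text)
    # Step 3: segment into words with HanLP, joined by single spaces.
    return ' '.join(term.word for term in HanLP.segment(text))

# Read the raw file as UTF-8 and print the preprocessed text.
with open('raw.txt', encoding='utf-8') as f:
    print(preprocess(f.read()))
```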
Corpora details
| Corpus | Size | Tokens | Vocabulary Size | Description |
| :-- | :-- | :-- | :-- | :-- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017), http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou Labs, http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | News data crawled from the Internet |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data from https://weibo.com/ |
| Literature 文学作品 | 0.93G | 177M | 702K | 8,599 modern Chinese literary works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | A large corpus built by merging the corpora above |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China |
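For reference, the Tokens and Vocabulary Size columns can be derived from a segmented corpus by counting running tokens and distinct word types. A minimal sketch, assuming one whitespace-segmented document per line; the file name `segmented_corpus.txt` is hypothetical:

```python
from collections import Counter

def corpus_stats(path: str) -> tuple[int, int]:
    # Tally word frequencies over a whitespace-segmented corpus.
    counts = Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())
    # Tokens = total running words; vocabulary = distinct word types.
    return sum(counts.values()), len(counts)

tokens, vocab = corpus_stats('segmented_corpus.txt')
print(f'Tokens: {tokens}, Vocabulary size: {vocab}')
```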