Corpora
Preprocessing steps
We made great efforts to collect and preprocess corpora of various sizes and domains. All the text data are preprocessed via the following steps (a sketch of the pipeline follows this list):
- Remove HTML and XML tags from the texts and convert the encoding to UTF-8. Digits and punctuation are retained.
- Convert traditional Chinese characters into simplified characters with Open Chinese Convert (OpenCC).
- Conduct Chinese word segmentation with HanLP (v_1.5.3).
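The three steps can be chained into a single pass over each file. Below is a minimal sketch, assuming the `opencc` and `pyhanlp` Python packages (`pyhanlp` wraps HanLP 1.x and requires a Java runtime); the regex-based tag stripping and the file name `raw.txt` are illustrative, not the exact scripts we used.

```python
import re
from opencc import OpenCC    # pip install opencc
from pyhanlp import HanLP    # pip install pyhanlp (needs a JDK)

converter = OpenCC('t2s')    # traditional -> simplified

def preprocess(raw: str) -> str:
    # Step 1: strip HTML/XML tags; digits and punctuation are kept.
    text = re.sub(r'<[^>]+>', ' ', raw)
    # Step 2: convert traditional Chinese characters to simplified ones.
    text = converter.convert(text)
    # Step 3: segment into words with HanLP, joined by single spaces.
    return ' '.join(term.word for term in HanLP.segment(text))

# Read the raw file as UTF-8 and print the preprocessed text.
with open('raw.txt', encoding='utf-8') as f:
    print(preprocess(f.read()))
```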
Corpora details
| Corpus | Size | Tokens | Vocabulary Size | Description |
| :-- | :-- | :-- | :-- | :-- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017), http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou Labs, http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | News data crawled from the Internet |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data from https://weibo.com/ |
| Literature 文学作品 | 0.93G | 177M | 702K | 8,599 modern Chinese literary works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | A large corpus built by merging the corpora above |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China |
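For reference, the Tokens and Vocabulary Size columns can be derived from a segmented corpus by counting running tokens and distinct word types. A minimal sketch, assuming one whitespace-segmented document per line; the file name `segmented_corpus.txt` is hypothetical:

```python
from collections import Counter

def corpus_stats(path: str) -> tuple[int, int]:
    # Tally word frequencies over a whitespace-segmented corpus.
    counts = Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())
    # Tokens = total running words; vocabulary = distinct word types.
    return sum(counts.values()), len(counts)

tokens, vocab = corpus_stats('segmented_corpus.txt')
print(f'Tokens: {tokens}, Vocabulary size: {vocab}')
```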