Pre-trained Chinese word vectors

Basic Settings

Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word

Various Domains

The Chinese word vectors below are trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)

Context features (columns) by corpus (rows):

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d | 300d | 300d |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 | 300d | 300d | 300d | 300d |

Positive Pointwise Mutual Information (PPMI)

Context features (columns) by corpus (rows):

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d | 300d | 300d |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 | 300d | 300d | 300d | 300d |

*Character embeddings are provided, since most Hanzi are words in archaic Chinese.
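Reading one of these files takes only a few lines. The sketch below assumes the common word2vec text layout (a header line with the vocabulary size and dimension, followed by one token and its space-separated values per line); that layout is an assumption here, so check it against the file you download. The `load_vectors` and `cosine` helpers and the 4-dimensional sample data are hypothetical, made up for illustration.

```python
import io
import math


def load_vectors(handle, limit=None):
    """Read vectors from a word2vec-style text stream (assumed layout)."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the declared vocabulary size
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors, dim


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tiny in-memory stand-in for a real 300d file (made-up values).
sample = io.StringIO(
    "3 4\n"
    "北京 0.1 0.2 0.3 0.4\n"
    "上海 0.1 0.2 0.3 0.5\n"
    "苹果 -0.4 0.1 0.0 -0.2\n"
)
vectors, dim = load_vectors(sample)
print(dim, len(vectors))
print(cosine(vectors["北京"], vectors["上海"]))
```

For a real file, replace `sample` with `open(path, encoding="utf-8")`; passing `limit` keeps memory use down when only the most frequent entries are needed.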

Various Co-occurrence Information

We release word vectors trained on different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of arbitrary linguistic units beyond words. For example, character vectors are found in the context vectors of the word-character setting.
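To make the target/context distinction concrete: SGNS keeps two embedding tables and scores a (target, context) pair with the sigmoid of the dot product of the two vectors. The sketch below uses made-up 3-dimensional values and hypothetical names (`target_vecs`, `context_vecs`, `sgns_score`); in the word-character setting, the context table is the one keyed by characters.

```python
import math

# Made-up toy values; the released files hold 300 dimensions.
target_vecs = {"北京": [0.5, 0.1, -0.2]}  # target (input) table, keyed by words
context_vecs = {                          # context (output) table, keyed by characters
    "北": [0.4, 0.0, -0.1],
    "京": [0.3, 0.2, -0.3],
    "果": [-0.5, 0.1, 0.4],
}


def sgns_score(word, char):
    """Sigmoid of the dot product between a target and a context vector."""
    dot = sum(a * b for a, b in zip(target_vecs[word], context_vecs[char]))
    return 1.0 / (1.0 + math.exp(-dot))


for char in context_vecs:
    print(char, round(sgns_score("北京", char), 3))
```

Character vectors can therefore be used on their own simply by loading the context file of a word-character model and ignoring the target file.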

All vectors are trained by SGNS on Baidu Encyclopedia.

| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
|---|---|---|---|
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| | Word → Ngram (1-3) | 300d | 300d |
| | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| | Word → Character (1-2) | 300d | 300d |
| | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| | Word → Dependency | 300d | 300d |
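
As a rough illustration of how a co-occurrence type such as Word → Character (1-2) could turn a segmented sentence into training pairs, the sketch below reads "(1-2)" as character n-grams of length 1 to 2 taken from the words inside the context window. That reading, the window size, and the helper names `char_ngrams` and `word_char_pairs` are all assumptions for illustration, not the procedure used to build the released vectors.

```python
def char_ngrams(word, min_n=1, max_n=2):
    """All character n-grams of a word with length between min_n and max_n."""
    return [
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    ]


def word_char_pairs(tokens, window=1, min_n=1, max_n=2):
    """(target word, context character n-gram) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not treated as its own context here
            for gram in char_ngrams(tokens[j], min_n, max_n):
                pairs.append((target, gram))
    return pairs


tokens = ["我", "喜欢", "中文"]  # an already segmented sentence
for pair in word_char_pairs(tokens):
    print(pair)
```

Under this reading, the context vocabulary contains characters and character bigrams, which is why the context file of such a model can serve as a source of character-level vectors.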