Pre-trained Chinese word vectors
Basic Settings
Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word |
5 | Yes | 1e-5 | 10 |
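The settings above can be read as ordinary word2vec-style preprocessing. The following is a minimal sketch of that reading, not the authors' training code. It assumes the sub-sampling rule from the word2vec paper (discard a word with probability 1 - sqrt(t / f)), a dynamic window drawn uniformly from 1 to the maximum size, and that "Low-Frequency Word" means words with fewer than 10 occurrences are dropped. All function names are hypothetical.

```python
import math
import random

WINDOW_SIZE = 5      # maximum context window (from the table above)
SUBSAMPLE_T = 1e-5   # sub-sampling threshold (from the table above)
MIN_COUNT = 10       # low-frequency cutoff (assumed: counts below this are dropped)


def keep_probability(freq, t=SUBSAMPLE_T):
    """Probability of keeping a word whose relative corpus frequency is `freq`.

    Assumes the word2vec paper's rule: discard with probability 1 - sqrt(t / freq).
    """
    return min(1.0, math.sqrt(t / freq))


def dynamic_window(rng, max_window=WINDOW_SIZE):
    """Sample an effective window size uniformly from 1..max_window."""
    return rng.randint(1, max_window)


def filter_vocab(counts, min_count=MIN_COUNT):
    """Drop words that occur fewer than `min_count` times."""
    return {word: n for word, n in counts.items() if n >= min_count}


rng = random.Random(0)
vocab = filter_vocab({"的": 1000, "北京": 25, "罕见词": 3})
```

Under these assumptions, a word that makes up 0.1% of the corpus is kept about 10% of the time, while words at or below the 1e-5 threshold are always kept.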
Various Domains
Chinese word vectors trained with different representations (SGNS and PPMI), context features (word, ngram, character), and corpora.
Word2vec / Skip-Gram with Negative Sampling (SGNS), by corpus and context features
Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d |
Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
Financial News 金融新闻 | 300d | 300d | 300d | 300d |
Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
Weibo 微博 | 300d | 300d | 300d | 300d |
Literature 文学作品 | 300d | 300d | 300d | 300d |
Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
Mixed-large 综合 | 300d | 300d | 300d | 300d |
Positive Pointwise Mutual Information (PPMI), by corpus and context features
Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d |
Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
Financial News 金融新闻 | 300d | 300d | 300d | 300d |
Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
Weibo 微博 | 300d | 300d | 300d | 300d |
Literature 文学作品 | 300d | 300d | 300d | 300d |
Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
Mixed-large 综合 | 300d | 300d | 300d | 300d |
*Character embeddings are provided for this corpus, since most Hanzi are words in their own right in archaic Chinese.
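For readers unfamiliar with the PPMI representation in the second table, the sketch below computes the textbook definition, PPMI(w, c) = max(0, log(P(w, c) / (P(w) P(c)))), from raw pair counts. It is an illustration only: the released vectors may use a smoothed or otherwise modified variant that this page does not specify, and the toy counts and the `ppmi` function name are made up for the example.

```python
import math
from collections import Counter


def ppmi(pair_counts):
    """Map each (word, context) pair to max(0, log(P(w, c) / (P(w) * P(c))))."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n

    scores = {}
    for (word, context), n in pair_counts.items():
        # P(w, c) / (P(w) P(c)) simplifies to n * total / (count(w) * count(c)).
        pmi = math.log(n * total / (word_totals[word] * context_totals[context]))
        scores[(word, context)] = max(0.0, pmi)
    return scores


# Toy co-occurrence counts (hypothetical, not taken from any released corpus).
toy_counts = {
    ("北京", "首都"): 8,
    ("北京", "的"): 2,
    ("苹果", "的"): 8,
    ("苹果", "首都"): 2,
}
scores = ppmi(toy_counts)
```

Pairs that co-occur less often than chance would predict receive a score of zero, which is why PPMI matrices are sparse.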
Various Co-occurrence Information
We release word vectors trained on different co-occurrence statistics. Target and context vectors are called input and output vectors in some related papers.
In this part, one can obtain vectors for linguistic units other than words. For example, character vectors are contained in the context vectors of the word-character setting.
All vectors are trained by SGNS on Baidu Encyclopedia.
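To make the idea of a co-occurrence type concrete, here is a minimal sketch of how Word → Character pairs might be collected from a segmented sentence. It assumes that "Character (1-n)" means the character n-grams of length 1 to n inside each context word, which is one plausible reading and not confirmed by this page. The function names and the fixed (non-dynamic) window are simplifications for illustration.

```python
def char_ngrams(word, max_n):
    """All character n-grams of `word` with lengths 1..max_n."""
    return [
        word[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(word) - n + 1)
    ]


def word_character_pairs(tokens, window=5, max_n=1):
    """Collect (target word, context character n-gram) pairs.

    Each target word is paired with the character n-grams of every other
    word that lies within `window` positions of it.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            for gram in char_ngrams(tokens[j], max_n):
                pairs.append((target, gram))
    return pairs


pairs = word_character_pairs(["我", "喜欢", "北京"], window=1, max_n=1)
```

In such a setup the target side holds words and the context side holds characters, which matches the statement above that character vectors are found among the context vectors.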
Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
Word | Word → Word | 300d | 300d |
Ngram | Word → Ngram (1-2) | 300d | 300d |
Ngram | Word → Ngram (1-3) | 300d | 300d |
Ngram | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
Character | Word → Character (1) | 300d | 300d |
Character | Word → Character (1-2) | 300d | 300d |
Character | Word → Character (1-4) | 300d | 300d |
Radical | Radical | 300d | 300d |
Position | Word → Word (left/right) | 300d | 300d |
Position | Word → Word (distance) | 300d | 300d |
Global | Word → Text | 300d | 300d |
Syntactic Feature | Word → POS | 300d | 300d |
Syntactic Feature | Word → Dependency | 300d | 300d |
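Once a target or context vector file is downloaded, it can be read and queried along the lines of the sketch below. The file format is an assumption, since this page does not state it: the code expects the common word2vec text layout, with an optional header line giving the vocabulary size and dimension, followed by one token and its space-separated values per line. The toy data uses 3 dimensions instead of 300, and `load_vectors` and `cosine` are hypothetical helper names.

```python
import io
import math


def load_vectors(handle):
    """Read vectors from a text stream.

    Assumed layout: optional "count dim" header, then "token v1 ... vd" lines.
    """
    vectors = {}
    for line_no, line in enumerate(handle):
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2:
            continue  # skip the header line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a downloaded file (values are made up).
toy_file = io.StringIO(
    "3 3\n"
    "北京 1.0 0.0 0.0\n"
    "上海 0.8 0.6 0.0\n"
    "苹果 0.0 0.0 1.0\n"
)
vectors = load_vectors(toy_file)
similarity = cosine(vectors["北京"], vectors["上海"])
```

For a real file, the same function could be called on a handle opened with `open(path, encoding="utf-8")`, assuming the file is UTF-8 text in the layout described above.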