Welcome to Chinese vector service (CVS)
This project provides a 'one-stop' Chinese vector service (CVS). It consists of the following components:
Corpora. It is well known that the properties of word embeddings are closely related to their training corpora. We made a considerable effort to collect and preprocess a large amount of Chinese corpora of various sizes and domains.
Toolkit. The Ngram2vec toolkit is used for training vectors. Ngram2vec supports arbitrary features and models, and allows users to improve existing word representation methods with little effort.
Pre-trained vectors. We release Chinese vectors trained with different representations (dense and sparse), features, and corpora, to meet different user requirements.
Evaluation. So far, two Chinese analogy datasets, CA-translated and CA8, are used as intrinsic benchmarks for evaluating Chinese word vectors. A sequence tagging dataset and a text classification dataset are used as extrinsic benchmarks.
We also open-source an evaluation toolkit for the analogy task, which ensures a fair comparison of different word vectors.
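To make the analogy task concrete, here is a minimal sketch of how a question "a is to b as c is to ?" can be scored against dense vectors. It is not the project's evaluation toolkit: the function names are hypothetical, the scoring rule is the commonly used 3CosAdd method (an assumption about what a reader would want to try first), and the four-dimensional toy vectors are made up purely for illustration.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale every row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return matrix / norms


def solve_analogy(vocab, vectors, a, b, c):
    """3CosAdd: return the word closest to (b - a + c), excluding a, b and c."""
    index = {word: i for i, word in enumerate(vocab)}
    unit = normalize_rows(vectors)
    target = unit[index[b]] - unit[index[a]] + unit[index[c]]
    scores = unit @ target
    for word in (a, b, c):
        scores[index[word]] = -np.inf  # question words may not be the answer
    return vocab[int(np.argmax(scores))]


def analogy_accuracy(vocab, vectors, questions):
    """Accuracy over questions whose four words are all in the vocabulary."""
    known = set(vocab)
    covered = [q for q in questions if all(word in known for word in q)]
    if not covered:
        return 0.0
    correct = sum(
        solve_analogy(vocab, vectors, a, b, c) == d for a, b, c, d in covered
    )
    return correct / len(covered)


# Toy data (made up): king, queen, man, woman, apple.
vocab = ["国王", "王后", "男人", "女人", "苹果"]
vectors = np.array(
    [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
)
questions = [
    ("男人", "女人", "国王", "王后"),  # man : woman :: king : queen
    ("男人", "女人", "皇帝", "皇后"),  # out-of-vocabulary words, so it is skipped
]

answer = solve_analogy(vocab, vectors, "男人", "女人", "国王")
accuracy = analogy_accuracy(vocab, vectors, questions)
print(answer, accuracy)
```

Skipping questions with out-of-vocabulary words is one possible convention. Whichever convention is chosen, applying it identically to every set of vectors is what keeps a comparison fair.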