Welcome to Chinese vector service (CVS)
This project provides a 'one-stop' Chinese vector service (CVS). It consists of the following components:
- Corpora. It is well known that the properties of a word embedding are closely related to its training corpus. We made a considerable effort to collect and preprocess a large amount of Chinese corpora of various sizes and domains.
- Toolkit. The Ngram2vec toolkit is used for training vectors. Ngram2vec supports arbitrary features and models, and allows users to improve existing word representation methods with minimal effort.
- Pre-trained vectors. We release Chinese vectors trained with different representations (dense and sparse), features, and corpora to meet different user requirements.
- Evaluation. So far, two Chinese analogy datasets, CA-translated and CA8, are used as intrinsic benchmarks for evaluating Chinese word vectors. A sequence tagging dataset and a text classification dataset are used as extrinsic benchmarks. We also open-source an evaluation toolkit for the analogy task, which ensures a fair comparison of different word vectors.
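
To make the analogy task concrete, here is a minimal sketch of how an analogy question "a is to b as c is to ?" can be scored against a set of word vectors. It is not the evaluation toolkit released with this project: it assumes the common 3CosAdd rule (pick the word closest to b - a + c, excluding the three query words), the function names `normalize`, `answer_analogy`, and `analogy_accuracy` are hypothetical, and the toy vocabulary and vectors are made up purely for illustration.

```python
import numpy as np


def normalize(matrix):
    """Scale each row to unit length so dot products rank like cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)


def answer_analogy(vocab, vectors, a, b, c):
    """Answer 'a : b :: c : ?' with 3CosAdd (an assumed, commonly used rule)."""
    index = {word: i for i, word in enumerate(vocab)}
    unit = normalize(np.asarray(vectors, dtype=float))
    target = unit[index[b]] - unit[index[a]] + unit[index[c]]
    scores = unit @ target
    # The query words themselves are not allowed as answers.
    for word in (a, b, c):
        scores[index[word]] = -np.inf
    return vocab[int(np.argmax(scores))]


def analogy_accuracy(vocab, vectors, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped here;
    that is a choice made for this sketch.
    """
    known = set(vocab)
    covered = [q for q in questions if all(word in known for word in q)]
    if not covered:
        return 0.0
    correct = sum(
        answer_analogy(vocab, vectors, a, b, c) == d for a, b, c, d in covered
    )
    return correct / len(covered)


# Toy data, invented for illustration only (not taken from CA-translated or CA8).
toy_vocab = ["男人", "女人", "国王", "王后", "苹果"]
toy_vectors = np.array([
    [1.0, 0.0, 0.0],     # 男人 (man)
    [1.0, 1.0, 0.0],     # 女人 (woman)
    [1.0, 0.0, 1.0],     # 国王 (king)
    [1.0, 1.0, 1.0],     # 王后 (queen)
    [-1.0, -1.0, -1.0],  # 苹果 (apple), an unrelated distractor
])

print(answer_analogy(toy_vocab, toy_vectors, "男人", "女人", "国王"))  # 王后
```

Using one fixed scoring rule and one fixed policy for out-of-vocabulary questions across all vector sets is what makes the resulting accuracies comparable.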