Posts in tag

CCMatrix


CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year. Gathering a data set of this size …