CCMatrix: A Billion-scale Bitext Data Set For Training Translation Models
CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the…