Digitizing Chinese classics is challenging because ancient Chinese characters are complex: throughout history, a single character might have several variants and written forms. Digitizing ancient Chinese books through optical character recognition (OCR) not only facilitates machine reading but also gives new life to numerous ancient books by opening them to public perusal.
Alibaba DAMO Academy (DAMO), the global research institute of Alibaba, has started a new project to digitize Chinese classics together with the Alibaba Foundation, the Library of the University of California, Berkeley, Sichuan University, the National Library of China, and Zhejiang Library. The program aims to digitize and aggregate ancient Chinese books and convert scanned images into text for open access. This way, libraries in China and abroad can work together to make their ancient Chinese books freely available to the world.
Jeff Zhang, Head of Alibaba DAMO Academy, said: “Alibaba will continue to invest in resources and cutting-edge technology to support such projects. Making ancient books available to the public is in line with our values and belief in ‘Tech for Change’. We believe that technology can play a critical role in preserving precious cultural relics and heritage, and we look forward to working with libraries in China and abroad to make this happen.”
The first batch of Chinese classics in this joint effort comes from the C.V. Starr East Asian Library of the University of California, Berkeley, one of the largest academic libraries with rich holdings of ancient Chinese books. Now on display are 200,000 digital pages of ancient books, including woodblock-printed books and manuscripts from the Song Dynasty and Yuan Dynasty, a period in ancient China dating back more than 1,000 years. Other materials include digital pages of an original volume of Siku Quanshu (四库全书, The Complete Works of Chinese Classics) from the Qing Dynasty.
UC Berkeley Library provided scanned pages and metadata, while DAMO used optical character recognition (OCR) to turn the scanned images into text. Furthermore, DAMO teamed up with scholars at Sichuan University to develop an AI model that supports single-character indexing and automatic character grouping, drawing on machine learning techniques such as self-supervised learning and few-shot learning. The model recognizes ancient characters with an accuracy rate of 97.5%. It can now recognize 30,000 ancient Chinese characters and works at thirty times the speed of human reading.
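For readers curious what a character-level accuracy figure can mean in practice, here is a minimal Python sketch. The announcement does not say how the 97.5% rate was measured, so this assumes one common convention: one minus the character edit distance divided by the length of the reference transcription. The function names and the sample strings are hypothetical and are not part of DAMO's system.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - (edit distance / reference length), floored at 0.

    Assumed metric for illustration only; the source does not define its own.
    """
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: an 8-character reference with one character
# transcribed as a variant form (黄 instead of 黃).
reference = "天地玄黃宇宙洪荒"
ocr_output = "天地玄黄宇宙洪荒"
accuracy = character_accuracy(reference, ocr_output)
print(f"{accuracy:.3f}")  # prints 0.875
```

The example also shows why variant forms matter: a strict character comparison counts a variant as an error, so a system that groups variants of the same character would score such a line differently.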
“Alibaba will make this AI system for the machine-reading of Chinese ancient books available to the public soon,” Jeff added.