Datatang adds 5,000 Traditional Chinese characters to OCR system

March 16, 2023

Beijing-based artificial intelligence (AI) company Datatang has updated its optical character recognition (OCR) database to include 5,000 handwritten characters in Traditional Chinese.

In a dedicated webpage for the new set, Datatang said the characters were collected by various samples written on A4 paper, square paper, and lined paper, among others.

By adding the characters to its software suite, Datatang enables customers to use OCR of the corresponding traditional Chinese characters when encountering them in the wild. In other words, by scanning a text through a smartphone and the Datatang app, users will now be able to automate data entry and filling out forms.

OCR is sometimes implemented for document scanning in digital identity verification and onboarding applications.

According to the company, the error bound of each vertex of the quadrilateral bounding box around each character is within five pixels, for a qualified annotation. The accuracy of bounding boxes and text transcription accuracy are both reportedly not less than 97 percent.

The addition of the new dataset comes months after Datatang executives said their speech recognition datasets were created with native language speakers and surpassed the industry’s standards.

More recently, the company showcased its synthetic data generation technology at the 2022 Conference on Computer Vision and Pattern Recognition (CVPR 2022).