This paper introduces an effective character extraction algorithm for use in optical character recognition (OCR). Using both geometrical and colour information, the algorithm extracts text from colour document images that contain mixed text and pictures. It consists of three components: adaptive k-means clustering, binary morphological processing, and shape and space-related refinement. Used as a plug-in pre-processing stage, the algorithm improves the performance of an OCR system. Character recognition experiments were carried out with a commercial OCR package, and they show that the algorithm raises the character recognition rate on complex documents from 73.1% to 95.5% on average.
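The three-stage structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a fixed number of clusters (plain k-means rather than the adaptive variant), assumes that text belongs to the darkest colour cluster, uses a 3x3 morphological closing, and reduces the refinement stage to a shape-only filter on component area and bounding-box aspect ratio. All function names, parameters, and thresholds (`extract_text_mask`, `min_area`, `max_aspect`, and so on) are hypothetical.

```python
from collections import deque

def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_colours(pixels, k=2, iters=10):
    """Plain (non-adaptive) k-means over RGB tuples; returns (labels, centres)."""
    distinct = sorted(set(pixels))
    # Deterministic start: centres spread evenly over the sorted distinct colours.
    centres = [distinct[i * (len(distinct) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(pixels)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: _dist2(p, centres[c])) for p in pixels]
        for c in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return labels, centres

def _morph(mask, combine):
    """3x3 neighbourhood operation; neighbours outside the image are ignored."""
    h, w = len(mask), len(mask[0])
    return [[int(combine(mask[yy][xx]
                         for yy in range(max(y - 1, 0), min(y + 2, h))
                         for xx in range(max(x - 1, 0), min(x + 2, w))))
             for x in range(w)] for y in range(h)]

def closing(mask):
    """Binary closing: dilation (any) followed by erosion (all)."""
    return _morph(_morph(mask, any), all)

def components(mask):
    """4-connected components of a binary mask, each as a list of (y, x) points."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, pts = deque([(y, x)]), []
                while queue:
                    cy, cx = queue.popleft()
                    pts.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                found.append(pts)
    return found

def extract_text_mask(image, k=2, min_area=3, max_aspect=5.0):
    """image: 2D list of RGB tuples. Returns a binary mask of candidate text pixels."""
    h, w = len(image), len(image[0])
    labels, centres = kmeans_colours([p for row in image for p in row], k)
    # Assumption: the darkest cluster is text.
    text = min(range(k), key=lambda c: sum(centres[c]))
    mask = closing([[int(labels[y * w + x] == text) for x in range(w)] for y in range(h)])
    out = [[0] * w for _ in range(h)]
    for pts in components(mask):
        ys, xs = [p[0] for p in pts], [p[1] for p in pts]
        bh, bw = max(ys) - min(ys) + 1, max(xs) - min(xs) + 1
        # Shape-only refinement: drop specks and very elongated regions.
        if len(pts) >= min_area and max(bh, bw) / min(bh, bw) <= max_aspect:
            for y, x in pts:
                out[y][x] = 1
    return out
```

In a real pre-processing stage the resulting mask would be used to render a clean binary image for the OCR engine. The spacing-based part of the refinement, which this sketch omits, would additionally compare each component with its neighbours.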
Asia-Pacific Workshop Visual Information Processing. Proceedings of the Asia-Pacific Workshop Visual Information Processing (Beijing, China, 7-9 November 2006), pp. 219-223.