Automatic page frame registration of digitized text images using connected components - Marco Klindt

Illustration of the setup

While more and more documents are being stored, transmitted, and used only in digital form, old books and other printed materials have to be digitized, either for archival purposes or for use in further processing applications. During image acquisition, whether by flatbed scanner or by digital camera, artefacts such as noise, borders, skew, perspective distortion, or warping may be introduced, all of which can diminish the usability of the digital copies. This thesis discusses a framework to deal with these artefacts and reconstruct the aligned text region of a single page. The input is adaptively thresholded into a binary representation, and connected component labeling is employed as a bottom-up method to extract entities. These entities serve as input to algorithms that determine the classes of distortion present in the image, detect the global skew angle, and, if applicable, estimate distortion parameters for flattening the page image onto a plane representing a sheet of paper. Using these parameters, the text image is finally skew-corrected, flattened, cropped, and saved as the output image.
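
The first two stages described above, adaptive thresholding and connected component labeling, can be illustrated with a minimal pure-Python sketch. It is not the implementation from the thesis: the function names `adaptive_threshold` and `label_components`, the local-mean threshold rule, the `radius` and `offset` parameters, and the choice of 8-connectivity are all assumptions made for illustration.

```python
from collections import deque


def adaptive_threshold(gray, radius=1, offset=10):
    """Mark a pixel as ink (1) if it is darker than its local mean minus offset.

    `gray` is a list of rows of intensities (0 = black, 255 = white).
    The local-mean rule is an assumed, simple form of adaptive thresholding.
    """
    h, w = len(gray), len(gray[0])
    binary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image border.
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(window) / len(window)
            binary[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return binary


def label_components(binary):
    """Return bounding boxes (x0, y0, x1, y1) of 8-connected ink regions."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

In a pipeline of this kind, the bounding boxes returned by `label_components` would be the entities handed to later stages, for example to estimate the global skew angle from the alignment of character-sized boxes.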

