Image Processing of Historical Documents
The ProHist project aims to develop algorithms and techniques for preserving and publishing images of historical documents. For preservation purposes, we suggest storing the images in JPEG format with 1% loss, which gives a better trade-off between quality and storage space. To make the document contents more easily accessible, several techniques are being developed in the following areas:
- Image thresholding: the conversion of the document image into black and white while preserving the contents of the document. This is not a simple task because of smears, smudges and other noisy elements present in the document and, consequently, in its digital version. Several algorithms were developed to work with this kind of image, such as:
- NEVES, Renata Freire de Paiva; MELLO, C. A. B. "A Local Thresholding Algorithm for Images of Handwritten Historical Documents". IEEE International Conference on Systems, Man, and Cybernetics, 2011, Anchorage, p. 2934-2939.
- MELLO, C. A. B. "Segmentation of Images of Stained Papers Based on Distance Perception". IEEE Systems, Man and Cybernetics (SMC), 2010, Istanbul, p. 1636-1642.
- MELLO, C. A. B. "A Visual Perception Approach to Segment Images of Historical Documents". International Workshop on Document Analysis Systems, 2010, Boston.
- MELLO, C. A. B. ; OLIVEIRA, Adriano Lorena Inácio de ; SANCHEZ, Ángel. "Historical Document Image Binarization". International Conference on Computer Vision Theory and Applications, 2008, Funchal, Portugal, v. 1. p. 108-113.
- MELLO, C. A. B.; SANCHEZ, Ángel; OLIVEIRA, Adriano Lorena Inácio de; LOPES, Alberto. "An Efficient Gray-Level Thresholding Algorithm for Historic Document Images". Journal of Cultural Heritage, v. 9, p. 109-116, 2008.
- MELLO, C. A. B.; Schuler, L.A. "Thresholding Images of Historical Documents Using a Tsallis-Entropy Based Algorithm". Journal of Software, v. 3, p. 29-36, 2008
- MELLO, C. A. B. "New Tsallis Entropy-Based Thresholding Algorithm for Images of Historical Documents". ACM Document Engineering, 2007, Winnipeg, p. 32-34.
- MELLO, C. A. B.; Schuler, L.A. "Tsallis Entropy-Based Thresholding Algorithm for Images of Historical Documents". IEEE International Conference on Systems, Man and Cybernetics, 2007, Montreal, p. 1112-1117.
- MELLO, C. A. B. "Image Thresholding of Historical Documents based on Tsallis Entropy". XXXIII CLEI Conferência Latinoamericana de Informática, 2007, San José.
- MELLO, C. A. B. "An Algorithm for Foreground-Background Separation in Low Quality Patrimonial Document Images". 12th Iberoamerican Congress on Pattern Recognition, 2007, Valparaiso. Lecture Notes in Computer Science, p. 911-920.
- MELLO, C. A. B. ; LINS, R. D. "Image Segmentation of Historical Documents". Visual2000, 2000, Cidade do México, México.
- Segmentation: a central part of several aspects of document image processing. Currently we are working on the segmentation of overlapped cursive digits and on line and word segmentation, as described in:
- SANCHEZ, Ángel; MELLO, C. A. B.; SUAREZ, P.; LOPES, Alberto. "Automatic line and word segmentation applied to densely line-skewed historical handwritten document images". Integrated Computer-Aided Engineering, v. 18, p. 125-142, 2011.
- ROE, E. ; MELLO, C. A. B. "Simulating Inertial and Centripetal Forces for Segmentation of Overlapped Handwritten Digits". IEEE Systems, Man and Cybernetics (SMC), 2009, San Antonio, p. 149-153.
- MELLO, C. A. B. ; ROE, E. ; LACERDA, E. B. "Segmentation of Overlapping Cursive Handwritten Digits". ACM Document Engineering, 2008, São Paulo.
- SANCHEZ, Ángel ; MELLO, C. A. B. ; SUAREZ, P. ; OLIVEIRA, Adriano Lorena Inácio de ; ALVES, Victor Medeiros Outtes. "Text Line Segmentation in Images of Handwritten Historical Documents". IEEE International Workshops on Image Processing Theory, Tools and Applications, 2008, Sousse.
- Skew Estimation: here we estimate the skew angle of the document as a whole and of the writing itself. For the latter, we have an algorithm for multiple skew estimation, which handles handwritten documents where the angle of the writing changes from one line to the next:
- Mascaro, Angélica A.; Cavalcanti, George D.C.; MELLO, C. A. B. "Fast and robust skew estimation of scanned documents through background area information". Pattern Recognition Letters, v. 31, p. 1403-1411, 2010.
- MELLO, C. A. B.; SANCHEZ, Ángel; CAVALCANTI, G. D. C. "Multiple Line Skew Estimation of Handwritten Images of Documents Based on a Visual Perception Approach". 14th International Conference on Computer Analysis of Images and Patterns, 2011, Seville. Lecture Notes in Computer Science, v. 6855, p. 138-145.
- Optical Character Recognition: the conversion of the document image to a text file with full preservation of the contents of the document. The main problem here is that the archive contains handwritten documents, so the conversion is not easily done. We also deal with several issues of document segmentation, such as line segmentation, word extraction and segmentation of overlapping symbols:
- NEVES, Renata Freire de Paiva; LOPES, Alberto; MELLO, C. A. B.; ZANCHETTIN, C. "A SVM Based Off-Line Handwritten Digit Recognizer". IEEE International Conference on Systems, Man, and Cybernetics, 2011, Anchorage, p. 510-515.
- OLIVEIRA, Adriano Lorena Inácio de ; MELLO, C. A. B. ; MEDEIROS, Victor Outtes Alves ; SILVA JR, Elias Rodrigues da. "Optical Digit Recognition for Images of Handwritten Historical Documents". Simpósio Brasileiro de Redes Neurais, 2006, Ribeirão Preto, v. 9. p. 29.
- Synthesis of the Documents: a synthetic image of a document can be created and used as a preview of the original document:
- MELLO, C. A. B. "Synthesis of Images of Historical Documents for Web Visualization". IEEE International Multi-Media Modelling Conference, 2004, Brisbane, p. 220-226.
- MELLO, C. A. B.; CAVALCANTI, C. S.; A.M.CARVALHO, C. "Using Neural Networks for Colorizing Paper Texture of Images of Historical Documents". Simpósio Brasileiro de Redes Neurais, 2004, São Luís
- MELLO, C. A. B.; CAVALCANTI, C. S.; A.M.CARVALHO, C. "Colorizing Paper Texture of Green Scale Images of Historical Documents". IASTED Visualization, Imaging and Image Processing - VIIP2004, 2004, Marbella
- MELLO, C. A. B.; LINS, R. D. "Generation of Images of Historical Documents by Composition of their Components". Vision Interface 2002, Calgary
- MELLO, C. A. B. ; LINS, R. D. "Generation of Images of Historical Documents by Composition". ACM Symposium on Document Engineering, 2002, McLean, VA.
- MELLO, C. A. B. ; LINS, R. D. "Generating Paper Texture with Statistical Moments". IEEE International Conference on Acoustic, Speech and Signal Processing, 2000, Istanbul.
- Digital library: currently, we are grouping all the techniques developed in the project in a single environment: a digital library for image processing of historical documents based on FEDORA and Islandora:
- MELLO, C. A. B. ; OLIVEIRA, Adriano Lorena Inácio de ; SANCHEZ, Ángel. "PROHIST: An Environment for Image Processing of Historical Documents". Webmedia 2007 - Workshop de Bibliotecas Digitais, 2007, Gramado, p. 143-147.
- Document Image Retrieval: we are beginning a new study on the retrieval of images of historical documents.
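To make the thresholding step concrete, here is a minimal sketch of a global binarization baseline. It uses Otsu's classical method, not the Tsallis-entropy or perception-based algorithms from the papers listed above, and it assumes an 8-bit grayscale image held in a NumPy array. The function names `otsu_threshold` and `binarize` are hypothetical and chosen only for illustration.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level (0-255) that maximizes the between-class variance.

    Assumes `gray` is an 8-bit (uint8) grayscale image.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    omega = np.cumsum(prob)            # probability of the class "levels <= t"
    mu = np.cumsum(prob * levels)      # cumulative mean up to level t
    mu_total = mu[-1]                  # mean gray level of the whole image

    # Between-class variance; thresholds that leave one class empty stay at 0.
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


def binarize(gray: np.ndarray, t: int) -> np.ndarray:
    """Map pixels at or below t to ink (0) and all others to paper (255)."""
    return np.where(gray <= t, 0, 255).astype(np.uint8)


# Synthetic example: light paper (200) with a dark ink stroke (50).
page = np.full((4, 4), 200, dtype=np.uint8)
page[1, :] = 50
threshold = otsu_threshold(page)
bilevel = binarize(page, threshold)
```

A single global threshold like this tends to fail on stained or unevenly faded paper, which is the situation the local and entropy-based algorithms cited above are designed for.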
The digital library (under development using FEDORA) will have the following features:
- Thresholding of the original color images;
- Line and word segmentation, together with estimation of the multiple inclinations of the lines, is applied to the bi-level images;
- The paper of the documents is used for feature extraction and further synthesis of the paper without the ink;
- Document image retrieval techniques will be available to search for specific documents in the database.
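As an illustration of line segmentation on bi-level images, the sketch below splits a page into text lines with a horizontal projection profile. This is a textbook baseline that assumes nearly horizontal lines and ink encoded as 0. It is not the method of the papers cited above, which target densely skewed handwriting. The function name `segment_lines` and the parameter `min_ink` are hypothetical.

```python
import numpy as np


def segment_lines(binary: np.ndarray, min_ink: int = 1) -> list:
    """Return (top, bottom) row ranges of text lines, with bottom exclusive.

    Assumes `binary` is a bi-level image where ink pixels are 0.
    A row belongs to a text line if it has at least `min_ink` ink pixels.
    """
    ink_per_row = np.sum(binary == 0, axis=1)   # horizontal projection profile
    is_text = ink_per_row >= min_ink

    lines = []
    start = None
    for row, has_ink in enumerate(is_text):
        if has_ink and start is None:
            start = row                          # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, row))           # the text line ends
            start = None
    if start is not None:                        # line touching the bottom edge
        lines.append((start, len(is_text)))
    return lines


# Synthetic page: two "text lines" separated by blank rows.
page = np.full((10, 8), 255, dtype=np.uint8)
page[2:4, 1:7] = 0
page[6:8, 2:6] = 0
line_ranges = segment_lines(page)
print(line_ranges)  # [(2, 4), (6, 8)]
```

When the writing angle changes from line to line, a plain projection profile merges neighboring lines, which is why skew estimation is performed alongside segmentation.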
A screenshot of the digital library can be seen below:
Currently, our archive is composed of four main classes of documents: historical documents from the end of the 19th century and the beginning of the 20th century; old newspaper pages, some of them from the 18th century; documents from the Polytechnic School of Pernambuco from the beginning of the 20th century; and large-scale maps and architectural plans.
Related Projects
- 2008 - 2010: PROHIST - Processamento de Imagens de Documentos Históricos (CNPq 470541/2008-3 Universal 2008)
- 2007 - 2007: Reconocimiento de Caracteres Manuscritos en Documentos Históricos (http://www.boe.es/boe/dias/2007/01/10/pdfs/A01206-01242.pdf - Funded by: Ministerio de Asuntos Exteriores y de Cooperación)
- 2006 - 2006: Preservación de Documentos Históricos mediante Técnicas de Tratamiento Digital de Imágenes (Funded by: Ministerio de Asuntos Exteriores y de Cooperación)
- 2003 - 2005: Processamento de Imagens de Documentos Históricos (Projeto CNPq PDPG-TI 55.0017/2003-8)
More information can be found in our published papers.
Dr. Carlos Alexandre Barros de Mello