Content extraction from PDF invoices on business document archives

Bandara RMCV

URI: http://dl.lib.uom.lk/handle/123/16780

Abstract:

organization better control over their information processes. As a business expands, it produces more documents, which must be carefully handled and tracked to be put to good use. Output management systems that work with ERP systems contain thousands of business documents, and Portable Document Format (PDF) is the common output format for these documents. These systems need to execute document search operations frequently, so indexing PDF documents is a critical part of this context: it boosts search engine efficiency by reducing the search space. Content extraction from PDF documents goes a step further and allows more structured search queries. Extracting the content of a PDF file is therefore very important, but it is also very challenging, because PDF is a layout-based format that defines the fonts and locations of individual characters rather than the semantic units of the text and their roles within the document. In this research I have developed a technique to extract content from a PDF file. It can be used to allow more structured search queries on large document archives in output management systems, which typically work with world-leading ERP systems. The research mainly considers four aspects: correctly identifying words, word order within a paragraph, clear separation of paragraph boundaries, and the semantic role of each word. After the content is extracted from the PDF file, the extracted text is written to an XML document. The XML file contains tags that identify the pages, the rotation angle, and the number of images on each page. To evaluate the accuracy of the technique, content was extracted from a sample set of PDF invoices and the percentage of extracted words was calculated. According to the results, the tool achieves an accuracy rate of 94.27%.
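
As a rough illustration of the output format and the evaluation measure described in the abstract, the following Python sketch writes already-extracted text to an XML tree and computes an extracted word percentage. It is not the thesis implementation. The tag and attribute names (`document`, `page`, `paragraph`, `number`, `rotation`, `images`), the input dictionary layout, and both function names are hypothetical. The metric shown (the share of expected words that also appear in the extracted output) is an assumption about how the percentage might be calculated.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_xml(pages):
    """Build an XML tree from extracted page data.

    `pages` is a hypothetical structure: a list of dicts with the keys
    "rotation" (degrees), "images" (image count) and "paragraphs"
    (a list of paragraphs, each a list of words in reading order).
    """
    root = ET.Element("document")
    for number, page in enumerate(pages, start=1):
        page_el = ET.SubElement(
            root,
            "page",
            number=str(number),
            rotation=str(page["rotation"]),
            images=str(page["images"]),
        )
        for words in page["paragraphs"]:
            # One element per paragraph keeps paragraph boundaries explicit.
            ET.SubElement(page_el, "paragraph").text = " ".join(words)
    return root


def extracted_word_percentage(expected_words, extracted_words):
    """Percentage of expected words that were also extracted.

    Assumed metric: words are matched as a multiset, so a word that
    occurs twice in the reference must be extracted twice to count twice.
    """
    if not expected_words:
        return 0.0
    expected = Counter(expected_words)
    extracted = Counter(extracted_words)
    matched = sum((expected & extracted).values())
    return 100.0 * matched / len(expected_words)


if __name__ == "__main__":
    pages = [
        {"rotation": 0, "images": 1, "paragraphs": [["Invoice", "No", "1001"]]},
    ]
    print(ET.tostring(build_xml(pages), encoding="unicode"))
    print(extracted_word_percentage(["a", "b", "c", "d"], ["a", "b", "c"]))  # 75.0
```

Storing the rotation angle and image count as attributes of each page element keeps the per-page metadata separate from the paragraph text, which is one simple way to support structured queries over the output.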
