dc.contributor.advisor Perera I
dc.contributor.author Bandara RMCV
dc.date.accessioned 2020
dc.date.available 2020
dc.date.issued 2020
dc.identifier.uri http://dl.lib.uom.lk/handle/123/16780
dc.description.abstract organization better control over their information processes. When a business expands, it produces more documents, and these need to be carefully handled and tracked to be put to good use. Output management systems that work with ERP systems contain thousands of business documents, and Portable Document Format (PDF) is the common output format for these types of documents. These systems need to execute document search operations frequently. Indexing PDF documents is a critical part of this context, because it boosts document search engine efficiency by cutting the search space. Content extraction from PDF documents goes a step further and allows more structured search queries. Extracting the document content from a PDF file is therefore very important, but it is also a very challenging task, because PDF is a layout-based format that defines the fonts and locations of individual characters rather than the semantic units of the text and their roles within the document. In this research I have developed a technique to extract content from a PDF file. It can be used to allow more structured search queries on large document archives in output management systems, which typically work with world-leading ERP systems. This research mainly considers four aspects: correctly identifying words, word order within a paragraph, clear separation of paragraph boundaries, and the semantic role of each word. After the content is extracted from the PDF file, the extracted text is written to an XML document. The XML file contains tags that identify the pages, the rotation angle, and the number of images on each page. To evaluate the accuracy of this technique, content was extracted from a sample set of PDF invoices and the percentage of extracted words was calculated. According to the results, the tool achieves an accuracy rate of 94.27%. en_US
dc.language.iso en en_US
dc.subject COMPUTER SCIENCE AND ENGINEERING-Dissertations en_US
dc.subject COMPUTER SCIENCE-Dissertations en_US
dc.subject DATA PROCESSING, BUSINESS en_US
dc.subject BUSINESS COMMUNICATION-Portable Document Format en_US
dc.subject AUTOMATIC CONTENT EXTRACTION en_US
dc.title Content extraction from PDF invoices on business document archives en_US
dc.type Thesis-Full-text en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc in Computer Science en_US
dc.identifier.department Department of Computer Science & Engineering en_US
dc.date.accept 2020
dc.identifier.accno TH4255 en_US