Extract PDF Data for Data Mapping

MapForce PDF Extractor

  • Easily define rules to extract data from PDF
  • Create templates to map PDF to other formats
  • Visual, point-and-click interface
  • Powerful suggestion engine for auto-extraction
  • Extract fragments, mix and match information
  • Sophisticated tools for extracting data in tables
  • Easily arrange data in groups
  • Schema tree for previewing template results
  • Map PDF to Excel, databases, JSON, CSV, & more
  • Automate PDF mapping with MapForce Server

PDF Data Extraction

Though PDF is a ubiquitous data format in business today, data contained in PDFs is not readily available for mapping to other systems. PDFs are typically designed for human-readable content with variable formatting and layouts, making structured data extraction extremely challenging. They may contain text, images, tables, and other elements, and the data is not organized in a machine-readable format. Typical PDF data extraction tools may not provide accurate results, especially for PDFs with complex layouts. That's where the MapForce PDF Extractor comes in.

The MapForce data mapping tool includes the MapForce PDF Extractor, an easy-to-use utility that allows you to quickly define the structure of a PDF document and extract data from it. In MapForce, that PDF data can then be transformed and converted to other formats such as XML, JSON, databases, and Excel. This makes PDF documents a usable source for data integration and ETL projects.

Using visual tools in the MapForce PDF Extractor, you can define the structure of a PDF document and efficiently extract its data. PDF Extractor is a highly flexible tool that allows you to extract only portions of text instead of the whole document, mix and match pieces of information from different pages of the same PDF file, split tables into rows, and arrange data into groups.

Access PDF data for conversion with MapForce PDF Extractor

The intuitive, straightforward design of the MapForce PDF Extractor makes it easy to define PDF document structure in a visual way, using point-and-click and drag-and-drop functionality. At last, the vast volumes of data previously locked in PDFs are available for mapping to other formats.

How to Extract PDF Data

When you load a sample PDF to create a template and define data extraction rules, the PDF is displayed next to a schema pane. The schema pane displays a tree structure that represents how the data will be extracted. The MapForce PDF Extractor includes a powerful suggestion engine that automatically identifies common document elements and attempts to detect their structure.

For instance, the suggestion engine will identify tables that exist in the document, which you can then opt to extract automatically. A split operator in the schema pane helps you define how to correctly divide the table into separate rows. The suggestion engine can look for edges or lines to create the split, or split based on a fixed distance, for example, which you can preview in the PDF view pane. At the same time, the suggestion engine captures columns and header text. Clicking on any object in the schema tree highlights the corresponding structure and data capture rules as they apply in the PDF document view.
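
The fixed-distance split described above can be illustrated outside the tool with a minimal Python sketch. It is not how the PDF Extractor is implemented; it only shows the idea of cutting a table region into equal-height rows. The function name `split_rows_fixed`, the `(y, x, text)` fragment layout, and the assumption that y is measured from the top of the page are all hypothetical.

```python
from collections import defaultdict

def split_rows_fixed(fragments, top, row_height):
    """Group (y, x, text) fragments into rows of equal height, starting at `top`.

    Hypothetical helper: y is assumed to grow downward from the top of the page.
    """
    rows = defaultdict(list)
    for y, x, text in fragments:
        # Index of the fixed-height band that contains this fragment
        index = int((y - top) // row_height)
        rows[index].append((x, text))
    # Order rows top to bottom, and cells left to right within each row
    return [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]

# Made-up sample fragments, deliberately not in reading order
fragments = [
    (100.0, 50.0, "Item"), (100.0, 200.0, "Amount"),
    (120.0, 50.0, "Rent"), (120.0, 200.0, "1200"),
    (140.0, 200.0, "300"), (140.0, 50.0, "Travel"),
]
table = split_rows_fixed(fragments, top=95.0, row_height=20.0)
for row in table:
    print(row)
```

A split based on edges or lines would replace the fixed `row_height` with the positions of detected ruling lines; the grouping step would stay the same.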

How to extract PDF data with MapForce

In large PDF documents with numerous tables, it may be helpful to search for text (such as a header) in the document and define rules for processing data relative to that text. For instance, when creating a template to extract data from yearly financial reports, you could search for “Expenditures” and process the table of figures following that text accordingly. Granular search options, such as case sensitivity, format filtering (font, font weight, etc.), and whole or partial word searches, allow for precise targeting.
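
As a rough illustration of anchoring on a search term, the following Python sketch finds a header such as “Expenditures” in a list of text lines and collects the rows that follow it. It is a conceptual example, not the PDF Extractor's search feature: the function names, the `case_sensitive` and `whole_word` options, and the rule that a blank line ends the table are assumptions made for this sketch.

```python
import re

def find_anchor(lines, term, case_sensitive=False, whole_word=True):
    """Return the index of the first line that matches `term`, or -1 if none does."""
    pattern = re.escape(term)
    if whole_word:
        pattern = r"\b" + pattern + r"\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    for i, line in enumerate(lines):
        if re.search(pattern, line, flags):
            return i
    return -1

def rows_after_anchor(lines, term, **options):
    """Collect the lines after the anchor, stopping at the first blank line."""
    start = find_anchor(lines, term, **options)
    if start < 0:
        return []
    rows = []
    for line in lines[start + 1:]:
        if not line.strip():
            break  # assumed table boundary: a blank line
        rows.append(line.split())
    return rows

# Made-up report text, one string per extracted line
report = [
    "Annual Report", "", "Expenditures",
    "Salaries 50000", "Rent 12000", "", "Revenue", "Sales 90000",
]
expenditures = rows_after_anchor(report, "Expenditures")
print(expenditures)
```

Because the rows are located relative to the anchor text rather than by page position, the same rule keeps working when the table moves from one year's report to the next.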

After the tabular data is extracted, you can adjust the extraction rules as necessary to exclude some fragments, adjust anchor assignments, define table boundaries, and so on. This can be accomplished using visual tools and helpful pull-down menus. You can preview the results of data extraction in the output tree to check for accuracy.

Other document elements can be captured and added to your template manually. To define rules for extracting data manually, simply select an area in the PDF to extract by capturing it in a rectangle. Then, select Text Capture from the right-click context menu. PDF Extractor adds the capture as an element in the document tree, and you can drag and drop it to the desired position in the tree.
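
Conceptually, a rectangle capture keeps the text whose position falls inside the selected area. The Python sketch below shows that idea on made-up data. The `Fragment` type, the `capture_text` function, and the coordinate convention (origin at the top left, one point per fragment) are hypothetical and do not describe the PDF Extractor's internals.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    """A piece of text and the page position where it starts (hypothetical model)."""
    x: float
    y: float
    text: str

def capture_text(fragments, left, top, right, bottom):
    """Return the text of all fragments that start inside the rectangle."""
    inside = [f for f in fragments if left <= f.x <= right and top <= f.y <= bottom]
    # Restore reading order: top to bottom, then left to right
    inside.sort(key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in inside)

page = [
    Fragment(60, 40, "Invoice"),
    Fragment(120, 40, "No."),
    Fragment(160, 40, "4711"),
    Fragment(60, 300, "Total"),
]
captured = capture_text(page, left=50, top=30, right=200, bottom=60)
print(captured)
```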

Extract PDF for data mapping

As you work, the MapForce PDF Extractor builds an XML document representing the structure of your PDF template with sample data from the working PDF document in the output window. This helps you understand and perfect the results of the extraction that will become a template to use in MapForce.
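
To give a feel for what such an output document might look like, here is a small Python sketch that builds an XML tree from extracted values with the standard library. The element names (`Document`, `Title`, `Expenditures`, `Row`, `Name`, `Amount`) and the sample values are invented for illustration; the actual structure depends entirely on the template you define.

```python
import xml.etree.ElementTree as ET

def build_output(title, rows):
    """Build an XML tree from a captured title and a list of (name, amount) rows."""
    root = ET.Element("Document")            # hypothetical root element
    ET.SubElement(root, "Title").text = title
    table = ET.SubElement(root, "Expenditures")
    for name, amount in rows:
        row = ET.SubElement(table, "Row")
        ET.SubElement(row, "Name").text = name
        ET.SubElement(row, "Amount").text = amount
    return root

root = build_output("Annual Report", [("Salaries", "50000"), ("Rent", "12000")])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```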

Preview the extracted data

Convert PDF Data

Once you save your template in the MapForce PDF Extractor, you are ready to insert it as a source data component in a MapForce data mapping project. Common PDF conversion requirements include:

  • PDF to Excel
  • PDF to databases (SQL or NoSQL)
  • PDF to JSON
  • PDF to CSV
  • PDF to XML

Of course, MapForce can also mix and match multiple source and target data formats, chain data mapping projects, and more. A rich library of data processing functions and a visual function builder make it easy to filter and process data before writing it to the destination(s).
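
In MapForce these conversions are defined visually, without code. Purely to illustrate what a PDF-to-JSON or PDF-to-CSV mapping does to the data, the following Python sketch converts a small XML document into both formats using the standard library. The XML layout and the function names are assumptions made for this example, not MapForce output.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical extraction result; the real structure depends on your template
XML = """<Document>
  <Expenditures>
    <Row><Name>Salaries</Name><Amount>50000</Amount></Row>
    <Row><Name>Rent</Name><Amount>12000</Amount></Row>
  </Expenditures>
</Document>"""

def rows_from_xml(xml_text):
    """Turn every <Row> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in row} for row in root.iter("Row")]

def to_json(rows):
    return json.dumps(rows, indent=2)

def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Name", "Amount"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = rows_from_xml(XML)
json_text = to_json(rows)
csv_text = to_csv(rows)
print(json_text)
print(csv_text)
```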

Convert PDF to JSON in MapForce

With the PDF Extractor, MapForce finally makes critical business data previously locked in PDFs available for data mapping, data integration, and ETL processes.

“Altova MapForce provides excellent mapping capabilities that we can seamlessly embed within our core products. The extensible nature of the product means it covers all of our solution requirements.”

Scott Redford, Visionware