Altova MapForce 2025 Enterprise Edition

The Find Objects method can be particularly useful when the content lacks clear edges (for example, grid lines) along which it could be split. The method scans the search region: whenever a coordinate in the search direction has at least one pixel in the secondary direction that differs sufficiently from the background color, that pixel is counted as part of an object. Depending on which edge or edges of the objects you have selected, the splitter cuts the region into snippets along the resulting lines. These lines can also be adjusted if necessary. With an appropriate setup, the Find Objects method can also be used to detect large gaps between lines of text.
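The scan described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the function names, the representation of the page as rows of RGB tuples, and the way color deviation is measured (largest per-channel difference as a percentage of 255) are all assumptions.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]


def differs_from_background(pixel: Color, background: Color, tolerance_pct: float) -> bool:
    # Assumed metric: largest per-channel difference, as a percentage of 255.
    deviation = max(abs(p - b) for p, b in zip(pixel, background)) / 255 * 100
    return deviation > tolerance_pct


def find_objects(image: List[List[Color]],
                 background: Color = (255, 255, 255),
                 tolerance_pct: float = 10.0) -> List[Tuple[int, int]]:
    """Scan top to bottom (search direction). A row belongs to an object if at
    least one pixel across the row (secondary direction) is not background.
    Returns (start, end) row ranges; end is exclusive."""
    occupied = [
        any(differs_from_background(px, background, tolerance_pct) for px in row)
        for row in image
    ]
    objects = []
    start = None
    for i, flag in enumerate(occupied):
        if flag and start is None:
            start = i                      # an object begins here
        elif not flag and start is not None:
            objects.append((start, i))     # the object ended on the previous row
            start = None
    if start is not None:
        objects.append((start, len(occupied)))
    return objects


W, B, G = (255, 255, 255), (0, 0, 0), (240, 240, 240)
page = [[W, W], [W, B], [B, B], [G, W], [B, W]]
print(find_objects(page))  # [(1, 3), (4, 5)]
```

In this toy page, the light gray pixel in the fourth row stays within the 10% tolerance, so that row is treated as background and separates the two objects.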

 

Properties

The table below summarizes the properties of the object-finding method.

 

Property

Description

Background Color

The Background Color property specifies the background color of the PDF document as a hexadecimal color code. The default value is #FFF (white).

 

Tolerance

The Tolerance property specifies, as a percentage, how far a color may deviate from the color set in the Background Color property and still be considered background. Anything that deviates by more than this percentage is no longer considered background. For example, the value 100 means that everything is treated as background.

 

Minimum Extent

The Minimum Extent property specifies the minimum size of an object; any objects smaller than the value specified will be ignored.

 

Fill Gaps

The Fill Gaps property specifies the maximum size of a gap that is bridged along the search direction; if two non-background rows are no farther apart than this distance, they are considered to belong to a single object.

 

Edge to Find

The Edge to Find property determines at which edge of an object the region will be split: the beginning (Start), the end (End), or both the beginning and the end (Start and End) of the object.

 

Displace

The Displace property specifies an offset that will be added to the detected position of an object. The offset is usually negative when the Edge to Find property is set to Start and positive otherwise.
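To make the interplay of these properties more concrete, the following minimal Python sketch turns a list of detected objects into split positions. It is an assumption-laden illustration: the function name, the order in which the properties are applied (Fill Gaps, then Minimum Extent, then Edge to Find and Displace), and the use of plain numbers for point values are not taken from the product.

```python
from typing import List, Tuple


def split_positions(objects: List[Tuple[float, float]],
                    minimum_extent: float = 0.0,
                    fill_gaps: float = 0.0,
                    edge_to_find: str = "Start",
                    displace: float = 0.0) -> List[float]:
    """objects are (start, end) ranges along the search direction, in points."""
    # Fill Gaps: merge objects whose distance does not exceed fill_gaps.
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(objects):
        if merged and start - merged[-1][1] <= fill_gaps:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Minimum Extent: ignore objects smaller than the given size.
    kept = [(s, e) for s, e in merged if e - s >= minimum_extent]

    # Edge to Find + Displace: pick the edge(s) and shift them by the offset.
    positions: List[float] = []
    for s, e in kept:
        if edge_to_find in ("Start", "Start and End"):
            positions.append(s + displace)
        if edge_to_find in ("End", "Start and End"):
            positions.append(e + displace)
    return positions


detected = [(10.0, 18.0), (20.0, 21.0), (30.0, 40.0)]
print(split_positions(detected, minimum_extent=4, displace=-3))
# [7.0, 27.0]
print(split_positions(detected, fill_gaps=2, edge_to_find="Start and End"))
# [10.0, 21.0, 30.0, 40.0]
```

In the first call, the 1pt object is dropped and each remaining start edge moves 3pt toward the top. In the second call, the 2pt gap is bridged, so the first two objects are reported as one.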

 

 

For an example that uses the Find Objects method, see Example below.

 

Example

This example shows how to configure the Find Objects method. The goal is to extract table data from the sample invoice illustrated below.

[Screenshot: sample invoice (pdfex_bookinvoice_zoom60)]

The table shown in the screenshot above does not contain regular grid lines, which makes it difficult to identify correct split positions. In addition, the cells in the second column (No) and the cells in the third column (Description) overlap. To split the table into rows correctly, we have selected the Find Objects method and configured it as follows:

 

The Background Color and Tolerance properties have default values (#FFF and 10%, respectively).

The Minimum Extent property has been set to 4pt, which helps eliminate objects smaller than this value.

Since there are no gaps that can be filled in, the Fill Gaps property has its default value (0pt).

The Edge to Find property has been set to Start, which means the objects will be split in locations where they start.

By trial and error, we have identified the ideal value of the Displace property, which is -3pt. This value has caused the split positions to move slightly upwards, which will prevent the data from being truncated.

No post-processing options have been defined.

 

Search region

Since there are no consistent lines along which the table could be split into rows, we use the Search region to identify reliable split positions, which will then be applied to the whole Region. The screenshot below shows that the Region contains all the rows of the table (light yellow area). The Region represents an area that we want to split. However, the Search region (bright yellow rectangle below) covers only the first column of the table, in which detecting objects works more reliably than in other parts of the table.

[Screenshot: Region and Search region (PDFEX_BookInvoiceSearch)]

If no Search region is used, the splitter will identify the split positions shown below, which will lead to incorrect results in the output.

[Screenshot: split positions without a Search region (PDFEX_BookInvoiceNoSearch)]
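The effect of a Search region can be imitated with a small Python sketch. The grid below is invented toy data, not the invoice from the screenshots: each string is one scan line, "#" marks a non-background pixel, and the helper names row_starts and split_rows are hypothetical. Split positions are detected only within the given columns and then applied to the full width of the region.

```python
from typing import List, Tuple

# Invented toy region: the first two columns hold one entry per table row,
# while the wider column on the right has an extra, wrapped line of text.
REGION = [
    "##..####",  # table row 1
    "........",
    "....###.",  # wrapped text that still belongs to table row 1
    "........",
    "##..##..",  # table row 2
    "........",
]


def row_starts(ink: List[str], columns: Tuple[int, int]) -> List[int]:
    """Return the scan lines where an object starts, looking only at the
    columns [left, right) of each scan line."""
    left, right = columns
    occupied = ["#" in line[left:right] for line in ink]
    return [i for i, flag in enumerate(occupied)
            if flag and (i == 0 or not occupied[i - 1])]


def split_rows(ink: List[str], positions: List[int]) -> List[List[str]]:
    """Cut the full-width region at the given positions. Anything before the
    first position is not part of any snippet."""
    bounds = positions + [len(ink)]
    return [ink[a:b] for a, b in zip(bounds, bounds[1:])]


narrow = row_starts(REGION, (0, 2))  # search region: first column only
wide = row_starts(REGION, (0, 8))    # no search region: full width
print(narrow)  # [0, 4]
print(wide)    # [0, 2, 4]
print(len(split_rows(REGION, narrow)))  # 2
```

With the narrow search region, the toy table is cut into two snippets, one per table row. Scanning the full width adds a split position at the wrapped line, which would cut the first row in two.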

 

© 2018-2024 Altova GmbH