Search Functionality
The PDF Extractor enables you to search for text in the GUI as well as at runtime. Below is a summary of the text-finding functionality:
•The Split object and the Location/Boundary Assignment support the Find Text method, which lets you search for text and identify a split position relative to this text.
•The Find Text method enables you to specify various filtering options. For example, you can search for text of a particular font face, size and weight.
•The Group/Filter object can group PDF data by text that is found or not found on a page.
•You can also search in the PDF View and Output panes using the Find dialog. The text-finding features vary depending on the pane (see details below).
Find dialog
You can search for text in the PDF View and Output panes of the PDF Extractor. To invoke the Find dialog, click inside the pane of interest and press Ctrl + F. You can also access the dialog via the Edit | Find menu command or via the toolbar.
Find dialog in the Output pane
The Find dialog shown below appears in the Output pane of the PDF Extractor. The Find options can be specified via buttons located below the search term field (screenshot below). When an option is toggled on, its button color changes to blue (see the Find Anchor button in the screenshot below).
Find options
You can select from the following options:
•Match case: Case-sensitive search when toggled on (Address is not the same as address).
•Match whole word: Only whole words are matched. For example, with Match whole word toggled on, the search term fit matches only the word fit; the fit in fitness is not matched.
•Regular expression: If toggled on, the search term will be read as a regular expression. See Regular expressions below for a description of how regular expressions are used.
•Filter results: Select one or more document components where the search is to be carried out.
•Find anchor: Found items are indexed in document order, and the index of the currently selected item is shown in the Find dialog. For example, the screenshot above shows that the second of four found items is currently selected. Clicking Find Next (highlighted at bottom right in the screenshot) takes you to the next found item in index order. However, if the Find Anchor option is toggled on, Find Next takes you to the next found item relative to the current cursor position. So, if the currently selected item is the first (say, 1 of 4) and you place the cursor after item 3, Find Next takes you to item 4, and not to item 2 (as would happen if Find Anchor were toggled off).
•Find in selection: When toggled on, locks the current text selection and restricts the search to the selection. Otherwise, the entire document is searched. Before selecting a new range of text, unlock the current selection by toggling off the Find in Selection option.
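The behavior of the Match case, Match whole word, Regular expression and Find anchor options can be sketched in Python. This is only an illustrative approximation built on Python's re module, not the PDF Extractor's own implementation. The names find_all and find_next, the use of character offsets for the cursor position, and the wrap-around at the end of the results are assumptions made for the example.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def find_all(text: str, term: str, match_case: bool = False,
             whole_word: bool = False, use_regex: bool = False) -> List[Span]:
    """Return the (start, end) span of every found item, in document order."""
    # Without the Regular expression option, the term is taken literally.
    pattern = term if use_regex else re.escape(term)
    if whole_word:
        # \b is Python's word-boundary assertion.
        pattern = r"\b(?:" + pattern + r")\b"
    flags = 0 if match_case else re.IGNORECASE
    return [m.span() for m in re.finditer(pattern, text, flags)]


def find_next(spans: List[Span], current: int, cursor: int, anchor: bool) -> int:
    """Return the 0-based index of the item that Find Next would select.

    current: index of the currently selected item.
    cursor:  character offset of the text cursor (hypothetical representation).
    anchor:  True if the Find anchor option is toggled on.
    """
    if not spans:
        raise ValueError("no found items")
    if anchor:
        # Next item relative to the cursor position.
        for i, (start, _end) in enumerate(spans):
            if start >= cursor:
                return i
        return 0  # assumption: wrap around to the first item
    # Next item in index order (wrap-around is also an assumption).
    return (current + 1) % len(spans)


sample = "fit fitness Fit outfit fit"
any_case = find_all(sample, "fit")                       # 5 items
whole_words = find_all(sample, "fit", whole_word=True)   # 3 items

# Four found items; item 1 is selected and the cursor sits just after item 3.
items = find_all("a x a x a x a", "a")
with_anchor = find_next(items, current=0, cursor=9, anchor=True)      # index 3 (item 4)
without_anchor = find_next(items, current=0, cursor=9, anchor=False)  # index 1 (item 2)
```

The last two lines reproduce the example given for Find anchor: with the option on, Find Next jumps to item 4; with it off, to item 2.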
Switch back and forth between search results
All search results are highlighted in the Output pane (see below). You can use the Back and Forward buttons to switch back and forth between the search results.
Regular expressions
You can use regular expressions (regex) to find a text string. To do this, follow the steps below:
1.First, switch the Regular expression option on (see Find options above). This specifies that the text in the search term field is to be evaluated as a regular expression.
2.Next, enter the regular expression in the search term field. For help with building a regular expression, click the Regular Expression Builder button, which is located to the right of the search term field (screenshot below).
3.Then click an item in the Builder to enter the corresponding regex metacharacter(s) in the search term field. The screenshot below shows a simple regular expression to find anything before the string king. For a brief description of metacharacters, see the section Regular expression metacharacters below.
Regular expression metacharacters
Given below is a list of regular expression metacharacters.
. | Matches any character. This is a placeholder for a single character. |
( | Marks the start of a tagged expression. |
) | Marks the end of a tagged expression. |
(abc) | The ( and ) metacharacters mark the start and end of a tagged expression. Tagged expressions may be useful when you need to tag ("remember") a matched region for the purpose of referring to it later (back-reference). Up to nine expressions can be tagged (and then back-referenced later, either in the Find or Replace field). For example, (the) \1 matches the string the the. This expression can be literally explained as follows: match the string "the" (and remember it as a tagged region), followed by a space character, followed by a back-reference to the tagged region matched previously. |
\n | Where n is a variable that can take integer values from 1 through 9. The expression refers to the first through ninth tagged region when replacing. For example, if the find string is Fred([1-9])XXX and the replace string is Sam\1YYY, this means that in the find string there is one tagged expression that is (implicitly) indexed with the number 1; in the replace string, the tagged expression is referenced with \1. If the find-replace command is applied to Fred2XXX, it would generate Sam2YYY. |
\< | Matches the start of a word. |
\> | Matches the end of a word. |
\x | Allows you to use a character x, which would otherwise have a special meaning. For example, \[ would be interpreted as [ and not as the start of a character set. |
[...] | Indicates a set of characters. For example, [abc] means any of the characters a, b or c. You can also use ranges: for example, [a-z] for any lowercase letter. |
[^...] | The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. |
^ | Matches the start of a line (unless used inside a set, see above). |
$ | Matches the end of a line. Example: A+$ to find one or more A's at end of line. |
* | Matches 0 or more times. For example, Sa*m matches Sm, Sam, Saam, Saaam and so on. |
+ | Matches 1 or more times. For example, Sa+m matches Sam, Saam, Saaam and so on. |
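Several of the examples in the table above can be tried out with Python's re module, which shares most of this syntax. The snippet is a sketch for experimentation only, and the two engines are not identical: Python does not support \< and \> (the \b word boundary is used here as the nearest equivalent), and in Python ^ and $ match at line starts and ends only when the re.MULTILINE flag is set. The sample strings are made up for illustration.

```python
import re

# Tagged expression with a back-reference: (the) \1 matches "the the".
doubled = re.search(r"(the) \1", "in the the end").group(0)

# Back-reference in the replace string: Fred2XXX becomes Sam2YYY.
replaced = re.sub(r"Fred([1-9])XXX", r"Sam\1YYY", "Fred2XXX")

# * matches 0 or more times, + matches 1 or more times.
words = ["Sm", "Sam", "Saam", "Sim"]
star = [w for w in words if re.fullmatch(r"Sa*m", w)]
plus = [w for w in words if re.fullmatch(r"Sa+m", w)]

# Complement of a set: any character except an alphabetic character.
non_alpha = re.findall(r"[^A-Za-z]", "a1b-c")

# $ matches the end of a line (re.MULTILINE is needed in Python).
trailing_a = re.findall(r"A+$", "BAA\nCA\nAB", flags=re.MULTILINE)

# \[ is read as a literal [ and not as the start of a character set.
escaped = re.findall(r"\[", "a[0]")

# Word start/end: Python uses \b where the table lists \< and \>.
whole = re.findall(r"\bfit\b", "fit fitness")
```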
Representation of special characters
Note the following expressions.
\r | Carriage Return (CR). You can use either CR (\r) or LF (\n) to find or create a new line. |
\n | Line Feed (LF). You can use either CR (\r) or LF (\n) to find or create a new line. |
\t | Tab character |
\\ | Use this to escape characters that appear in a regular expression, for example: \\\n |
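The difference between a special character and its escaped, literal form can be seen in a short Python sketch. It again relies on Python's re module as a stand-in, so it shows the general idea and not necessarily the exact behavior of the Find dialog. The sample strings are invented for the example.

```python
import re

# Split on line breaks written as CR+LF, CR alone, or LF alone.
lines = re.split(r"\r\n|\r|\n", "a\r\nb\nc\rd")

# \t finds tab characters.
tab_count = len(re.findall(r"\t", "x\ty\tz"))

# A path containing a literal backslash followed by the letter n.
path = r"C:\new"

# \\n matches a literal backslash followed by n ...
literal = re.search(r"\\n", path).group(0)

# ... whereas \n looks for a line feed, which the path does not contain.
line_feed = re.search(r"\n", path)
```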
Find dialog in the PDF View pane
You can also search in the PDF View pane. The Find dialog in the PDF View pane is illustrated below. In this dialog, you can enable the Match case and Match whole word options. For details, see Find options above.
Search results in PDF View pane
Search results are highlighted in the PDF View pane (screenshot below). You can use the Back and Forward buttons to jump back and forth between the search results.
Actions with search results
You can also right-click any of the search results and select a suitable option from the context menu.
For a description of the options in the context menu, see Selection Modes.