Using the wizard for PDF/VT or AFP files
The pages in PDF/VT and AFP files can be grouped on several levels. Additional information can be attached to each level in the structure. The structure and additional information are stored in the file's metadata.
The DataMapper wizard for PDF/VT and AFP files lets you select the level that triggers the start of a new record, and lets you extract the additional information from the metadata. You can extract data from the page content afterwards.
To extract information from the metadata in the extraction workflow itself, you have to create a JavaScript extraction (see Using scripts in the DataMapper and extractMeta()).
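As a minimal sketch of such a JavaScript extraction, the snippet below reads one metadata property and falls back to an empty string when it is missing. It assumes that extractMeta() is called on the DataMapper's data object with a level name and a property name; check the extractMeta() reference for the exact signature. The level name "Document", the property name "CustomerID" and the stub data object are hypothetical and exist only so the sketch can run outside Connect.

```javascript
// Stand-in for the DataMapper's `data` object so this sketch runs on its own.
// In a real JavaScript extraction, `data` is supplied by the DataMapper.
const data = {
  // Hypothetical metadata: level name -> property name -> value.
  _meta: { Document: { CustomerID: "C-1042" } },
  // Assumed signature: extractMeta(levelName, propertyName).
  extractMeta(levelName, propertyName) {
    const level = this._meta[levelName] || {};
    return level[propertyName];
  },
};

// Read one metadata property; return an empty string when it is missing.
function extractCustomerId() {
  const value = data.extractMeta("Document", "CustomerID");
  return value === undefined || value === null ? "" : String(value);
}

const customerId = extractCustomerId();
```

Guarding against a missing value keeps the field predictable for records whose metadata does not define the property.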
If the PDF doesn't contain any metadata, each page is a new record (in other words, a boundary is set at the start of each page), which is exactly what happens when you open the file without a wizard.
You can open a PDF/VT or AFP file with a wizard using the Welcome screen or the File menu.
- From the Welcome screen
- Open the PlanetPress Connect Welcome page by clicking the icon at the top right, or by selecting Help > Welcome in the menu.
- Click New DataMapper Configuration.
- From the Using a wizard pane, select PDF/VT or AFP.
- Click the Browse button and open the PDF/VT or AFP file you want to work with. Click Next.
- From the File menu
- In the menu, click File > New.
- Click the Data mapping Wizards drop-down and select From PDF/VT or AFP.
- Click Next.
- Click the Browse button and open the PDF/VT or AFP file you want to work with. Click Next.
After selecting the file, select the following options in the Metadata page:
- Metadata record levels: Use the drop-down to select what level in the metadata defines a record.
- Field List: This list displays all fields on the chosen level and higher levels in the PDF/VT or AFP metadata. The right column shows the field name. The left column displays the level on which it is located. Check any field to add it to the extraction.
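To illustrate which fields end up in the Field List, the sketch below collects the fields on a chosen record level and on every level above it. It is a conceptual model only, not Connect's implementation; the level names, their order and the field names are hypothetical.

```javascript
// Hypothetical level order, from highest (outermost) to lowest.
const levelOrder = ["Job", "Group", "Document", "Page"];

// Hypothetical metadata fields defined on each level.
const fieldsByLevel = {
  Job: ["JobName"],
  Group: ["Region"],
  Document: ["CustomerID", "InvoiceNumber"],
  Page: ["PageLabel"],
};

// Return one row per field on the chosen level and on all higher levels.
function listFields(recordLevel) {
  const depth = levelOrder.indexOf(recordLevel);
  if (depth === -1) return []; // unknown level: nothing to list
  return levelOrder
    .slice(0, depth + 1)
    .flatMap((level) =>
      (fieldsByLevel[level] || []).map((name) => ({ level, name }))
    );
}

const fieldList = listFields("Document");
```

With "Document" as the record level, the fields on the lower "Page" level are left out, because a single record can span several pages.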
Click Finish to close the dialog and open the actual Data Mapping configuration.
On the Settings pane, you will see that the boundary trigger is set to On metadata. The selected metadata fields are added to the Data Model.
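The effect of an On metadata boundary can be pictured with the following sketch, which groups pages into records whenever the node on the chosen level changes, and falls back to one record per page when there is no metadata. It is a conceptual model under assumed data shapes, not the DataMapper's actual algorithm; the page objects and the "Document" level name are hypothetical.

```javascript
// Group page numbers into records based on a metadata level.
// Each hypothetical page may carry the ID of the node it belongs to per level.
function splitIntoRecords(pages, levelName) {
  const records = [];
  let previousId;
  pages.forEach((page, index) => {
    const id = page.levels ? page.levels[levelName] : undefined;
    // Without metadata, every page starts a new record.
    // With metadata, a new record starts when the node on the level changes.
    const startsRecord = index === 0 || id === undefined || id !== previousId;
    if (startsRecord) records.push([]);
    records[records.length - 1].push(page.number);
    previousId = id;
  });
  return records;
}

const withMeta = [
  { number: 1, levels: { Document: "d1" } },
  { number: 2, levels: { Document: "d1" } },
  { number: 3, levels: { Document: "d2" } },
];
const withoutMeta = [{ number: 1 }, { number: 2 }, { number: 3 }];

const recordsWithMeta = splitIntoRecords(withMeta, "Document");
const recordsWithoutMeta = splitIntoRecords(withoutMeta, "Document");
```

In this model the first input yields two records (pages 1 and 2, then page 3), while the input without metadata yields one record per page.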
Extracting data from a PDF that comes from a Windows printer queue (a PDF converted to PostScript and then converted back to PDF by an Input task in Workflow) might not work (see the Connect Knowledge Base).
The rule of thumb is: if you can copy and paste text from the PDF in Acrobat, data mapping will work; if you cannot, the DataMapper cannot extract the text either.
Rotated pages in a PDF are supported if they are rotated by 0, 90, 180 or 270 degrees. The Extract step can extract data from horizontal and vertical lines of text on rotated pages. Motion steps (such as the Repeat step and the Goto step), however, only work as expected if the text on a page has the same orientation as the page, not when the text was rotated after the page was rotated.
The page number and rotation of a page are shown in the status bar at the bottom, next to the region selection information.