Extracting data of variable length

In PDF and Text files, transactional data isn't structured uniformly, as in a CSV, database or XML file. Data can be located anywhere on a page. Therefore, data are extracted from a certain region on the page. The data can be spread over multiple lines and multiple pages, however:

  • Line items may continue on the next page, separated from the line items on the first page by a line break, a number of empty lines and a letterhead.
  • Data may vary in length: a product description for example may or may not fit on one line.

How to exclude lines from an extraction is explained in another topic: Extracting transactional data (see From a PDF or Text file).
This topic explains a few ways to extract data with variable lengths.

Finding a condition

The key to extracting data of variable length is to find one or more differences between lines that make clear how big the region is from where data needs to be extracted.
Whilst, for example, a product description may expand over two lines, other data - such as the unit price - will never be longer than one line. Either the line above or below the unit price will be empty when the product description covers two lines.
Such a difference can then be used as a condition in a Condition step or a Case in a Multiple Conditions step.
A Condition step, as well as each Case in a Multiple Conditions step, can only check for one condition. To combine conditions, you would need a script.

Using a Condition step or Multiple Conditions step

Using a Condition step (Condition step) or a Multiple Conditions step (Multiple Conditions step) one could determine how big the region is that contains the data that needs to be extracted.
In each of the branches under the Condition or Multiple Conditions step, an Extract step could be added to extract the data from a particular region. The Extract steps could write their data to the same field.

Fields cannot be used twice in one extraction workflow.
Different Extract steps can only write extracted data to the same field in the Data Model, if:

  • The field name is the same. (See: Renaming and ordering fields.)
  • The Extract steps are mutually exclusive. This is the case when they are located in different branches of a Condition step or Multiple Conditions step.
  • The option Append values to current record is checked in the Step properties pane under Extraction Definition.
Create and edit the Extract step in the 'true' branch, then right-click the step on the Steps pane, select Copy Step, and paste the step in the 'false' branch. Now you only have to adjust the region from which this Extract step extracts data.

To learn how to configure a Condition step or a Case in a Multiple Conditions step, see Configuring a Condition step.


Using a script

A script could also provide a solution when data needs to be extracted from a variable region. This requires using a Javascript-based field.

  1. Add a field to an Extract step, preferably by extracting data from one of the possible regions; see Extracting data. To add a field without extracting data, see Adding a JavaScript-based field.
  2. On the Step properties pane, under Field Definition, select the field and change its Mode to Javascript.
    If the field was created with its Mode set to Location, you will see that the script already contains one line of code to extract data from the original location.
  3. Expand the script. Start by doing the check(s) to determine where the data that needs to be extracted is located. Use the data.extract() function to extract the data. The parameters that this function expects depend on the data source, see extract().
Example

The following script extracts data from a certain region in a Text file; let's assume that this region contains the unit price. If the unit price is empty (after trimming any spaces), the product description has to be extracted from two lines; else the product description should be extracted from one line.

var s = data.extract(1,7,1,2,"");
if (s.substring(1,3).trim().length == 0)
{ data.extract(12,37,0,2,""); } /* extract two lines */
else { data.extract(12,37,0,1,""); } /* extract one line */

The fourth parameter of the extract() function contains the height of the region. When working with a Text file, this equals a number of lines.

Note that this script replicates exactly what can be done in a Condition step. In cases like this, it is recommended to use a Condition step. Only use a script when no steps are sufficient to give the expected result, or when the extraction can be better optimized in a script.

 
  • Last Topic Update: 07:27 AM Jun-19-2017
  • Last Published: 2019-05-22 : 2:51 PM