Extracting data of variable length
In PDF and Text files, transactional data isn't structured uniformly, as in a CSV, database or XML file. Data can be located anywhere on a page. Therefore, data are extracted from a certain region on the page. However, the data can be spread over multiple lines and multiple pages:
- Line items may continue on the next page, separated from the line items on the first page by a page break, a number of empty lines and a letterhead.
- Data may vary in length: a product description for example may or may not fit on one line.
How to exclude lines from an extraction is explained in another topic: Extracting transactional data (see From a PDF or Text file).
This topic explains a few ways to extract a variable number of lines.
Text file: setting the height to 0
If the variable part in a TXT file is at the end of the record (for example, the body of an email) the height of the region to extract can be set to 0. This instructs the DataMapper to extract all lines starting from the current position in a record until the end of the record, and store them in a single field.
This also works with the data.extract() method in a script; see extract().
Finding a condition
Where it isn't possible to use a setting to extract data of variable length, the key is to find one or more differences between lines that make clear how big the region is from where data needs to be extracted.
Whilst, for example, a product description may extend over two lines, other data - such as the unit price - will never be longer than one line. Either the area above or the one below the unit price will be empty when the product description covers two lines.
Such a difference can then be used as a condition in a Condition step or a Case in a Multiple Conditions step.
A Condition step, as well as each Case in a Multiple Conditions step, can only check for one condition. To combine conditions, you would need a script.
Using a Condition step or Multiple Conditions step
Using a Condition step (Condition step) or a Multiple Conditions step (Multiple Conditions step) one could determine how big the region is that contains the data that needs to be extracted.
In each of the branches under the Condition or Multiple Conditions step, an Extract step could be added to extract the data from a particular region. The Extract steps could write their data to the same field.
Data cannot be extracted more than once in any record, unless the Extract steps are mutually exclusive. This is the case when they are located in different branches of a Condition step or Multiple Conditions step.
Inside a Detail table, multiple Extract steps may extract the same data but each of them will create a new child record in the Detail table.
If you tick the Append values to current record option when several steps are extracting the same field, the step will error out.
Create and edit the Extract step in the 'true' branch, then right-click the step on the Steps pane, select Copy Step, and paste the step in the 'false' branch. Now you only have to adjust the region from which this Extract step extracts data.
To learn how to configure a Condition step or a Case in a Multiple Conditions step, see Configuring a Condition step.
Using a script
A script could also provide a solution when data needs to be extracted from a variable region. This requires using a Javascript-based field.
- Add a field to an Extract step, preferably by extracting data from one of the possible regions; see Extracting data. To add a field without extracting data, see Expression-based field.
- On the Step properties pane, under Field Definition, select the field and change its Mode to Javascript.
If the field was created with its Mode set to Location, you will see that the script already contains one line of code to extract data from the original location. - Expand the script. Start by doing the check(s) to determine where the data that needs to be extracted is located. Use the
data.extract()
function to extract the data. The parameters that this function expects depend on the data source, see extract().
Example
The following script extracts data from a certain region in a Text file; let's assume that this region contains the unit price. If the unit price is empty (after trimming any spaces), the product description has to be extracted from two lines; else the product description should be extracted from one line.
var s = data.extract(1,7,1,2,"");
if (s.substring(1,3).trim().length == 0)
{ data.extract(12,37,0,2,""); } /* extract two lines */
else { data.extract(12,37,0,1,""); } /* extract one line */
The fourth parameter of the extract()
function contains the height of the region. When working with a Text file, this equals a number of lines.
With a Text file, the data.extract()
method accepts 0 as its height parameter. With the height set to 0 it extracts all lines starting from the given position until the end of the record.
Note that this script replicates exactly what can be done in a Condition step. In cases like this, it is recommended to use a Condition step. Only use a script when no steps are sufficient to give the expected result, or when the extraction can be better optimized in a script.