|
Extracting data of variable length

In PDF and Text files, transactional data isn't structured uniformly, as it is in a CSV, database or XML file. Data can be located anywhere on a page. Therefore, data is extracted from a certain region on the page. However, the data can be spread over multiple lines and multiple pages.
How to exclude lines from an extraction is explained in another topic: Extracting transactional data (see From a PDF or Text file).

Text file: setting the height to 0

If the variable part in a TXT file is at the end of the record (for example, the body of an email), the height of the region to extract can be set to 0. This instructs the DataMapper to extract all lines starting from a given position in a record until the end of the record, and store them in a single field. This also works with the data.extract() method in a script; see extract().

Finding a condition

Where it isn't possible to use a setting to extract data of variable length, the key is to find one or more differences between lines that make clear how big the region is from which data needs to be extracted.

Using a Condition step or Multiple Conditions step

Using a Condition step (see Condition step) or a Multiple Conditions step (see Multiple Conditions step), one can determine how big the region is that contains the data that needs to be extracted. Fields cannot be used twice in one extraction workflow.
Create and edit the Extract step in the 'true' branch, then right-click the step on the Steps pane, select Copy Step, and paste the step in the 'false' branch. Now you only have to adjust the region from which this Extract step extracts data.
To learn how to configure a Condition step or a Case in a Multiple Conditions step, see Configuring a Condition step.

Using a script

A script can also provide a solution when data needs to be extracted from a variable region. This requires using a JavaScript-based field.
Example

The following script extracts data from a certain region in a Text file; let's assume that this region contains the unit price. If the unit price is empty (after trimming any spaces), the product description has to be extracted from two lines; otherwise the product description should be extracted from one line.
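The logic described above could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the column positions (1 to 40 for the description, 41 to 50 for the unit price) are invented for illustration, the parameter order (left, right, vertical offset, region height, separator) is assumed to match the extract() reference for Text files, and makeMockData is a hypothetical stand-in for the DataMapper's data object so that the sketch can run outside the DataMapper.

```javascript
// Hypothetical stand-in for the DataMapper's `data` object (not part of the product).
// Assumed signature for Text files:
//   extract(left, right, verticalOffset, regionHeight, separator)
// Here the vertical offset is simply an index into the record's lines (an assumption).
function makeMockData(lines) {
  return {
    extract: function (left, right, verticalOffset, regionHeight, separator) {
      // A height of 0 is treated as "all lines until the end of the record".
      var end = regionHeight === 0 ? lines.length : verticalOffset + regionHeight;
      var parts = [];
      for (var i = verticalOffset; i < end && i < lines.length; i++) {
        // Columns are treated as 1-based and inclusive (an assumption).
        parts.push(lines[i].substring(left - 1, right));
      }
      return parts.join(separator);
    }
  };
}

// Returns the product description from one or two lines, depending on
// whether the unit price region is empty. Column positions are invented.
function extractDescription(data) {
  var unitPrice = data.extract(41, 50, 0, 1, "");
  if (unitPrice.trim() === "") {
    // No unit price on this line: the description spans two lines.
    return data.extract(1, 40, 0, 2, " ");
  }
  // A unit price is present: the description fits on one line.
  return data.extract(1, 40, 0, 1, "");
}
```

In a JavaScript-based field, the same if/else would presumably call the built-in data object directly instead of the mock; the function wrapper is only there to keep the sketch testable on its own.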
With a Text file, the fourth parameter of the data.extract() method, the height of the region, can be set to 0. With the height set to 0, the method extracts all lines starting from the given position until the end of the record.
Note that this script replicates exactly what can be done in a Condition step. In cases like this, it is recommended to use a Condition step. Only use a script when no steps are sufficient to give the expected result, or when the extraction can be better optimized in a script. |
|