Setting boundaries using JavaScript

As soon as you select the On Script option as the trigger for establishing record boundaries (see Record boundaries), you are instructing the DataMapper to read the source file sequentially and to trigger an event each and every time it hits a delimiter. (What a delimiter is, depends on the source data and the settings for that data; see Input data settings (Delimiters)).
In other words, the script will be executed - by default - as many times as there are delimiters in the input data.

If you know, for instance, that a PDF file only contains documents that are 3 pages long, your script could keep count of the number of times it's been called since the last boundary was set (that is, the count of delimiters that have been encountered). Each time the count is a multiple of 3, it could set a new record boundary. This is basically what happens when setting the trigger to On Page and specifying 3 as the Number of Pages.

Remember that a boundary script is being called on each new delimiter encountered by the DataMapper parsing algorithm. If for instance a database query returns a million records, the script will be executing a million times! Craft your script in such a way that it doesn't waste time examining all possible conditions. Instead, it should terminate as soon as any condition it is evaluating is false.

Accessing data

Data available inside each event

Every time a delimiter is encountered, an event is triggered and the script is executed. The event gives the script access to the data between the current location - the start of a row, line or page - and the next delimiter. So at the beginning of the process for a PDF or text file, you have access to the first page only, and for a CSV or for tabular data, that would be the first row or record.

This means that you can:

  • Examine the data found in between delimiters for specific conditions.
  • Examine specific regions of that data, or the available data as a whole.
  • Compare the contents of one region with another.
  • Etc.

To access this data in the script, use the get() function of the boundaries object. This function expects different parameters depending on the type of source file; see get().

Getting access to other data

Data that is not passed with the event, but that is necessary to define the record boundaries, can be stored in the boundaries object using the setVariable function (see boundaries and setVariable()). The data can be retrieved using the boundaries' getVariable function (see getVariable()).
This way the script can access values that were evaluated in previous pages or rows, across delimiters, so you can easily set record boundaries that span over multiple delimiters.

For more information on the syntax, please refer to DataMapper Scripts API.

Examples

Basic example using a CSV file

​Imagine you are a classic rock fan and you want to extract the data from a CSV listing of all the albums in your collection. Your goal is to extract records that change whenever the artist OR the release year changes.

Here's what the CSV looks like:

"Artist","Album","Released"
"Beatles","Abbey Road",1969
"Beatles","Yellow Submarine",1969
"Led Zeppelin","Led Zeppelin 1",1969
"Led Zeppelin","Led Zeppelin 2",1969
"Beatles","Let it be",1969
"Rolling Stones","Let it bleed",1969
"Led Zeppelin","Led Zeppelin 3",1970
"Led Zeppelin","Led Zeppelin 4",1971
"Rolling Stones","Sticky Fingers",1971

The first line is just the header with the names of the CSV columns. The data is already sorted per year, per artist, and per album.

​Your goal is to examine two values in each CSV record and to act when either changes. The DataMapper GUI allows you to specify a On Change trigger, but you can only specify a single field. So for instance, if you were to set the record boundary when the "Released" field changes, you'd get the first four lines together inside a single record. That's not what you want since that would include albums from several different artists. And if you were to set it when the "Artist" field changes, the first few records would be OK but near the end, you'd get both the Led Zeppelin 3 and led Zeppelin 4 albums inside the same record, even though they were released in different years.

Essentially, we need to combine both these conditions and set the record boundary when EITHER the year OR the artist changes.

​Here's what the script would look like:​

/* Read the values of both columns we want to ​check */
var zeBand = boundaries.get(region.createRegion("Artist"));
var zeYear = boundaries.get(region.createRegion("Released"));

/* Check that at least one of our variables holding previous values has been initialized already, before attempting to compare the values */

if (boundaries.getVariable("lastBand")!=null) {
if (zeBand[0] != boundaries.getVariable("lastBand") || zeYear[0] != boundaries.getVariable("lastYear") )
{
boundaries.set();
}
}
boundaries.setVariable("lastBand",zeBand[0]);
boundaries.setVariable("lastYear",zeYear[0]);
  • ​The script first reads the two values from the input data, using the createRegion() method (see: createRegion()). For a CSV/database data type, the parameter it expects is simply the column name. The region is passed as a parameter to the get() method, which reads its contents and converts it into an array of strings (because any region, even a CSV field, may contain several line​s).​
  • To "remember" the values that were processed the last time the event was triggered, we use variables that remain available in between events. Note that these variables are specific to the Boundary context and not available in any other scripting context in the DataMapper.
  • The script first checks if those values were initialized. If they weren't, it means this is the first iteration so there's no need to compare the current values with previous values since there have been none yet. But if they have already been initialized, then a condition checks if either field has changed since last time. If that's the case, then a boundary is created through the set() method.
  • ​Finally, the script stores the values it just read in the variables using the setVariables() method. They will therefore become the "last values encountered" until the next event gets fired. When called, setVariables() creates the specified variable if it doesn't already exist and then sets the value to the second parameter passed to the function.

You can try it yourself. Paste the data into the text editor of your choice and save the file to Albums.csv. Then create a new DataMapper configuration and load this CSV as your data file. In the Data Input Settings, make sure you specify the first row contains field names and set the Trigger to On script. Then paste the above JavaScript code in the Expression field and click the Apply button to see the result.

​Basic example using a text file

This example is similar to the previous example, but now the data source is a plain text file that looks like this:

Beatles Abbey Road 1969
Beatles Yellow Submarine

1968

Led Zeppelin Led Zeppelin 1 1969
Led Zeppelin Led Zeppelin 2 1969
Beatles Let it be

1970

Rolling Stones Let it bleed 1969
Led Zeppelin Led Zeppelin 3 1970
Led Zeppelin Led Zeppelin 4 1971
Rolling Stones Sticky Fingers 1971

The purpose of the script, again, is to set the record boundary when EITHER the year OR the artist changes.

The script would look like this:

/* Read the values of both columns we want to check */
var zeBand = boundaries.get(region.createRegion(1,1,30,1));
var zeYear = boundaries.get(region.createRegion(61,1,65,1));

/* Check that at least one of our variables holding previous values have been initialized already, before attempting to compare the values */

if (boundaries.getVariable("lastBand")!=null) {
(zeBand[0]!=boundaries.getVariable("lastBand") || zeYear[0]!=boundaries.getVariable("lastYear"))
{
boundaries.set();
}
}
boundaries.setVariable("lastBand",zeBand[0]);
boundaries.setVariable("lastYear",zeYear[0]);

This script uses the exact same code as used for CSV files, with the exception of parameters expected by the createRegion() method. The get method adapts to the context (the data source file) and therefore expects different parameters to be passed in order to achieve the same thing. Since a text file does not contain column names as a CSV does, the API expects the text regions to be defined using physical coordinates. In this instance: Left, Top, Right, Bottom.

To try this code, paste the data into a text editor and save the file to Albums.txt. Then create a new DataMapper configuration and load this Text file as your data file. In the Data Input Settings, specify On lines as the Page delimiter type with the number of lines set to 1. When you now set the boundary Trigger to On script, the file will be processed line per line (triggering the event on each line). Paste the above code in the JavaScript expression field and click the Apply button to see the result.

The PDF context also expects physical coordinates, just like the Text context does, but since PDF pages do not have a grid concept of lines and columns, the above parameters would instead be specified in millimeters relative to the upper left corner of each page. So for instance, to create a region for the Year, the code might look like this:

region.createRegion(190,20,210,25)

which would create a region located near the upper right corner of the page.

That's the only similarity, though, since the script for a PDF would have to look through the entire page and probably make multiple extractions on each one since it isn't dealing with single lines like the TXT example given here.

For more information on the API syntax, please refer to DataMapper Scripts API.

 
  • Last Topic Update: 15, June, 2018 01:19 PM
  • Last Published: 13, September, 2019 08:35 AM