Boundaries Using JavaScript

As soon as you select the On Script option as the trigger for establishing record boundaries, you are instructing the DataMapper to read the file sequentially and to trigger an event each and every time it hits a delimiter. In other words, the script that you'll be writing will be executed - by default - as many times as there are delimiters in your input data.

If you know, for instance, that your PDF file only contains documents that are 3-pages, your script could keep count of the number of times it's been called since the last boundary was set (i.e. the count of delimiters it encountered) and each time the count is a multiple of 3, it could set a new record boundary. You would basically have recreated the same functionality that is already available through the GUI when setting the trigger to On Page and specifying 3 as the Number of Pages.

Remember that your script is being called on each new delimiter encountered by the DataMapper parsing algorithm. If you are dealing with a DB Query that returns a million record, the script will be executed a million times! So you have to craft your script in such a way that it doesn't waste too much time examining all possible conditions. Instead, it should terminate as soon as any condition it's evaluating is false.

Data available inside each event

Every time the event is triggered, it has access to the entire data between the current location and the next delimiter. So if you are just beginning the process for a PDF or text file, you have access to the first page only. For CSV/DB, it means you have access to the one record line at the current location.

What this means?

You can:

  • Examine the data found in between delimiters for specific conditions.
  • Examine specific regions of that data, or the available data as a whole.
  • Compare the contents of one region with another.

In fact, make all the comparisons you want as long as it's all located in the data between the current location and the next delimiter.

What happens when the record boundaries depend on data found on different pages, within the same record?

The API allows your script to "remember", across delimiters, the values that were evaluated in previous pages so you can easily set record boundaries that span over hundreds of delimiters (or more).

Examples

Basic example using a CSV file
In this first example, don't focus on the actual syntax being used. You can take a look at the API reference later on for more information.

​Imagine you are a classic rock fan and you want to extract the data from a CSV listing of all the albums in your collection. Your goal is to extract records that change whenever the artist OR the release year changes.

Here's what the CSV looks like:

"Artist","Album","Released"

"Beatles","Abbey Road",1969

"Beatles","Yellow Submarine",1969

"Led Zeppelin","Led Zeppelin 1",1969

"Led Zeppelin","Led Zeppelin 2",1969

"Beatles","Let it be",1969

"Rolling Stones","Let it bleed",1969

"Led Zeppelin","Led Zeppelin 3",1970

"Led Zeppelin","Led Zeppelin 4",1971

"Rolling Stones","Sticky Fingers",1971

The first line is just the header with the names of the CSV columns. Obviously, the data is already sorted per year, per artist, and per album.

​Your goal is to examine two values in each CSV record and to act when either changes. The DataMapper GUI allows you to specify a On Change trigger, but you can only specify a single field. So for instance, if you were to set the record boundary when the "Released" field changes, then you'd get the first four lines together inside a single record, but that's not what you want since that would include albums from several different artists. And if you were to set it when the "Artist" field changes, then the first few records would be OK but near the end, you'd get both the Led Zeppelin 3 and led Zeppelin 4 albums inside the same record, even though they were released on different years. So that's no good either.

Essentially, we need to combine both these conditions and set the record boundary when EITHER the year OR the artist changes.

​Here's what the script would look like:​

// Read the values of both columns we want to ​check
var zeBand = boundaries.get(region.createRegion("Artist"));
var zeYear = boundaries.get(region.createRegion("Released"));

// Check that at least one of our variables holding previous values have been initialized already, before attempting to compare the values

if (boundaries.getVariable("lastBand")!=null) {
  if ( zeBand[0]!=boundaries.getVariable("lastBand")
  || zeYear[0]!=boundaries.getVariable("lastYear") )
  {
    boundaries.set();
  }
}
boundaries.setVariable("lastBand",zeBand[0]);
boundaries.setVariable("lastYear",zeYear[0]);
  • ​The script first reads the two values from the input data, using the createRegion() API method. For a CSV/DB data type, the parameter it expects is simply the column name. The region is then passed as a parameter to the get() method, which reads its contents and converts it into an array of strings (because any region, even a CSV field, may contain several line​s).​
  • To "remember" the values that were processed the last time the event was triggered, we use variables that remain available in between events. Note that these variables are specific to the Boundary context and not available in any other scripting context in the DataMapper.
  • The script first checks if those values were initialized. If they weren't, it means this is the first iteration so there's no need to compare the current values with previous values since there have been none yet. But if they have already been initialized, then a condition checks if either field has changed since last time. If that's the case, then a boundary is created through the set() method.
  • ​Finally, the script stores the values it just read in the variables using the setVariables() method. They will therefore become the "last values encountered" until the next event gets fired. When called, setVariables() creates the specified variable if it doesn't already exist and then sets the value to the second parameter passed to the function.

You can try it yourself. Paste the data into the text editor of your choice and save the file to Albums.csv. Then create a new DataMapper configuration and load this CSV as your data file. In the Data Input Settings, make sure you specify the first row contains field names and set the Trigger to On script. Then paste the above JavaScript code in the Expression field and click the Apply button to see the result.

​Same basic example using a text file

So let's say we want to do the exact same thing, but this time around the Data Source is a plain text file that looks like this:

Beatles Abbey Road 1969
Beatles Yellow Submarine

1968

Led Zeppelin Led Zeppelin 1 1969
Led Zeppelin Led Zeppelin 2 1969
Beatles Let it be

1970

Rolling Stones Let it bleed 1969
Led Zeppelin Led Zeppelin 3 1970
Led Zeppelin Led Zeppelin 4 1971
Rolling Stones Sticky Fingers 1971

 

Then our script would look like this:

// Read the values of both columns we want to check
var zeBand = boundaries.get(region.createRegion(1,1,30,1));
var zeYear = boundaries.get(region.createRegion(61,1,65,1));

// Check that at least one of our variables holding previous values have been initialized already, before attempting to compare the values

if (boundaries.getVariable("lastBand")!=null) {
   if ( zeBand[0]!=boundaries.getVariable("lastBand")
   || zeYear[0]!=boundaries.getVariable("lastYear") )
   {
   boundaries.set();
   }
}
boundaries.setVariable("lastBand",zeBand[0]);
boundaries.setVariable("lastYear",zeYear[0]);
We're using the exact same code we as used for CSV files, with the exception of parameters expected by the createRegion() method. The API methods adapts to their context and therefore expect different parameters to be passed in order to achieve the same thing. Since a text file does not contain column names as a CSV does, the API expects the text regions to be defined using physical coordinates. In this instance: (Left, Top, Right, Bottom).

You can try this code as well. Paste the data into the text editor of your choice and save the file to Albums.txt. Then create a new DataMapper configuration and load this TXT as your data file. In the Data Input Settings, specify On lines as the Page delimiter type with the number of lines set to 1 so you can process the file line per line (i.e. triggering the event on each line). Then, set the boundary Trigger to On script and paste the above code in the JavaScript expression and click the Apply button to see the result.

The PDF context also expects physical coordinates, just like the Text context does, but since PDF pages do not have a grid concept of lines and columns, the above parameters would instead be specified in millimeters relative to the upper left corner of each page. So for instance, to create a region for the Year, the code might look like this:

region.createRegion(190,20,210,25)

which would create a region located near the upper right corner of the page.

That's the only similarity, though, since the script for a PDF would have to look through the entire page and probably make multiple extractions on each one since it isn't dealing with single lines like the TXT example given here.

For more information on the API syntax, please refer to DataMapper API.

 
  • Last Topic Update: 24/01/2017 09:32
  • Last Published: 7/6/2017 : 9:48 AM