The Settings Pane

The Delimiter and Boundary settings, as well as a list of the Data Samples used in the current data mapping configuration, can be found under the Settings tab. The available options depend on the type of data sample that is loaded. For more information about Delimiters and Boundaries, see Configuring The Data Source (Settings).

The Input Data (Delimiters)

Delimiters are borders that naturally separate blocks of data in the Data Sample, and they differ for each data type. For example, a CSV file is delimited by records, while PDF files are naturally delimited by pages.

For a CSV File

In a CSV file, data is read line by line, where each line can contain multiple fields. Even though CSV stands for comma-separated values, CSV can actually refer to files where fields are separated by any of a number of characters, including commas, tabs, semicolons and pipes. The input data settings let you tell the DataMapper module how the fields are separated. This is done using the Field separator. The other important option is the Text delimiter, which is wrapped around each field in case the field values contain the field separator. This ensures that, for example, the field “Smith; John” is not interpreted as two fields, even when the field separator is the semicolon.

  • Field separator: Defines what character separates each field in the file.
  • Text delimiter: Defines what character surrounds text fields in the file, preventing the Field separator from being interpreted within those text delimiters.
  • Comment delimiter: Defines what character starts a comment line.
  • Encoding: Defines what encoding is used to read the Data Source (US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE).
  • Lines to skip: Defines a number of lines in the CSV that will be skipped and not used as Source Records.
  • Set tabs as a field separator: Overwrites the Field separator option and sets the Tab character instead for tab-delimited files.
  • First row contains field names: Uses the first line of the CSV as headers, which automatically names all extracted fields.
  • Ignore unparseable lines: Ignores any line that does not correspond to the settings above.
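To make the interaction of the Field separator and Text delimiter concrete, here is a minimal sketch in JavaScript (the language the DataMapper uses for scripting). The `splitCsvLine` helper is purely illustrative, not the module's actual parser:

```javascript
// Minimal sketch: split one CSV line into fields, honouring a text
// delimiter so that separators inside delimited fields are not split on.
function splitCsvLine(line, separator, textDelimiter) {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (const ch of line) {
    if (ch === textDelimiter) {
      inQuotes = !inQuotes;   // toggle delimited state; the delimiter itself is dropped
    } else if (ch === separator && !inQuotes) {
      fields.push(current);   // an unquoted separator ends the field
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// "Smith; John" stays a single field despite the semicolon separator:
// splitCsvLine('1;"Smith; John";Boston', ';', '"') → ['1', 'Smith; John', 'Boston']
```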

For a PDF File

PDF files already have a clear, fixed delimiter: the page. The settings in the input area are therefore not used to set delimiters for PDF files; instead, they determine how words, lines and paragraphs are detected when text is read from the PDF to create data selections. Each value represents a fraction of the average font size of text in a data selection, meaning "0.3" represents 30% of the average character height or width.
  • Word spacing: Determines the spacing between words. As spacing in a PDF is often achieved through text positioning rather than actual space characters, text position is used to detect new words. This option determines what percentage of the average width of a single character needs to be empty to consider that a new word has started. The default value is "0.3", meaning a space is assumed if there is a blank area of 30% of the width of the average character in the font.
  • Line spacing: Determines the spacing between lines of text. The default value is 1, meaning the space between lines must be equal to at least the average character height.
  • Paragraph spacing: Determines the spacing between paragraphs. The default value is 1.5, meaning the space between paragraphs must be equal to at least "1.5" times the average character height to start a new paragraph.
  • Magic number: Determines the tolerance factor for all of the above values. The tolerance is meant to avoid rounding errors. If two values are more than 70% apart, they are considered distinct; otherwise they are treated as the same. For example, if two characters are separated by exactly the width of the average character, any space of between "0.7" and "1.43" times this average width is considered one space, while a space of "1.44" times the average width is considered to be two spaces.
  • PDF file color space: Determines whether the PDF is displayed in Color or Monochrome in the Data Viewer. Monochrome display is faster in the Data Viewer, but this has no influence on actual data extraction or on data mapping performance.
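The Magic number rule can be illustrated with a small calculation. The sketch below is illustrative only; `isOneSpace` is a hypothetical helper, not part of the product. It checks whether a gap counts as a single space under the default 0.7 tolerance factor:

```javascript
// Illustrative sketch of the Magic number tolerance (hypothetical helper).
// A gap counts as a single space when its ratio to the average character
// width falls inside the tolerance band.
function isOneSpace(gap, avgCharWidth, magicNumber = 0.7) {
  const ratio = gap / avgCharWidth;
  // With the default 0.7, the band runs from 0.7 to roughly 1.43 (1 / 0.7).
  return ratio >= magicNumber && ratio <= 1 / magicNumber;
}
```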

For Databases

Since data is taken from a database instead of a data file, the input data options refer to the database itself rather than to how the data is interpreted. Because a database generally contains multiple tables, they are all listed here; clicking on any table shows the first line of the data in that table. For more complex needs, click the Custom SQL button and query the database using whatever SQL features it supports, such as stored procedures, inner joins, grouping and sorting.

The following settings apply to any database or ODBC Data Sample.

  • Connection String: Displays the connection string used to access the Data Source.
  • Table: Displays the tables and stored procedures available in the database. The selected table is the one the data is extracted from.
  • Encoding: Defines what encoding is used to read the Data Source (US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE).
  • Browse button: Opens the Edit Database configuration dialog, which can replace the existing database data source with a new one. This is the same as using the Replace feature in the Data Samples window.
  • Custom SQL button: Click to open the SQL Query Designer and type in a custom SQL query.

For a Text File

Because text files come in many different shapes and sizes, they have many more input data options. You can add or remove characters or lines, for instance to get rid of a large header or of stray characters at the beginning of the file, or set a line width if you are still working with old line printer data. It is still important, however, that pages are defined properly. This can be done either by using a set number of lines or a Form Feed character or, if your data is more complex, by detecting text on the page. Note that these are page settings, not Boundary settings: configure them to detect each new page, not each new Source Record.

  • Encoding: Defines what encoding is used to read the Data Source (US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE).
  • Selection/Text is based on bytes: Check for text files that use double-byte characters (resolves width issues in some text files).
  • Add/Remove characters: Defines the number of characters to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank characters while negative values remove characters.
  • Add/Remove lines: Defines the number of lines to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank lines while negative values remove lines.
  • Maximum line length: Defines the number of columns on a data page. The spin buttons can also increment or decrement the value. The maximum value for this option is 65,535 characters. The default value is 80 characters. You should tune this value to the longest line in your input data. Setting a maximum data line length that greatly exceeds the length of the longest line in your input data may increase execution time.
  • Page delimiter type: Defines the delimiter between each page of data. Multiples of such pages can be part of a Source Record, as defined by the Boundaries.
    • On lines: Triggers a new page in the Data Sample after a static number of lines (called Lines per Page), or using a Form Feed character.
    • On text: Triggers a new page in the Data Sample when a specific string is found in a certain location.
      • Word to find: Compares the text value with the value in the Source Record.
      • Match case: Activates a case-sensitive text comparison.
      • Location: Choose Selected area or Entire width to use the value of the current data selection as the text value.
      • Left/Right: Use the spin buttons to set the start and stop columns to the current data selection (Selected area) in the Source Record.
      • Lines before/after: Defines the delimiter a certain number of lines before or after the current line. This is useful if the text triggering the delimiter is not on the first line of each page.
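The "On text" page trigger can be sketched as follows. The `findPageStarts` helper, its parameters and the sample data are hypothetical, purely to illustrate how the Word to find, Match case and Left/Right options interact:

```javascript
// Illustrative sketch: find the line indexes where a new page starts,
// using an "On text" style rule (a word found within a column range).
function findPageStarts(lines, wordToFind, left, right, matchCase = true) {
  const starts = [];
  lines.forEach((line, i) => {
    // Only look inside the Left/Right column range of each line.
    let zone = line.substring(left, right);
    let word = wordToFind;
    if (!matchCase) {
      zone = zone.toLowerCase();
      word = word.toLowerCase();
    }
    if (zone.includes(word)) starts.push(i);
  });
  return starts;
}
```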

For an XML File

XML is a special file format in that these files can theoretically have an unlimited number of structures. The input data has two simple options that determine at which element level a new record is created. Use root element uses the complete XML file as a single Source Record. The XML elements option lists all the elements in the file; choosing one creates a new delimiter every time that element is encountered.

The information contained in all of the selected parent nodes will be copied for each instance of that node. For example, if a client node contains multiple invoice nodes, the information for the client node can be duplicated for each invoice.
  • Use root element: Locks the XML Elements option to the top-level element. If there is only one top-level element, there will only be one record before the Boundaries are set.
  • XML elements: Displays a list containing all the elements in the XML file. Selecting an element causes a new page of data to be created every time an instance of this element is encountered.
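The duplication of parent information described above can be sketched as follows. The object shape, the `recordsFromTree` helper and the field names are hypothetical, purely to illustrate how parent data is carried into each record:

```javascript
// Simplified illustration of how parent-node information is duplicated
// for each record element. It assumes a pre-parsed object tree rather
// than raw XML; element and field names are hypothetical.
function recordsFromTree(parent, recordElement) {
  return parent[recordElement].map(child => ({
    ...parent.info, // parent (e.g. client) fields copied into every record
    ...child        // plus the record element's own fields
  }));
}

const client = {
  info: { name: 'ACME', id: 42 },
  invoices: [{ number: 'INV-1' }, { number: 'INV-2' }]
};
// recordsFromTree(client, 'invoices') yields two records, each carrying
// the client name and id alongside its own invoice number.
```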

The Boundaries

When the Data Source is received by the DataMapper, it has no boundaries to tell the DataMapper if it contains different records or where each of those records begins and ends. This is because boundaries are not actual bits of data (like a character or a field would be). Boundaries are a logical structure outside the Data Source (note that some formats like PDF/VT actually include structured information, but those are the exception rather than the rule). Boundaries are therefore a form of metadata. You could very well use the exact same data with a different boundary structure in order to extract different information. Think, for instance, of an Invoice Run stored in a PDF. You can use a structure where each invoice is a single record or you could group all invoices for one customer into a single record. So the boundaries for each record can be completely dependent on how you want to use the data.

With no actual boundary markers inside the data, there needs to be a way to identify specific locations in the input stream and mark those locations as record boundaries. Fortunately, every single file format has intrinsic, natural delimiters that are used to identify chunks of related data. These delimiters are key in helping us identify boundaries, so it is important to understand what they are as well as when and why they occur in the Data Source.

Let's start with a seemingly arbitrary assumption: a boundary can only occur on a natural delimiter. That is to say, a record boundary never occurs between delimiters; it only occurs on a delimiter. The actual information we need to determine whether a delimiter can be a boundary is very likely to be found between delimiters.

For a CSV or Database File

The natural delimiter for a CSV file is a data record or, to put it more visually, each line in a spreadsheet or in a SQL data grid. Several such delimiters can be included in a single record, but you would never expect to find the end of one particular record right in the middle of one of these lines. So a record boundary occurs on a new line in the grid, but not necessarily on each new line.

Since database data sources are structured the same way as CSV files, the options are identical. Boundaries define how many lines go into each Source Record. This can be a static number of lines, or a new record can be triggered by a field change, for example when the customer ID changes. There is also an advanced scripting option to determine boundaries (see JavaScript for Boundaries for details).

  • Record limit: Defines how many Source Records are displayed in the Data Viewer. This does not affect output production, as generating output ignores this option. To disable the limit, use the value "0".
  • Line limit: Defines the limit of detail lines in any detail table. This is useful for files with a high number of detail lines, which can slow things down in the DataMapper interface. This does not affect output production, as generating output ignores this option. To disable the limit, use the value "0".
  • Trigger: Defines the type of rule that controls when a boundary is created, establishing a new record in the Data Sample (called a Source Record).
    • Record(s) per page: Defines a fixed number of lines in the file that go in each Source Record.
      • Records: The number of records to include in each Source Record.
    • On change: Defines a new Source Record when a specific field (Field name) has a new value.
      • Field name: Displays the fields in the top line. The selected value determines new boundaries.
    • On script: Defines the boundaries using a custom user-defined JavaScript. For more information see Boundaries Using JavaScript.
    • On field value: Defines the boundary for the contents of a specific field value.
      • Field name: Displays the fields in the top line. The selected value is compared with the Expression below to create a new boundary.
      • Expression: Enter the value or Regular Expression that triggers a new boundary when it is the field value.
      • Use Regular Expression: Treats the Expression as a regular expression instead of static text. For more information on using Regular Expressions (regex), see the Regular-Expressions.info Tutorial.
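The "On change" and "On field value" triggers above can be sketched in JavaScript. The helpers `boundariesOnChange` and `boundariesOnFieldValue` are hypothetical illustrations, not the DataMapper's API; records are assumed to be already parsed into objects with named fields:

```javascript
// "On change": set a boundary wherever the watched field's value changes.
function boundariesOnChange(records, fieldName) {
  const bounds = [];
  let previous;
  records.forEach((rec, i) => {
    if (i === 0 || rec[fieldName] !== previous) bounds.push(i);
    previous = rec[fieldName];
  });
  return bounds;
}

// "On field value" with Use Regular Expression checked: set a boundary
// wherever the field's content matches the expression.
function boundariesOnFieldValue(records, fieldName, expression) {
  const re = new RegExp(expression);
  return records
    .map((rec, i) => (re.test(rec[fieldName]) ? i : -1))
    .filter(i => i >= 0);
}
```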

For a PDF File

Boundaries will determine how many pages are included in each of the Source Records. You can set this up in one of three ways: by giving it a static number of pages; by checking a specific area on each page for text changes, specific text, or the absence of text; or by using an advanced script. For example, you could check if “Page 1 of” appears at the top left of the page, which means it is the first page of each Source Record, regardless of how many pages are actually in the document.

While a record boundary always occurs on a new page, the opposite is not true: a new page is not always a record boundary.
  • Record limit: Defines how many Source Records are displayed in the Data Viewer. To disable the limit, use the value "0".
  • Trigger: Defines the type of rule that defines when a boundary is created, establishing a new record in the Data Sample (called a Source Record).
    • On page: Defines a boundary on a static number of pages.
      • Number of pages: Defines how many pages are in each Source Record.
    • On text: Defines a boundary on a specific text comparison in the Source Record.
      • Start coordinates (x,y): Defines the left and top coordinates of the data selection to compare with the text value.
      • Stop coordinates (x,y): Defines the right and bottom coordinates.
      • Use Selection: Select an area in the Data Viewer and click the Use selection button to set the start and stop coordinates to the current data selection in the Source Record.
      • In a PDF file, all coordinates are in millimeters.
      • Times condition found: When the boundaries are based on the presence of specific text, you can specify after how many instances of this text the boundary is effectively set. For example, if a string is always found on the first and on the last page of a document, you could specify two occurrences. This way, there is no need to determine whether the string is on the first page or the last page: once it has been found twice, that is enough to set the boundary.
      • Pages before/after: Defines the boundary a certain number of pages before or after the current page. This is useful if the text triggering the boundary is not located on the first page of each Source Record.
      • Operator: Selects the type of comparison (for example, "contains").
      • Word to find: Compares the text value with the value in the Source Record.
      • Match case: Makes the text comparison case-sensitive.
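The Times condition found counter can be sketched as follows. The `boundariesAfterOccurrences` helper is a hypothetical illustration: it receives one boolean per page (did the text condition match on that page?) and only fixes the boundary on the Nth match:

```javascript
// Illustrative sketch of "Times condition found" (hypothetical helper).
// pageMatches holds one boolean per page: did the text condition match?
function boundariesAfterOccurrences(pageMatches, times) {
  const bounds = [];
  let count = 0;
  pageMatches.forEach((matched, page) => {
    if (matched && ++count === times) {
      bounds.push(page); // the boundary is only fixed on the Nth match
      count = 0;         // start counting again for the next record
    }
  });
  return bounds;
}
```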

For a Text File

For a Text file, the natural delimiter is also a Page, but contrary to PDF, the Page delimiter can either be explicit (say, when a Form Feed character is encountered in the Data Source) or implicit (when a certain number of lines has been reached, usually around 66). Once more, the end of a record would not be found in the middle of a line. Note also that it is possible with this format to set the DataMapper's Input Data settings to 1 line per page. That essentially allows you to set the natural delimiter on each and every line in the file.

If the wrong location was selected, making a new selection and clicking the Select the area button redefines the location. The other option, Use selected text, simply copies the text in the current selection as the value to compare against.

  • Record limit: Defines how many Source Records are displayed in the Data Viewer. To disable the limit, use the value "0".
  • Trigger: Defines the type of rule that defines when a boundary is created, establishing a new record in the Data Sample (called a Source Record).
    • On delimiter: Defines a boundary after a static number of page delimiters.
      • Occurrences: The number of times that the delimiter is encountered before setting the boundary. For example, if you know that your documents always have four pages delimited by the FF character, you can set the boundary after every four delimiters.
    • On text: Defines a boundary on a specific text comparison in the Source Record.
      • Location:
        • Selected area:
          • Select the area button: Uses the value of the current data selection as the text value.
          • Left/Right: Defines where to find the text value in the row.
          • Top/Bottom: Defines the start and end row of the data selection to compare with the text value.
        • Entire width: Ignores the column values and compares using the whole line.
        • Entire height: Ignores the row values and compares using the whole column.
        • Entire page: Compares the text value on the whole page. Only available with contains, not contains, is empty and is not empty operators.
      • Times condition found: When the boundaries are based on the presence of specific text, you can specify after how many instances of this text the boundary is effectively set. For example, if a string is always found on the first and on the last page of a document, you could specify two occurrences. This way, there is no need to determine whether the string is on the first page or the last page: once it has been found twice, that is enough to set the boundary.
      • Delimiters before/after: Defines the boundary a certain number of delimiters before or after the current one. This is useful if the text triggering the boundary is not located on the first page of each Source Record.
      • Operator: Selects the type of comparison (for example, "contains").
      • Word to find: Compares the text value with the value in the Source Record.
      • Match case: Makes the text comparison case-sensitive.
    • On script: Defines the boundaries using a custom user-defined JavaScript. For more information see Boundaries Using JavaScript.
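The "On delimiter" trigger with an Occurrences count can be sketched as follows. The `splitOnFormFeeds` helper is a hypothetical illustration that groups every N pages (separated by Form Feed characters) into one Source Record:

```javascript
// Illustrative sketch of "On delimiter" with Occurrences (hypothetical
// helper): split a text stream into Source Records every N Form Feed
// (\f) page delimiters.
function splitOnFormFeeds(text, occurrences) {
  const pages = text.split('\f');
  const records = [];
  for (let i = 0; i < pages.length; i += occurrences) {
    // Each Source Record groups 'occurrences' consecutive pages.
    records.push(pages.slice(i, i + occurrences).join('\f'));
  }
  return records;
}
```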

For an XML File

Since we know the delimiter for an XML file is a node, all we need to set for the Boundaries is how many of those nodes go into each Source Record. This can be a specific number (for example, one invoice node per Source Record), or the boundary can be triggered when the content of a specific field within that node changes (e.g. when the invoice_number field changes in the invoice node).

  • Record limit: Defines how many Source Records are displayed in the Data Viewer. To disable the limit, use the value "0".
  • Trigger: Defines the type of rule for when a boundary is created, establishing a new record in the Data Sample (called a Source Record).
    • On Element: Defines a new Source Record on each new instance of the XML level selected in the XML elements.
      • Occurrences: The number of times that the selected element is encountered before setting the boundary. For example, if you know that each Source Record always contains four instances of the element, you can set the boundary after every four occurrences.
    • On Change: Defines a new Source Record when a specific field under the XML level has a new value.
      • Field: Displays the fields that are under the XML level. The value of the selected fields determines the new boundaries.

The Data Samples

The Data Sample area displays a list of all the imported Data Samples that are available now in the data mapping configuration. As many Data Samples as necessary can be imported to properly test the configuration.

Instead of using the buttons listed below, you can also right-click to bring up the context menu, which offers the same options.

  • Add : Adds a new Data Sample from an external Data Source. The new Data Sample will need to be of the same data type as the current one. For example, you can only add PDF files to a PDF data mapping configuration. In version 1.3 and higher, multiple files can be added simultaneously through the Add dialog.
  • Delete : Removes the current Data Sample from the data mapping configuration.
  • Replace : Replaces the current Data Sample with the contents of a different data source.
  • Reload : Reloads the currently selected Data Sample, picking up any changes that have been made to it.
  • Set as Active : Activates the selected Data Sample. The active data sample is shown in the Data Viewer after it has gone through the Preprocessor step as well as the Input Data and Boundary settings.

The External JS Libraries

Right-clicking in the box brings up a control menu, also available through the buttons on the right:

  • Add : Adds a new external library. Opens the standard Open dialog to browse and open the .js file.
  • Delete : Removes the currently selected library from the data mapping configuration.
  • Replace : Replaces the currently selected library with the contents of a different .js file.
  • Reload : Reloads the currently selected library, picking up any changes that have been made to it.

Default Data Format

Default Format Settings can also be defined at the DataMapper configuration level (see DataMapper Default Data Format for more information).

 
  • Last Topic Update: 24/01/2017 10:50
  • Last Published: 7/6/2017 : 9:48 AM