Settings pane
Settings for the data source and a list of Data Samples and JavaScript files used in the current data mapping configuration, can be found on the Settings tab at the left. The available options depend on the type of data sample that is loaded.
The Input Data settings (especially Delimiters) and Boundaries are essential to obtain the data and eventually, the output that you need. For more explanation, see Data source settings.
Input Data
The Input Data settings specify how the input data must be interpreted. These settings are different for each data type. For a CSV file, for example, it is important to specify the delimiter that separates data fields. PDF files are already delimited naturally by pages, so the input data settings for PDF files are interpretation settings for text in the file.
CSV file Input Data settings
In a CSV file, data is read line by line, where each line can contain multiple fields. The input data settings specify to the DataMapper module how the fields are separated.
- Field separator: Defines what character separates each field in the file. Even though CSV stands for comma-separated values, CSV can actually refer to files where fields are separated using any character, including commas, tabs, semicolons, and pipes.
- Text delimiter: Defines what character surrounds text in the file, preventing the Field separator from being interpreted within those text delimiters. This ensures that, for example, the field “Smith; John” is not interpreted as two fields, even if the field delimiter is the semicolon.
- Comment delimiter: Defines what character starts a comment line.
- Encoding: Defines what encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE ).
- Lines to skip: Defines a number of lines in the CSV that will be skipped and not used as records.
- Set tabs as a field separator: Overwrites the Field separator option and sets the Tab character instead for tab-delimited files.
- First row contains field names: Uses the first line of the CSV as headers, which automatically names all extracted fields.
- Ignore unparseable lines: Ignores any line that does not correspond to the settings above.
- Skip empty lines: Ignore any line that has no content. Note that spaces are considered content.
-
-
Sort on: Select a field on which to sort the data, in ascending (A-Z) or descending (Z-A) order. Note that sorting is always textual. Even if the selected column has numbers, it will be sorted as a text.
-
Excel file Input Data settings
There are no settings for field separation in an Excel file, only settings with regards to the file as a whole.
- Lines to skip: Defines a number of lines in the Excel file that will be skipped and not used as records.
- First row contains field names: Check this option to use the first line of the Excel file as headers. This option automatically names all extracted fields.
- Sheet: Only one sheet can be selected as the data source.
- Skip empty lines: Ignore any line that has no content. Note that spaces are considered content.
-
-
Sort on: Select a field on which to sort the data, in ascending (A-Z) or descending (Z-A) order. Note that sorting is always textual. Even if the selected column has numbers, it will be sorted as a text.
-
PDF file Input Data settings
These settings also apply to files that are converted to PDF by the DataMapper: PS, PCL and AFP files.
PDF Files have a natural, static delimiter in the form of pages, so the options here are interpretation settings for text in the PDF file.
The Input Data settings for PDF files determine how words, lines and paragraphs are detected in the PDF when creating data selections.
Each value represents a fraction of the average font size of text in a data selection, meaning "0.3" represents 30% of the height or width.
- Word spacing: Determines the spacing between words. As PDF text spacing is somehow done through positioning instead of actual text spaces, text position is what is used to find new words. This option determines what percentage of the average width of a single character needs to be empty to consider a new word has started. The default value is 0.3, meaning a space is assumed if there is a blank area of 30% of the width of the average character in the font.
- Line spacing: Determines the spacing between lines of text. The default value is 1, meaning the space between lines must be equal to at least the average character height.
- Paragraph spacing: Determines the spacing between paragraphs. The default value is 1.5, meaning the space between paragraphs must be equal to at least 1.5 times the average character height to start a new paragraph.
- Magic number: Determines the tolerance factor for all of the above values. The tolerance is meant to avoid rounding errors. If two values are more than 70% away from each other, they are considered distinct; otherwise they are the same. For example, if two characters have a space of exactly the width of the average character, any space of between 0.7 and 1.43 of this average width is considered one space. A space of 1.44 is considered to be 2 spaces.
- PDF file color space: Determines if the PDF if displayed in Color or Monochrome in the Data Viewer. Monochrome display is faster in the Data Viewer. This has no influence on the actual data extraction or the data mapping performance.
- Processing method: This option determines which search method is used by the Goto step's Next Occurrence of option. The most recent method searches for text within the specified constraints. It is more precise and more reliable than the Original method which searches for the target text in the entire page and then determines if the text appears within the specified constraints.
Both methods may sometimes return slightly different values. Data mapping configurations made with previous version of the software are therefore set to use the original method by default. It is however recommended to modify them to use the newer method. - Auto-Rotation (PS files only): If in the original PostScript file the orientation of a page and its text don't match, this option sets the page's orientation to match the orientation of the text.
Database Input Data settings
Databases all return the same type of information. Therefore the Input Data options for a database refer to the database itself instead of to the data.
The following settings apply to any database or ODBC Data Sample.
- Connection String: Displays the connection string used to access the Data Source.
- Browse button : Opens the Edit Database configuration dialog, which can replace the existing database data source with a new one. This is the same as using the Replace feature in the Data Samples window.
- Table: Displays the tables and stored procedures available in the database. The selected table is the one the data is extracted from. Clicking on any of the tables shows the first line of the data in that table.
- Custom SQL button : Click to open the SQL Query Designer (seeSQL Query Designer) and type in a custom SQL query. If the database supports stored procedures, including inner joins, grouping and sorting, you can use custom SQL to make a selection from the database, using whatever language the database supports.
The query may contain variables and properties, so that the selection will be dynamically adjusted each time the data mapping configuration is actually used in a Workflow process; seeUsing variables and properties in an SQL query. - Encoding: Defines what encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE ).
- Sort on: Allows to select a field on which to sort the data, in ascending (A-Z) or descending (Z-A) order. Note that sorting is always textual. Even if the sorting column has numbers, it will be sorted as a text.
With a Custom Query, this option is not available. - Skip empty lines: Ignore any row that has no content, e.g. only nulls or empty strings. Note that spaces are characters and thus considered content.
Text file Input Data settings
Because text files have many different shapes and sizes, there are many options for the input data in these files.
- Encoding: Defines what encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE ).
- Selection/Text is based on bytes: Check for text files that use double-bytes characters (resolves width issues in some text files).
- Add/Remove characters: Defines the number of characters to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank characters while negative values remove characters.
- Add/Remove lines: Defines the number of lines to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank lines while negative values remove lines.
- Maximum line length: Any line that is longer than the given maximum line length will be split at the maximum line length, as often as necessary. This option is used to cut (and wrap) long lines into logical blocks of data.
The maximum value for this option is 65,535 characters. The default value is 80 characters. - Page delimiter type: Defines the delimiter between each page of data. Multiples of such pages can be part of a record, as defined by the Boundaries.
- On lines: Triggers a new page in the Data Sample after a number of lines.
- Cut on number of lines: Triggers a new page after the given number of lines. With this number set to 1, and the Boundaries set to On delimiter, it is possible to create a record for each and every line in the file.
- Cut on FF: Triggers a new page after a Form Feed character.
- On text: Triggers a new page in the Data Sample when a specific string is found in a certain location.
- Word to find: Compares the text value with the value in the data source.
- Match case: Activates a case sensitive text comparison.
- Location: Choose Selected area or Entire width to use the value of the current data selection as the text value.
- Left/Right: Use the spin buttons to set the start and stop columns to the current data selection (Selected area) in the record.
- Lines before/after: This option places the delimiter a certain number of lines before or after the current line. This is useful if the text that triggers the delimiter is not on the first line of each page.
- On lines: Triggers a new page in the Data Sample after a number of lines.
- Text from right to left: Sets the writing direction of the data source to right-to-left.
- Expand tabs to spaces: Replaces tabs with the given number of spaces.
- Ignore CR/LF/FF at end of file: Instructs the DataMapper to ignore any CR, LF, FF or CR/LF characters when they are the last characters in a file. This prevents the addition of an unintended trailing record when the data mapping configuration is set up to cut on every line.
XML File Input Data settings
For an XML file you can either choose to use the root node, or select an element type, to create a new delimiter every time that element is encountered.
- Use root element: Selects the top-level element. No other boundaries can be set. If there is only one top-level element, there will only be one record.
- Use specific element: Displays a list containing all the elements in the XML file. Selecting an element causes a new page of data to be created every time an instance of this element is encountered.
Note that higher-level nodes above the selected element are shown, but those below the selected element are not displayed unless the Show all elements option is checked. - Use XPath: Enter an XPath to create a delimiter based on the node name of elements. For example:
./*[starts-with(name(),'inv')]
sets a delimiter after every element of which the name starts with 'inv'. Note thatstarts-with()
is an XPath function. For an overview of XPath functions, see Mozilla: XPath Functions.
The XPath may also contain JavaScript code. In order to use JavaScript:- The XPath must start with =
- The entire JavaScript statement must be enclosed in curly brackets: {...}
- Any other curly brackets that are not part of the JavaScript code must be escaped with a backslash.
- Single line comments (//...) in the code are not supported.
Currently, XPaths that select elements based on an attribute, attribute value, node value, node counter or node index are not supported.
-
Show all elements: When the delimiter is set to a specific element or XPath, checking this option allows to extract information from higher-level nodes, including those that follow the element or path. This might slow down the processing, so if you don't need any information from the higher-level nodes that follow that specific element, it is recommended to leave this option unchecked.
When this option is used in combination with a trigger element that is not repeated at the same node level (in other words, it doesn't have a sibling with the same name), the entire XML document will be shown for each record, except the trigger element, which will only be shown for the record that is currently selected in the Data Model pane.
This could lead to a problem with some steps that use an XPath with absolute indexes, such as a location-based Extract step. Using a dynamic index in the XPath will fix the problem. For example, in the case of a location-based Extract step, switch to extracting the data via JavaScript (see Expression-based field); in the JavaScript expression, replace the index of the element's parent in the XPath with record.index.
The information contained in all of the selected parent nodes will be copied for each instance of that node. For example, if a client node contains multiple invoice nodes, the information for the client node can be duplicated for each invoice.
The DataMapper only extracts elements for which at least one value or attribute value is defined in the file.
JSON File Input Data settings
For a JSON file you can either use the object or array at the root and get one output record, or select an object or array as parent element. Its direct child elements - objects and arrays, not key-value pairs - can be output as individual records.
- Use root element: Selects the top-level array or object. There will only be one record.
- Use specific element: Select an array [ ] or object { } in the JSON data as Parent element to define its child elements - objects and/or arrays - as source records. Any elements outside the parent element and key-value pairs inside the parent will be repeated in each source record.Only arrays and objects can be seen as a record. It is not possible to split the JSON between key-value pairs.
Boundaries
Boundaries are the division between records: they define where one record ends and the next record begins; for an explanation see Record boundaries.
CSV, Excel or Database file boundaries
Since database data sources are structured the same way as CSV and Excel files, the options for these file types are identical.
- Record limit: Defines how many records are displayed in the Data Viewer. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Line limit: Defines the limit of detail lines in any detail table. This is useful for files with a high number of detail lines, which in the DataMapper interface can slow down things. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Trigger: Defines the type of rule that controls when a boundary is set, creating a new record.
- Record(s) per page: Defines a fixed number of lines in the file that go in each record.
- Records: The number of records (lines, rows) to put in each record.
- On change: Defines a new record when a specific field (Field name) has a new value.
- Field name: Displays the fields in the top line. The boundaries are set on the selected field name.
- On script: Defines the boundaries using a custom JavaScript. For more information see Setting boundaries using JavaScript.
- On field value: Sets a boundary on a specific field value.
- Field name: Displays the fields in the top line. The value of the selected field is compared with the Expression below to create a new boundary.
- Expression: Enter the value or Regular Expression to compare the field value to.
- Use Regular Expression: Treats the Expression as a regular expression instead of static text. For more information on using Regular Expressions (regex), see theRegular-Expressions.info Tutorial.
- Record(s) per page: Defines a fixed number of lines in the file that go in each record.
PDF file boundaries
For a PDF file, Boundaries determine how many pages are included in each record. You can set this up in one of three ways: by giving a static number of pages; by checking a specific area on each page for text changes, specific text, or the absence of text; or by using an advanced script.
-
Record limit: Defines how many records are displayed in the Data Viewer. To disable the limit, use the value 0 (zero).
-
Trigger: Defines the type of rule that controls when a boundary is set, creating a new record.
- On page: Defines a boundary on a static number of pages.
- Number of pages: Defines how many pages go in each record.
- On text: Defines a boundary on a specific text comparison.
- Start coordinates (x,y): Defines the left and top coordinates of the data selection to compare with the text value.
- Stop coordinates (x,y): Defines the right and bottom coordinates.
- Use Selection: Select an area in the Data Viewer and click the Use selection button to set the start and stop coordinates to the current data selection.
Note:In a PDF file, all coordinates are in millimeters. - Times condition found: When the boundaries are based on the presence of specific text, you can specify after how many instances of this text the boundary can be effectively defined. For example, if a string is always found on the first and on the last page of a document, you could specify a number of occurrences of 2. This way, there is no need to inspect other items for whether it is on the first page or the last page. Having found the string two times is enough to set the boundary.
- Pages before/after: Defines the boundary a certain number of pages before or after the current page. This is useful if the text triggering the boundary is not located on the first page of the record.
- Operator: Selects the type of comparison (for example, "contains").
- Word to find: Compares the text value with the value in the data source.
- Match case: Makes the text comparison case sensitive.
- On page: Defines a boundary on a static number of pages.
Text file boundaries
For a text file, Boundaries determine how many 'data pages' are included in each record. These don't have to be actual pages, as is the case with PDF files. The data page delimiters are set in the Text file Input Data settings.
- Record limit: Defines how many records are displayed in the Data Viewer. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Selection/Text is based on bytes: Select this option for text records with fixed width fields whose length is based on the number of bytes and not the number of characters.
- Trigger: Defines the type of rule that controls when a boundary is set, creating a new record.
- On delimiter:Defines a boundary on a static number of pages.
- Occurrences: The number of times that the delimiter is encountered before fixing the boundary. For example, if you know that your documents always have four pages delimited by the FF character, you can set the boundaries after every four delimiters.
- On text: Defines a boundary on a specific text comparison.
- Location:
- Selected area:
- Select the areabutton: Uses the value of the current data selection as the text value. Making a new selection and clicking on Select the area will redefine the location.
- Left/Right: Defines where to find the text value in the row.
- Top/Bottom: Defines the start and end row of the data selection to compare with the text value.
- Entire width: Ignores the column values and compares using the whole line.
- Entire height: Ignores the row values and compares using the whole column.
- Entire page: Compares the text value on the whole page. Only available with
contains
,not contains
,is empty
andis not empty
operators.
- Selected area:
- Times condition found: When the boundaries are based on the presence of specific text, you can specify after how many instances of this text the boundary can be effectively defined. For example, if a string is always found on the first and on the last page of a document, you could specify a number of occurrences of 2. This way, there is no need to inspect other items for whether it is on the first page or the last page. Having found the string two times is enough to set the boundary.
- Delimiters before/after: Defines the boundary a certain number of data pages before or after the current data page. This is useful if the text triggering the boundary is not located on the first data page of the record.
- Operator: Selects the type of comparison (for example, "contains").
- Word to find: Compares the text value with the value in the data source.
- Use selected text button: copies the text in the current selection as the one to compare to it.
- Match case: Makes the text comparison case sensitive.
- Location:
- On script: Defines the boundaries using a custom JavaScript. For more information see Setting boundaries using JavaScript.
- On delimiter:Defines a boundary on a static number of pages.
XML file boundaries
The delimiter for an XML file is a node. The Boundaries determine how many of those nodes go in one record. This can be a specific number, or a variable number if the boundary is to be set when the content of a specific field or attribute within a node changes (for example when the invoice_number field changes in the invoice node).
- Record limit: Defines how many records are displayed in the Data Viewer. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Trigger: Defines the type of rule that controls when a boundary is set, creating a new record.
- On Element: Defines a new record on each new instance of the XML element selected in the Input Data settings.
- Occurrences: The number of times that the element is encountered before fixing the boundary.
- On Change: Defines a new record when a specific field or attribute in the XML element has a new value.
- Field: Displays the fields and (optionally) attributes in the XML element. The value of the selected field determines the new boundaries.
- Also extract element attributes: Check this option to include attribute values in the list of content items that can be used to trigger a boundary.
- On Element: Defines a new record on each new instance of the XML element selected in the Input Data settings.
JSON file boundaries
The delimiter for a JSON file is an object or array inside the selected parent element (see JSON File Input Data settings). The Boundaries determine how many of them go in one record.
- Record limit: Defines how many records are displayed in the Data Viewer. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero). The default value is 200.
- Trigger: Defines the type of rule that controls when a boundary is set, creating a new record.
- On element: Creates a new record in the output for each object or array, or - if you set a higher number of occurrences - after every n-th object or array in the parent element.
- Occurrences: The number of times that an element is encountered in the parent element before fixing the boundary.
On change: Creates a new record each time the value in a certain key-value pair changes.
- Field: Displays the keys of key-value pairs that exist at the root of direct child elements of the selected parent element. The value of the selected field determines the new boundaries.
- On element: Creates a new record in the output for each object or array, or - if you set a higher number of occurrences - after every n-th object or array in the parent element.
Data samples
The Data Sample area displays a list of all the imported Data Samples that are available in the current data mapping configuration. As many Data Samples as necessary can be imported to properly test the configuration.
Only one of the data samples - the active data sample - is shown in the Data Viewer.
A number of buttons let you manage the Data Samples.
In addition to using the buttons listed below, you can right-click a file to bring up the context menu, which offers the same options plus the Copy and Paste options.
To select or deselect multiple Data Samples, keep the Ctrl key pressed down while clicking on them, or keep the Shift key pressed down to select consecutive Data Samples.
Data samples can be copied and pasted to and from the Settings pane using Windows File Explorer.
- Add : Add a new Data Sample from an external data source. The new Data Sample will need to be of the same data type as the current one. For example, you can only add PDF files to a PDF data mapping configuration. Multiple files can be added simultaneously.
- Delete : Remove the current Data Sample from the data mapping configuration.
- Move up : Move the selected Data Sample(s) up the list.
- Move down : Move the selected Data Sample(s) down the list.
- Replace : Open a Data Sample and replace it with the contents of a different data source.
- Reload : Reload the currently selected Data Sample and any changes that have been made to it.
- Set as Active : Activates the selected Data Sample. The active data sample is shown in the Data Viewer after it has gone through the Preprocessor step as well as the Input Data and Boundary settings.
Editor Data Format
The Editor Data Format setting is only available for Excel files.
- Date Display Format: This setting specifies how dates must be displayed in the Data Viewer. Note that extracting a Date value will only be successful if the expected date format matches the actual format of a date in the Data Viewer (see:Data format settings.)
- Excel Default Format: Displays dates and times the way they would be displayed in Excel, using the specified locale. For date formats without a locale, the US English locale is used. (In the format selection dialog in Excel, these date formats are marked with an asterisk.) Values can show a date, or a time, or both.
Connect always uses this setting when opening an Excel file. - Current Locale Settings: Shows dates and times as formatted by Windows using the current Locale of the system on which Connect runs. All values are shown as a date including a time.
- ISO 8601 (UTC): Uses the ISO 8601 (UTC) format to display dates and times. All values are shown as a date including a time, taking the time zone into account. Note that when no time was specified with a date in the original file, the default time (12.00 AM) is used and converted; this may influence the displayed date.
Some Korean and Chinese date formats can't be parsed yet, and won't display correctly with any of these settings.
- Excel Default Format: Displays dates and times the way they would be displayed in Excel, using the specified locale. For date formats without a locale, the US English locale is used. (In the format selection dialog in Excel, these date formats are marked with an asterisk.) Values can show a date, or a time, or both.
External JS Libraries
Right-clicking in the box brings up a control menu, with the same options as are available through the buttons on the right.
- Add : Add a new external library. Use the standard Open dialog to browse and open the .js file.
- Delete : Remove the currently selected library from the data mapping configuration.
- Replace : Open a library and replace it with the contents of a different js file.
- Reload : Reload the currently selected library and any changes that have been made to it.
Default Data Format
The Default Data Format settings defined here apply to any new extraction in made in the current data mapping configuration. Any format already defined for an existing field remains untouched.
It is also possible to set a default format for dates and currencies in the user preferences (DataMapper preferences).
Specific settings for a field that contains extracted data are made via the properties of the Extract step that the field belongs to (see Editing fields).
- Negative Sign Before: Any value in a numeric field that has a "-" sign is interpreted as a negative value.
- Decimal Separator: Set the decimal separator for a numerical value.
- Thousand Separator: Set the thousand separator for a numerical value.
- Currency Sign: Set the currency sign for a currency value.
- Date Format: Set the date format for a date value.
- Automatic: Select this option to parse dates automatically, without specifying a format. This is the default setting for new Date fields.
- ISO8601: This setting allows for dates with different timestamp formats, or belonging to different time zones, to be parsed inside a single job. Dates that do not include a specific time are automatically considered to use the current locale's time zone.
Select the ISO template to be used when parsing the timestamp. Other ISO8601 formats can be handled via the Custom option. - Custom: Set a custom date format. For the markers available in the DataMapper see Defining a date/time format.
- Language: Set the language for a date value (ex: If English is selected, the term May will be identified as the month of May).
- Use offset from UTC: Select the default time zone, which is to be used to extract any timestamp that does not already include time zone information with the time.
- Treat empty as 0: A numerical empty value is treated as a 0 value.
Default data formats tell the DataMapper how certain types of data are formatted in the data source. They don't determine how these data are formatted in the Data Model or in a template. In the Data Model, data are converted to the native data type. Dates, for example, are converted to a DateTime object in the Data Model, and will always be shown as "year-month-day" plus the time stamp, for example: 2012-04-11 12.00 AM.