Settings pane
Settings for the data source and a list of Data Samples and JavaScript files used in the current data mapping configuration, can be found on the Settings tab at the left. The available options depend on the type of that is loaded. The Input Data settings (especially Delimiters) and are essential to obtain the data and eventually, the output that you need. For more explanation, see Data source settings.
Input Data
The Input Data settings specify how the input data must be interpreted. These settings are different for each data type. For a CSV file, for example, it is important to specify the delimiter that separates data fields. PDF files are already delimited naturally by pages, so the input data settings for PDF files are interpretation settings for text in the file.
CSV file Input Data settings
In a CSV file, data is read line by line, where each line can contain multiple fields. The input data settings specify to the module how the fields are separated.
- Field separator: Defines
what character separates each field in the file. Even though CSV stands for comma-separated values, CSV can actually refer to files where fields are separated using any character, including commas, tabs, semicolons, and pipes.
- Text delimiter: Defines
what character surrounds text in the file, preventing the Field
separator from being interpreted within those text delimiters. This ensures that, for example, the field “Smith; John” is not interpreted as two fields, even if the field delimiter is the semicolon.
- Comment delimiter: Defines
what character starts a comment line.
- Encoding: Defines what
encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8,
UTF-16, UTF-16BE or UTF-16LE ).
- Lines to skip: Defines
a number of lines in the CSV that will be skipped and not
used as records.
- Set tabs as a field separator:
Overwrites the Field separator option and sets the Tab character instead
for tab-delimited files.
- First row contains field names: Uses the
first line of the CSV as headers, which automatically names all extracted
fields.
- Ignore unparseable lines:
Ignores any line that does not correspond to the settings above.
Excel file Input Data settings
There are no settings for field separation in an Excel file, only settings with regards to the file as a whole.
- Lines to skip: Defines
a number of lines in the Excel file that will be skipped and not
used as records.
- First row contains field names: Check this option to use the first line of the Excel file as headers. This option automatically names all extracted fields.
- Sheet: Only one sheet can be selected as the data source.
PDF file Input Data settings
PDF Files have a natural, static delimiter in the form of pages, so the options here are interpretation settings for text in the PDF file. The Input Data settings for PDF files determine how words, lines and paragraphs are detected in the PDF when creating data selections. Each value represents a fraction of the average font size of text in a data selection, meaning "0.3" represents 30% of the height or width.
- Word spacing: Determines the spacing between words. As PDF text spacing is somehow done through positioning instead of actual text spaces, text position is what is used to find new words. This option determines what percentage of the average width of a single character needs to be empty to consider a new word has started. The default value is 0.3, meaning a space is assumed if there is a blank area of 30% of the width of the average character in the font.
- Line spacing: Determines the spacing between lines of text. The default value is 1, meaning the space between lines must be equal to at least the average character height.
- Paragraph spacing: Determines the spacing between paragraphs. The default value is 1.5, meaning the space between paragraphs must be equal to at least 1.5 times the average character height to start a new paragraph.
- Magic number: Determines the tolerance factor for all of the above values. The tolerance is meant to avoid rounding errors. If two values are more than 70% away from each other, they are considered distinct; otherwise they are the same. For example, if two characters have a space of exactly the width of the average character, any space of between 0.7 and 1.43 of this average width is considered one space. A space of 1.44 is considered to be 2 spaces.
- PDF file color space: Determines if the PDF if displayed in Color or Monochrome in the . Monochrome display is faster in the Data . This has no influence on the actual data extraction or the data mapping performance.
Database Input Data settings
Databases all return the same type of information. Therefore the Input Data options for a database refer to the database itself instead of to the data. The following settings apply to any database or ODBC Data Sample.
- Connection String: Displays
the connection string used to access the Data Source.
- Table: Displays the tables
and stored procedures available in the database. The selected table
is the one the data is extracted from. Clicking on any of the tables shows the first line of the data in that table.
- Encoding: Defines what
encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8,
UTF-16, UTF-16BE or UTF-16LE ).
- Browse button : Opens the Edit
Database configuration dialog, which can replace the existing database
data source with a new one. This is the same as using the Replace
feature in the Data Samples window.
- Custom SQL button : Click to open
the SQL Query Designer (see SQL Query Designer) and type
in a custom SQL query. If the database supports stored procedures, including inner joins, grouping and sorting, you can use custom SQL to make a selection from the database, using whatever language the database supports.
Text file Input Data settings
Because text files have many different shapes and sizes, there are many options for the input data in these files.
- Encoding: Defines what
encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8,
UTF-16, UTF-16BE or UTF-16LE ).
- Selection/Text is based on bytes: Check for text files that use double-bytes characters (resolves width issues in some text files).
- Add/Remove characters:
Defines the number of characters to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank characters while negative values remove characters.
- Add/Remove lines: Defines the number of lines to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank lines while negative values remove lines.
- Maximum line length: Defines the number of columns on a data page. The spin buttons can also increment or decrement the value. The maximum value for this option is 65,535 characters. The default value is 80 characters. You should tune this value to the longest line in your input data. Setting a maximum data line length that greatly exceeds the length of the longest line in your input data may increase execution time.
- Page delimiter type: Defines the delimiter between each page of data. Multiples of such pages can be part of a record, as defined by the Boundaries.
- On lines: Triggers a new page in the Data Sample after a number of lines.
- Cut on number of lines: Triggers a new page after the given number of lines. With this number set to 1, and the Boundaries set to On delimiter, it is possible to create a record for each and every line in the file.
- Cut on FF: Triggers a new page after a Form Feed character.
- On text: Triggers a new page in the Data Sample when a specific string is found in a certain location.
- Word to find: Compares the text value with the value in the data source.
- Match case: Activates
a case sensitive text comparison.
- Location:
Choose Selected area or Entire width to use the value of the current data selection as the
text value.
- Left/Right:
Use the spin buttons to set the start and stop columns to the current data
selection (Selected area) in the record.
- Lines before/after:
This option places the delimiter a certain number of lines before or
after the current line. This is useful if the text that triggers
the delimiter is not on the first line of each page.
- Text from right to left: Sets the writing direction of the data source to right-to-left.
- Expand tabs to spaces: Replaces tabs with the given number of spaces.
- Ignore CR/LF at end of file: Instructs the DataMapper to ignore the CR, LF or CR/LF characters when they are the last characters in a file. This prevents the addition of an unintended trailing record when the data mapping configuration is set up to cut on every line.
XML File Input Data settings
For an XML file you can either choose to use the root node, or select an element type, to create a new delimiter every time that element is encountered.
- Use root element: Locks the
XML Elements option to the top-level element. No other boundaries can be set. If there is only one top-level
element, there will only be one record.
- XML elements: Displays a list
containing all the elements in the XML file. Selecting an element causes
a new page of data to be created every time an instance of this element is encountered.
The information contained in all of the selected parent nodes will be copied for each instance of that node. For example, if a client node contains multiple invoice nodes, the information for the client node can be duplicated for each invoice.
The DataMapper only extracts elements for which at least one value or attribute value is defined in the file.
Boundaries
Boundaries are the division between records: they define where one record ends and the next record begins; for an explanation see Record boundaries.
CSV, Excel or Database file boundaries
Since database data sources are structured the same way as CSV and Excel files, the options for these file types are identical.
- Record limit: Defines how many records are displayed in the Data Viewer. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Line limit: Defines the limit of detail lines in any detail table. This is useful for files with a high number of detail lines, which in the DataMapper interface can slow down things. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Trigger: Defines the type of rule that controls when a boundary is set, creating a new record.
- Record(s) per page:
Defines a fixed number of lines in the file that go in each
record.
- Records: The number
of records (lines, rows) to put in each record.
- On change: Defines
a new record when a specific field (Field name) has a new value.
- Field name: Displays
the fields in the top line. The boundaries are set on the selected field name.
- On script: Defines
the boundaries using a custom JavaScript. For more
information see Setting boundaries using JavaScript.
- On
field value: Sets a boundary on
a specific field value.
- Field name: Displays
the fields in the top line. The value of the selected field is compared with the Expression below to create a new boundary.
- Expression: Enter
the value or Regular Expression to compare the field value to.
- Use Regular Expression: Treats the Expression as a regular expression
instead of static text. For more information on using Regular
Expressions (regex), see the Regular-Expressions.info
Tutorial.
PDF file boundaries
For a PDF file, Boundaries determine how many pages are included in each record. You can set this up in one of three ways: by giving a static number of pages; by checking a specific area on each page for text changes, specific text, or the absence of text; or by using an advanced script.
- Record limit: Defines how
many records are displayed in the Data Viewer. To disable the
limit, use the value 0 (zero).
- Trigger: Defines the
type of rule that controls when a boundary is set, creating a new record.
- On page: Defines a
boundary on a static number of pages.
- Number of pages:
Defines how many pages go in each record.
- On text: Defines a
boundary on a specific text comparison.
- Start coordinates (x,y):
Defines the left and top coordinates of the data selection
to compare with the text value.
- Stop coordinates (x,y):
Defines the right and bottom coordinates.
- Use Selection: Select an area in the Data Viewer and
click the Use selection button to set the start and stop coordinates to the current
data selection.
In a PDF file, all coordinates are in millimeters.
- Times condition found: When the boundaries are based on the presence of specific text, you can specify after how many instances of this text the boundary can be effectively defined. For example, if a string is always found on the first and on the last page of a document, you could specify a number of occurrences of 2. This way, there is no need to inspect other items for whether it is on the first page or the last page. Having found the string two times is enough to set the boundary.
- Pages before/after:
Defines the boundary a certain number of pages before or after
the current page. This is useful if the text triggering the boundary is not located on the first page of the record.
- Operator: Selects
the type of comparison (for example, "contains").
- Word to find: Compares the
text value with the value in the data source.
- Match case: Makes
the text comparison case sensitive.
Text file boundaries
For a text file, Boundaries determine how many 'data pages' are included in each record. These don't have to be actual pages, as is the case with PDF files. The data page delimiters are set in the Text file Input Data settings.
- Record limit: Defines how many records are displayed in the Data Viewer. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Selection/Text is based on bytes: Select this option for text records with fixed width fields whose length is based on the number of bytes and not the number of characters.
- Trigger: Defines the
type of rule that controls when a boundary is set, creating a new record.
- On delimiter: Defines a
boundary on a static number of pages.
- Occurrences: The number of times that the delimiter is encountered before fixing the boundary. For example, if you know that your documents always have four pages delimited by the FF character, you can set the boundaries after every four delimiters.
- On text: Defines a boundary
on a specific text comparison.
- Location:
- Selected area:
- Select the area button: Uses the value of the current data selection as the text
value. Making a new selection and clicking on Select the area will redefine the location.
- Left/Right: Defines where to find the text value in the row.
- Top/Bottom: Defines the start and end row of the data selection to compare with the text value.
- Entire width:
Ignores the column values and compares using the whole line.
- Entire height: Ignores the row values and compares using the whole column.
- Entire page: Compares
the text value on the whole page. Only available with
contains ,
not contains , is
empty and is not empty
operators.
- Times condition found: When the boundaries are based on the presence of specific text, you can specify after how many instances of this text the boundary can be effectively defined. For example, if a string is always found on the first and on the last page of a document, you could specify a number of occurrences of 2. This way, there is no need to inspect other items for whether it is on the first page or the last page. Having found the string two times is enough to set the boundary.
- Delimiters before/after: Defines the boundary a certain number of data pages before or after
the current data page. This is useful if the text triggering the boundary is not located on the first data page of the record.
- Operator: Selects the
type of comparison (for example, "contains").
- Word to find: Compares the text
value with the value in the data source.
- Use selected text button: copies the text in the current selection as the one to compare to it.
- Match case: Makes the
text comparison case sensitive.
- On script: Defines
the boundaries using a custom JavaScript. For more
information see Setting boundaries using JavaScript.
XML file boundaries
The delimiter for an XML file is a node. The Boundaries determine how many of those nodes go in one record. This can be a specific number, or a variable number if the boundary is to be set when the content of a specific field or attribute within a node changes (for example when the invoice_number field changes in the invoice node).
- Record limit: Defines how many records are displayed in the Data Viewer. This does not affect output production; when generating output, this option is ignored. To disable the limit, use the value 0 (zero).
- Trigger: Defines the
type of rule that controls when a boundary is set, creating a new record.
- On Element: Defines
a new record on each new instance of the XML element selected
in the Input Data settings.
- Occurrences: The number of times that the element is encountered before fixing the boundary.
- On Change: Defines
a new record when a specific field or attribute in the XML element
has a new value.
- Field: Displays the fields and (optionally) attributes in the XML element. The value of the selected field determines the new boundaries.
- Also extract element attributes: Check this option to include attribute values in the list of content items that can be used to trigger a boundary.
Data samples
The Data Sample area displays a list of all the imported Data Samples that are available in the current data mapping configuration. As many Data Samples as necessary can be imported to properly test the configuration. Only one of the data samples - the active data sample - is shown in the Data
Viewer.
A number of buttons let you manage the Data Samples. In addition to using the buttons listed below, you can right-click a file to bring up the menu, which offers the same options plus the Copy and Paste options.
Data samples can be copied and pasted to and from the Settings pane using Windows File Explorer.
- Add : Add a new Data Sample from an external data source. The new Data Sample will need to be of the same data type as the current one. For example, you can only add PDF files to a PDF data mapping configuration. Multiple files can be added simultaneously.
- Delete : Remove the current
Data Sample from the data mapping configuration.
- Replace : Open a Data
Sample and replace it with the contents of a different data source.
- Reload : Reload
the currently selected Data Sample and any changes that have been
made to it.
- Set as Active : Activates
the selected Data Sample. The active data sample is shown in the Data
Viewer after it has gone through the Preprocessor step as well as the Input Data and Boundary settings.
Editor Data Format
The Editor Data Format setting is only available for Excel files.
External JS Libraries
Right-clicking in the box
brings up a control menu, with the same options as are available through the buttons on the right.
- Add : Add a new
external library. Use the standard Open dialog to browse and open
the .js file.
- Delete : Remove the currently
selected library from the data mapping configuration.
- Replace : Open a library
and replace it with the contents of a different js file.
- Reload : Reload the currently selected library and any changes that have been
made to it.
Default Data Format
The Default Data Format settings defined here apply to any new extraction in made in the current data mapping configuration. Any format already defined for an existing field remains untouched. It is also possible to set a default format for dates and currencies in the user preferences (Datamapper preferences). Specific settings for a field that contains extracted data are made via the properties of the Extract step that the field belongs to (see Editing fields).
- Negative Sign Before : A negative sign will be displayed before any negative value.
- Decimal Separator : Set the decimal separator for a numerical value.
- Thousand Separator : Set the thousand separator for a numerical value.
- Currency Sign : Set the currency sign for a currency value.
- Date Format : Set the date format for a date value.
- Date Language : Set the date language for a date value (ex: If English is selected, the term May will be identified as the month of May).
- Treat empty as 0 : A numerical empty value is treated as a 0 value.
Default data formats tell the DataMapper how certain types of data are formatted in the data source. They don't determine how these data are formatted in the Data Model or in a template. In the Data Model, data are converted to the native data type. Dates, for example, are converted to a DateTime object in the Data Model, and will always be shown as "year-month-day" plus the time stamp, for example: 2012-04-11 12.00 AM.
|
|