PDF text extraction tolerance factors

When extracting text from a PDF (for example, through a data selection), a lot more happens in the background than what can be seen on the surface. Reading a PDF file for text will generally return text fragments, separated by a certain amount of space. Sometimes the text will be shifted up or down, spacing will be different, etc. In some cases, every letter is considered to be a different fragment.

Text formatting features such as kerning, bold, and exponents (superscripts) may cause these fragments to be treated as separate even if, to the naked eye, they obviously belong together.

The PDF Text Extraction Tolerance Factors window is used to modify the behavior of data selections made from PDF data files within PlanetPress Workflow. Each factor available in this window determines whether two fragments of text in the PDF should be part of the same data selection.

The default values are generally correct for the vast majority of PDF data files. Only change these values if you understand what they are for.
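
For reference, the four factors described in the sections below can be gathered, with their documented defaults and accepted ranges, into one small structure. The following Python sketch is illustrative only: the ToleranceFactors class, its field names and the ACCEPTED_RANGES table are hypothetical and are not part of PlanetPress Workflow. Only the numeric defaults and ranges come from this page.

```python
from dataclasses import dataclass

# Accepted (min, max) range for each factor, as documented on this page.
ACCEPTED_RANGES = {
    "delta_width": (0.0, 1.0),
    "delta_height": (0.0, 1.0),
    "font_delta_height": (0.0, 1.0),
    "gap": (0.0, 0.5),
}


@dataclass(frozen=True)
class ToleranceFactors:
    """Hypothetical container; the field defaults are the documented defaults."""

    delta_width: float = 0.3
    delta_height: float = 0.15
    font_delta_height: float = 0.65
    gap: float = 0.3

    def __post_init__(self):
        # Reject any value that falls outside its documented accepted range.
        for name, (low, high) in ACCEPTED_RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}..{high}")
```

Creating ToleranceFactors() with no arguments yields the default values, while ToleranceFactors(gap=0.6) raises a ValueError because Gap only accepts values up to 0.5.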

Delta Width

Defines the tolerance for the distance between two text fragments, either positive (space between fragments) or negative (kerning text where letters overlap). When this value is at 0, the two fragments will need to be exactly one beside the other with no space or overlap between them.

When this value is at 1, a very large space or overlap will be accepted. This may cause "false positives": if the value is too high, separate words and text blocks may be considered a single word.

Accepted values range from 0 to 1. The default value is 0.3; recommended values are between 0.05 and 0.30.
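
The effect of Delta Width can be pictured with a small Python sketch. It is not the actual PlanetPress Workflow implementation: the function name, its parameters and, in particular, the choice to scale the tolerance by the font height are assumptions made for illustration, since this page does not state which unit the factor is relative to.

```python
def within_delta_width(first_right, second_left, font_height, delta_width=0.3):
    """Return True if the horizontal space or overlap is within tolerance.

    Assumption: the tolerance is scaled by the font height. The unit
    actually used by the software is not documented on this page.
    """
    # Positive distance = space between fragments.
    # Negative distance = overlap, as with kerned letters.
    distance = second_left - first_right
    return abs(distance) <= delta_width * font_height
```

Under these assumptions, with a factor of 0 only fragments that touch exactly are accepted, and raising the factor accepts progressively wider spaces and overlaps.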

Delta Height

Defines the tolerance for the height and position difference between two target fragments. The higher the number, the greater the accepted difference between the fragments' heights (the height of the tallest font character) and the greater the accepted vertical distance between fragments. Exponents, for example, are positioned higher or lower than the surrounding text.

When this value is 0, no vertical shift is accepted between two fragments. When the value is 1, the second text fragment can be shifted by as much as the height of the first fragment.

Accepted values range from 0 to 1. The default value is 0.15; recommended values are between 0.00 and 0.50.
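
As a rough illustration of Delta Height, the Python sketch below allows a vertical shift of up to the factor multiplied by the height of the first fragment, which matches the behavior described for the values 0 and 1. The function name and parameters are hypothetical, and applying the same allowance to the height difference is an assumption, not documented behavior.

```python
def within_delta_height(first_baseline, first_height,
                        second_baseline, second_height,
                        delta_height=0.15):
    """Return True if the second fragment is vertically close enough."""
    # The allowed deviation scales with the height of the first fragment:
    # at 1, the shift may be as large as that height; at 0, no shift is allowed.
    allowed = delta_height * first_height
    shift = abs(second_baseline - first_baseline)
    # Assumption: the height difference is checked against the same allowance.
    height_difference = abs(second_height - first_height)
    return shift <= allowed and height_difference <= allowed
```

For example, an exponent raised by 4 units next to a 12-unit-high fragment is rejected at the default of 0.15 (allowance of 1.8 units) but accepted at 0.5 (allowance of 6 units).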

Font Delta Height

Defines the tolerance for the difference in average height of fonts in the two target fragments. The higher the number, the more difference in average font heights will be accepted. The average font height is bigger in text written in uppercase than in text written in lowercase.

At 0, the font size must be exactly the same between two fragments. At 1, a greater variance in font size is accepted.

Accepted values range from 0 to 1. The default value is 0.65; recommended values are between 0.60 and 1.00.
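
One way to picture Font Delta Height is as a limit on the relative difference between the two average font heights. The Python sketch below takes that approach. The function name, its parameters and the relative-difference formula are assumptions made for illustration; the page only states that 0 requires identical sizes and that higher values accept more variance.

```python
def within_font_delta_height(first_avg_height, second_avg_height,
                             font_delta_height=0.65):
    """Return True if the average font heights are similar enough."""
    larger = max(first_avg_height, second_avg_height)
    if larger <= 0:
        # Degenerate input: only identical heights are treated as a match.
        return first_avg_height == second_avg_height
    # Assumption: the difference is measured relative to the larger height.
    relative_difference = abs(first_avg_height - second_avg_height) / larger
    return relative_difference <= font_delta_height
```

With this formula, a fragment averaging 7 units next to one averaging 10 units (a 30% difference) is accepted at the default of 0.65, while the same pair is rejected at 0.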

Gap

Defines how spaces between two fragments are processed. If the space between two fragments is too small, the text extraction will sometimes eliminate that space and count the two fragments as a single word. To resolve this, the Gap setting can be changed. The lower this value, the higher the chance of a space being added between two characters. A value too low may add spaces where they do not belong.

Accepted values range from 0 to 0.5. The default value is 0.3; recommended values are between 0.25 and 0.40.
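
The Gap behavior described above can be sketched in Python as follows. This is a simplified illustration, not the actual extraction logic: the join_fragments function, the tuple layout and the rule of comparing the distance to the Gap value multiplied by the font height are all assumptions.

```python
def join_fragments(fragments, gap=0.3):
    """Join fragments into one string, adding spaces where the gap is wide.

    Each fragment is a hypothetical (text, left, right, font_height) tuple,
    and the list is assumed to be ordered from left to right.
    """
    if not fragments:
        return ""
    pieces = [fragments[0][0]]
    for previous, current in zip(fragments, fragments[1:]):
        distance = current[1] - previous[2]
        # Assumption: a space is added when the distance between fragments
        # exceeds the Gap value multiplied by the font height.
        if distance > gap * previous[3]:
            pieces.append(" ")
        pieces.append(current[0])
    return "".join(pieces)


fragments = [("Invoice", 0.0, 40.0, 10.0), ("No", 42.0, 55.0, 10.0)]
print(join_fragments(fragments, gap=0.3))  # prints "InvoiceNo"
print(join_fragments(fragments, gap=0.1))  # prints "Invoice No"
```

In this sketch, lowering the value from 0.3 to 0.1 is enough to restore the space between the two words. Lowering it much further would also start adding spaces between letters that are only slightly apart.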