I would be happy to explain further for those who have questions. I would also be happy to document my communications with Nuance (ScanSoft) describing "when" and "what" Nuance knew about these problems.
There is an OCR bug in virtually all recent versions of OmniPage. I have confirmed it in both OmniPage v14 and v15.
OmniPage OCR will incorrectly recognize valid text as BULLETED text. The only way to rearrange the text properly is to manually remove the BULLETED format. You have to manually locate and re-enter the text characters that OmniPage thinks are bullets. If you change the format without manually entering these characters, they will be automatically and silently deleted by OmniPage. These characters will then be MISSING from your end work.
This means that CHARACTERS can seem to randomly disappear from your OCR text.
If you are going to save the file as simple TEXT format, the problem will not occur.
You will see this happen in paragraphs starting with single-character words. It occurs very frequently in the INDEX and TABLE OF CONTENTS parts of a document. The problem is that OmniPage treats text that is not a bullet, such as random character patterns that merely look like bullets, as if it were part of the BULLETS.
There is currently no way to avoid this OmniPage bug. It has been reported, and ScanSoft/Nuance is aware of it. The Technical Support people incorrectly identified it as a "down stream" application bug and took no action. After a letter to the company president was ignored, I finally contacted their PRESS CONTACT people. I then received a call from their second line support people and clearly explained the problem.
There is a way to recognize most of the errors but not all. Most of the errors will trigger the BULLETED text check box indicating that bulleted text has been recognized. How to work around this part of the problem is described below.
The second class of problems is that OmniPage OCR will recognize certain character patterns as a TAB character. An "i.", "ii." or "iii." could be recognized as a pseudo TAB character, and OmniPage will insert a single TAB character.
WHAT YOU MUST DO TO AVOID THIS OMNIPAGE BUG
For each page that is processed by the OmniPage OCR that you are going to change, you MUST check the OmniPage “Bulleted” checkbox to make sure the page is not at risk.
After completing the OCR and proofreading of EACH PAGE, you should look at the “Bulleted” checkbox:
If the “Bulleted” checkbox is BLANK, then your page is OK to modify.
If the “Bulleted” checkbox is NOT BLANK, you have hit the bug. OmniPage has recognized characters on the page as special formatting characters and will possibly remove them if the text is rearranged.
If you modify the text or process the file with additional programs, you should also manually check all the text. I have observed that some text that could be interpreted as lower case Roman numerals on the RIGHT-HAND side of the page was simply deleted by OmniPage.
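One way to make that manual check less tedious, assuming you also keep a plain TEXT save of the same pages (which, as noted above, does not seem to lose characters), is to compare it word by word against the text of your formatted output. This is only a sketch of my own, not an OmniPage feature. The function name `dropped_tokens` is made up for illustration, and you would have to get the text out of your formatted file yourself.

```python
import difflib

def dropped_tokens(reference_text, exported_text):
    """Return words that are in the reference text but missing or changed in the export.

    reference_text: text from a plain TEXT save (assumed to be complete).
    exported_text:  text taken from the formatted output you want to verify.
    """
    ref = reference_text.split()
    out = exported_text.split()
    # autojunk=False so that frequent words are not ignored on long documents.
    matcher = difflib.SequenceMatcher(a=ref, b=out, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(ref[i1:i2])
    return missing
```

Anything it reports still has to be checked by eye against the page image; it only narrows down where to look.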
You can examine the LEFT-HAND column for the set of characters that are at risk. OmniPage will incorrectly process a large number of character sequences. All that seems to be required is that the characters be in the left hand column and “look like” a fragment of a list.
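If you have a plain text copy of the page, a rough pattern match can point out the lines worth examining first. The sketch below is only my guess at what "looks like a fragment of a list" based on the cases described here (single-character words, "i.", "ii.", "iii."). It is NOT OmniPage's actual rule, and the name `at_risk_lines` is hypothetical. Expect both false alarms and misses.

```python
import re

# Heuristic guess at "list-like" line starts; not OmniPage's actual rule.
AT_RISK = re.compile(
    r"""^\s*(?:
        [ivxlcdm]{1,4}[.)]      # lower-case Roman numeral such as "i." or "iii."
      | [A-Za-z0-9][.)]?\s      # single-character word or label such as "A " or "1. "
      | [-*o•·]\s               # characters that resemble bullet glyphs
    )""",
    re.VERBOSE,
)

def at_risk_lines(text):
    """Return (line_number, line) pairs whose start resembles a list fragment."""
    return [
        (n, line)
        for n, line in enumerate(text.splitlines(), start=1)
        if AT_RISK.match(line)
    ]
```

The pattern is deliberately loose, since the bug itself seems to fire on a large number of character sequences.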
I have added a file that shows the problem. Download it and use OmniPage to see the bug.
omnipage_bullet_error_example.pdf
When you have finished recognition of a document and have applied your own paragraph formatting schemes to it, OmniPage may not EXPORT every page to your destination file. You will need to check EACH EXPORTED PAGE to make sure that it was exported. This problem seems to happen to me on EVERY document. The first exported document that I checked had 16 of 100 pages missing. 15% - 20% of the pages seem to be skipped.
I have seen this "reoccur" in a file that I had repaired, closed and reopened.
There is no way to tell which page has been SKIPPED except to visually verify that each page has been EXPORTED by examining the EXPORTED file. The easiest way is to print each page to its own file and then sort the list by SIZE. A small file (about 2kb) will indicate that there is likely a problem.
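The sort-by-size check can be scripted once each page has been printed to its own file in one folder. This is a minimal sketch under my own assumptions: the function name `suspect_pages` is made up, and the default cutoff of 3072 bytes is just "about 2kb" with some headroom, so adjust it to whatever your blank pages actually come out as.

```python
from pathlib import Path

def suspect_pages(folder, threshold_bytes=3072):
    """Return (size, name) pairs, smallest first, for files at or below the cutoff.

    threshold_bytes is an assumed cutoff (a bit above "about 2kb"), not a
    value documented anywhere by OmniPage.
    """
    sizes = sorted(
        (p.stat().st_size, p.name) for p in Path(folder).iterdir() if p.is_file()
    )
    return [(size, name) for size, name in sizes if size <= threshold_bytes]
```

Calling `suspect_pages("pages")` on a folder named `pages` (a placeholder) gives the list of files to open and check by eye. A small file only means a page is LIKELY missing, so the visual check is still needed.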
A workaround is to make sure you insert formatting on the first and last line of EACH problem page to RESTORE the paragraphing back to the LEFT "Normal" margin.
I have not exercised all the conditions, but this workaround has succeeded on two documents so far. It seems to trigger something in OmniPage that causes the MISSING page in the document to be printed.