I am having trouble exporting to a text-searchable PDF.

kylesouza Posts: 55 ✭✭
I have set up a very simple process to OCR some PDFs:


I want the exported PDF to be text-searchable, so I have these settings:


But when I look at the exported document, there is no selectable text (see attached).
The Recognize process is generating text.


What am I doing wrong?
Kyle Souza
Data Wizard
P&P Oil & Gas Solutions

Answers

  • kylesouza Posts: 55 ✭✭
    The "Prefer Child Versions" option fixes two-thirds of the issue.
    The "Clear Content" option fixed one-third of the issue (not the third still missing from above), but causes two other issues.
  • dearner Posts: 96 admin
    Kyle - when you set "Prefer Child Versions" to true, you're still getting some exports that have no text associated with them? Do the documents that fail to export correctly have OCR text on their page objects?

    What additional problems does running "clear content" create?
  • kylesouza Posts: 55 ✭✭
    Correct, one of my three files is still exporting with no text. It does have OCR data associated with it in Grooper.

    With the "Clear Content" option, only one document is created even though two of the three are being processed. The output file also has no name, because I am using the native file name to name it, so the two exported files are probably overwriting each other. One file is erroring.
  • dearner Posts: 96 admin
    Yeah, I see now that you're using nativeversionfilename to name the document; if you remove the native version, obviously that won't work.  I'm curious as to why you're removing the link - if I wanted to export with the same filename, I'd probably base it on the link, rather than the native version filename.

    On the single file that's exporting incorrectly when you have "prefer child versions" turned on: does it behave differently when you run just one file through at a time?  Same question for the "clear content" method: if you clear the content right after you split, does the OCR generate correctly?  If so, then it's just a question of changing your filename expression to reference the link. 
  • dearner Posts: 96 admin
    One more comment - you might want to set "embed all fonts" to false. Although it shouldn't cause a problem, that option was intended for cases where Unicode characters outside the ANSI code page are generated and require additional font embedding to ensure compatibility with some PDF tools.
  • kylesouza Posts: 55 ✭✭
    Modifying my process to:

    results in two files exporting with searchable text, and the same one without it.
    I can see the text in Grooper for it though:


    Running just the "problem page" through by itself has the same results.

    -------------------

    Using "Clear Content" has the same end result:



    But the output file has no searchable text.
  • kylesouza Posts: 55 ✭✭
    @GrooperGuru that did it! Thank you both.
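
The naming problem raised in the thread (exports named from the native version filename after the native version has been cleared) can be illustrated with a minimal sketch. This is plain Python, not Grooper's expression language or API: the `Doc` class, its `link_path` and `native_filename` fields, the two naming functions, and the sample paths are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for an exported document (not a Grooper type)."""
    link_path: str                  # where the file was originally imported from
    native_filename: Optional[str]  # None once the native version has been cleared


def name_from_native(doc: Doc) -> str:
    # Naming by native filename: the stem is empty once the content is cleared.
    stem = PurePosixPath(doc.native_filename).stem if doc.native_filename else ""
    return f"{stem}.pdf"


def name_from_link(doc: Doc) -> str:
    # Naming from the import link still works after the native version is gone.
    return f"{PurePosixPath(doc.link_path).stem}.pdf"


def export_names(docs: List[Doc], namer: Callable[[Doc], str]) -> Dict[str, str]:
    """Simulate writing into one folder: a later file with the same name
    replaces an earlier one."""
    written: Dict[str, str] = {}
    for doc in docs:
        written[namer(doc)] = doc.link_path
    return written


# Two documents whose native versions were cleared (hypothetical sample paths).
docs = [
    Doc("/import/well_log_a.pdf", None),
    Doc("/import/well_log_b.pdf", None),
]

by_native = export_names(docs, name_from_native)
by_link = export_names(docs, name_from_link)

print(len(by_native), sorted(by_native))
print(len(by_link), sorted(by_link))
```

Under these assumptions, naming from the cleared native filename leaves a single nameless output that the second document overwrites, while naming from the link keeps one distinct file per document. That matches the symptom described above, though the actual fix in Grooper is a change to the filename expression rather than any code like this.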