I am having trouble exporting to a text-searchable PDF.

kylesouza · May 2020

I have set up a very simple process to OCR some PDFs:

Image: https://us.v-cdn.net/6030453/uploads/editor/74/dobcbcl0auea.png

I want the output PDF to be text-searchable, so I have these settings:

Image: https://us.v-cdn.net/6030453/uploads/editor/vf/rn3hr1jwo6j3.png

But when I look at the exported document, there is no selectable text (see attached).
The Recognize process is generating text.

Image: https://us.v-cdn.net/6030453/uploads/editor/6b/9xu7u6b4w4bu.png

What am I doing wrong?

GrooperGuru · May 2020

Try also setting PDF Page Source to Image.

dearner · May 2020

Kyle - I had this issue pop up with one of our engineers a few weeks ago; what's likely happening is it's re-exporting the source PDF, which is still associated with the document object.

To fix this, you can do one of two things: you can turn on "prefer child versions" on the PDF options, or you can run a content action -> clear content at folder level 1 to remove the imported PDF version and turn the folder into a simple container.
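To make the reasoning behind those two fixes concrete, here is a rough Python model of the behavior described above. It is only a sketch: `Document`, `Page`, `choose_export_source`, and `clear_content` are hypothetical names made up for illustration, not Grooper's API, and the real export logic is likely more involved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    ocr_text: str = ""  # text produced by the Recognize step (hypothetical field)


@dataclass
class Document:
    # Imported source PDF, if it is still attached to the document object.
    native_pdf: Optional[bytes] = None
    pages: List[Page] = field(default_factory=list)


def choose_export_source(doc: Document, prefer_child_versions: bool) -> str:
    """Return which content this toy exporter would write.

    Assumption: with a native PDF attached and the option off, the original
    file is re-exported unchanged, so it carries no OCR text layer.
    """
    if doc.native_pdf is not None and not prefer_child_versions:
        return "native"
    return "pages"  # rebuilt from the page images plus their OCR text


def clear_content(doc: Document) -> None:
    """Drop the attached source file so the document is a plain container."""
    doc.native_pdf = None
```

In this model either fix leads to `"pages"`: passing `prefer_child_versions=True`, or calling `clear_content(doc)` before the export.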

Let me know if this gets you where you need to go once you've tried it!

kylesouza · May 2020

The "Prefer Child Versions" option fixes two-thirds of the issue.
The "Clear Content" option fixes one-third of the issue (not the missing third from above), but causes two other issues.

dearner · May 2020

Kyle - when you set prefer child versions to true, you're still getting some exports that have no text associated with them? Do the documents that fail to export correctly have OCR text on their page objects?

What additional problems does running "clear content" create?

kylesouza · May 2020

Correct, one of my three files is still exporting with no text. It does have OCR data associated with it in Grooper.

The Clear Content option creates only one document even though it processes two of the three. The output file also has no name, because I am using the native file name to name it, so the two output files are probably overwriting each other, and one file is erroring.

dearner · May 2020

Yeah, I see now that you're using nativeversionfilename to name the document; if you remove the native version, obviously that won't work. I'm curious as to why you're removing the link - if I wanted to export with the same filename, I'd probably base it on the link, rather than the native version filename.

On the single file that's exporting incorrectly when you have "prefer child versions" turned on: does it behave differently when you run just one file through at a time? Same question for the "clear content" method: if you clear the content right after you split, does the OCR generate correctly? If so, then it's just a question of changing your filename expression to reference the link.
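As a language-neutral illustration of the naming problem, the sketch below derives the output name from the link path and keeps two exports from landing on the same file. This is plain Python, not a Grooper filename expression; `export_name`, `link_path`, and the `unnamed` fallback are assumptions chosen for the example.

```python
from pathlib import PureWindowsPath
from typing import Optional, Set


def export_name(link_path: Optional[str], used: Set[str],
                fallback: str = "unnamed") -> str:
    """Build a unique PDF file name from the path a document was linked from."""
    # PureWindowsPath accepts both "\\" and "/" as separators on any OS.
    stem = PureWindowsPath(link_path).stem if link_path else ""
    stem = stem or fallback  # an empty name would make every export collide

    candidate = f"{stem}.pdf"
    counter = 1
    # Compare case-insensitively, as Windows file systems usually do.
    while candidate.lower() in used:
        counter += 1
        candidate = f"{stem}_{counter}.pdf"

    used.add(candidate.lower())
    return candidate
```

Called twice with the same `used` set and two links that both end in `invoice.pdf`, it yields `invoice.pdf` and then `invoice_2.pdf`. A missing link falls back to `unnamed.pdf` instead of an empty name that later exports would overwrite.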

dearner · May 2020

One more comment - you might want to set "embed all fonts" to false. Although it shouldn't cause a problem, that option was intended for cases where Unicode characters outside the ANSI code page are generated and require additional font embedding to ensure compatibility with some PDF tools.