PDF Text Extract - Did it OCR?

We have a batch process that runs the 'PDF Text Extract' process on PDF documents.

The manual description of PDF Text Extract:
Performs text extraction from PDF files, optionally using an OCR Profile to read image-based content. 

At the end of the batch process, we export the documents using Document Export with the 'Make Searchable' flag set to True, so we get the OCR information exported with the PDF.

Is there a way to tell whether Grooper added new OCR text to a document, or just embedded the original text that was already in it?

Answers

  • jclark Posts: 60
    @hjanum
    You could use OCR Testing in the OCR Profile to run OCR Page on a page from one of your PDFs. That shows what OCR data was created, and you can verify that output against the text in the original PDF.
  • hjanum Posts: 78
    I am not sure that the solution will work for us as it requires staff to check every document manually.

    Our archival policy requires us to have a text layer in the PDFs so they can be searched. Many PDFs that we receive already meet this requirement, so we can store the original document. Some documents, however, come in with only image information in them. Some documents are hybrids, with some pages that contain both text and images. We use Grooper to OCR documents, and we export all documents with OCR/searchable text information from Grooper. We need our outbound service, which handles documents after the Grooper processing, to automatically detect whether the document should be checked in, superseding the original version.

    I admit this is a tricky problem to solve. Take the sample document below.


    The document contains text and an image. Grooper did what it's supposed to do and OCR'ed the image. Had the image been a scanned page of a text document, or a web page pasted in, we would have gotten useful text.

    The question is how do we automatically detect this in PROD, so we can decide when to check in the Grooper output image. 

    One thought would be to extract all the text available in the document before it goes into Grooper, and then do the same after Grooper processing. We could then set a threshold of X characters. If more than that has been added, then Grooper must have done significant OCR and the document should be checked in. To deal with both small and large documents, the threshold might be a combination, for example more than 10% added or more than 1000 characters added.

    We could do this outside Grooper, but it would be nice to have Grooper provide an indicator, so we don't have to custom code it.
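
For reference, the threshold idea described above could be sketched outside Grooper roughly as follows. This is only an illustration, not a Grooper feature: the function names `count_chars` and `should_check_in` are hypothetical, the 10% and 1000-character defaults are the example values from the post, and it assumes the text has already been extracted from the PDF before and after processing by whatever tool you use.

```python
def count_chars(text: str) -> int:
    """Count non-whitespace characters, so that layout differences
    (line breaks, spacing) do not skew the comparison."""
    return sum(1 for ch in text if not ch.isspace())


def should_check_in(before_text: str,
                    after_text: str,
                    min_added_chars: int = 1000,
                    min_added_ratio: float = 0.10) -> bool:
    """Decide whether OCR added enough text to justify checking in the
    processed document. Hypothetical helper; thresholds are assumptions."""
    before = count_chars(before_text)
    after = count_chars(after_text)
    added = max(after - before, 0)

    if added == 0:
        # Nothing new was added; keep the original document.
        return False
    if before == 0:
        # Image-only original: any recovered text counts as significant.
        return True

    # Absolute threshold covers large documents,
    # relative threshold covers small ones.
    return added >= min_added_chars or added / before > min_added_ratio


if __name__ == "__main__":
    original = "Invoice 1234"                             # text layer before processing
    processed = "Invoice 1234 Total due: 99.00 EUR"       # text layer after processing
    print(should_check_in(original, processed))
```

Comparing character counts is a crude measure: it does not check that the added text is meaningful, so OCR noise from a photo or logo would still count toward the threshold.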
