Classification and Extraction Efficiency Revisited

Sdurbin · July 2020

A couple more quick questions about classification and extraction:

1. Are extraction results used during classification shared across document layouts? In other words, will classification be more efficient if the classification extractors use shared data types whenever possible?
2. Are extraction results used during classification reused during data extract? In other words, will extraction be more efficient if classification and extraction use shared data types whenever possible?

GrooperGuru · July 2020

Each activity that runs against a document is essentially a discrete workload. The real performance benefits come from caching, but nothing cached by one activity is reused later by another. Within a single extraction, though, if an extractor is referenced 100 times throughout your model for that document, Grooper doesn't run it 100 times. It runs once, and the results are cached in memory until extraction is completed for that activity and Grooper moves on to the next document. At that point, the cache is cleared and a new one is created for the next document.
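To make the caching behavior concrete, here is a minimal Python sketch of the idea. It is not Grooper's code or API; `DocumentExtractionCache`, `find_dates`, and `extract_document` are hypothetical names made up for illustration. It assumes one cache per document, keyed by extractor name, that is thrown away when the document is done.

```python
import re
from typing import Callable, Dict, List, Tuple


class DocumentExtractionCache:
    """Hypothetical per-document cache: each named extractor runs at most once."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._results: Dict[str, List[str]] = {}
        self.run_count = 0  # how many times an extractor actually executed

    def run(self, name: str, extractor: Callable[[str], List[str]]) -> List[str]:
        # Only execute the extractor on the first reference; reuse afterwards.
        if name not in self._results:
            self.run_count += 1
            self._results[name] = extractor(self.text)
        return self._results[name]


def find_dates(text: str) -> List[str]:
    """Stand-in for a shared extractor (illustrative only)."""
    return re.findall(r"\d{4}-\d{2}-\d{2}", text)


def extract_document(text: str, references: int = 100) -> Tuple[List[str], int]:
    # A fresh cache per document mirrors "the cache is cleared and a new one
    # is created for the next document".
    cache = DocumentExtractionCache(text)
    for _ in range(references):
        cache.run("dates", find_dates)
    return cache.run("dates", find_dates), cache.run_count
```

Calling `extract_document` on a document references the extractor 100 times, but `run_count` should stay at 1, because every reference after the first is served from the cache.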

On the classification side, I am about 95% sure that there isn't any compute efficiency to be gained from reusing common extractors. That said, when performing rules-based classification, extraction stops as soon as the first rule hits, so it can often run pretty quickly, depending on the overall number of rules and document types.
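The "stops as soon as the first rule hits" behavior can be sketched roughly like this in Python. Again, this is an illustration under assumptions, not how Grooper implements it: the `classify` function, the `RULES` list, and the regex patterns are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rules: (document type, pattern), checked in order.
RULES: List[Tuple[str, str]] = [
    ("Invoice", r"\binvoice\b"),
    ("Purchase Order", r"\bpurchase order\b"),
    ("Receipt", r"\breceipt\b"),
]


def classify(text: str, rules: List[Tuple[str, str]]) -> Tuple[Optional[str], int]:
    """Return the first matching document type and how many rules were evaluated."""
    evaluated = 0
    for doc_type, pattern in rules:
        evaluated += 1
        if re.search(pattern, text, re.IGNORECASE):
            # First hit wins; the remaining rules are never run.
            return doc_type, evaluated
    return None, evaluated
```

In this sketch the cost depends on how far down the list the matching rule sits: a document that matches the first rule evaluates one pattern, while an unmatched document evaluates all of them. That is why the overall number of rules and document types matters.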
