justinzane

Reputation: 1987

Fuzzy Regex, Text Processing, Lexical Analysis?

I'm not quite sure what terminology to search for, so my title is funky... Here is the workflow I've got:

  1. Semi-structured documents are scanned to file. The files are OCR'd to text.
  2. The text is parsed into Python objects.
  3. The objects are serialized (to SQL, JSON, whatever) for use.
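
For steps 2 and 3, a minimal sketch of what the parsed objects and their serialization could look like. The class names `Question` and `Choice` and the helper `to_json` are hypothetical, chosen only for illustration; the actual object model in the project is not shown in this post.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Choice:
    label: str  # hypothetical field: the choice letter, e.g. "A"
    text: str


@dataclass
class Question:
    number: int
    text: str
    choices: List[Choice] = field(default_factory=list)


def to_json(questions: List[Question]) -> str:
    """Serialize parsed questions to JSON; asdict() recurses into nested dataclasses."""
    return json.dumps([asdict(q) for q in questions], indent=2)


example = Question(
    number=1,
    text="Question Text... continued until now.",
    choices=[Choice("A", "Choice text..."), Choice("B", "Another Choice...")],
)
serialized = to_json([example])
```

Keeping the objects as plain dataclasses means the same structure could be written to SQL or any other store without changing the parsing step.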

The documents are structured like this:

HEADER blah blah, Page ###

blah

Garbage text...

1. Question Text...

continued until now. A. Choice text...

adsadsf. B. Another Choice...

2. Another Question...

I need to extract the questions and choices. The problem is that, because the text is OCR output, there are occasional strange substitutions like '2' -> 'Z', which make ordinary regular expressions useless. I've tried the Levenshtein module and it helps, but it requires knowing in advance what edit distance to expect.
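
One possible way around choosing an absolute edit distance is to compare a length-normalized similarity ratio against a cutoff instead. The sketch below swaps in the standard-library `difflib` rather than the Levenshtein module; the helper name `closest_term`, the sample vocabulary, and the 0.8 cutoff are all assumptions for illustration, and the cutoff would need tuning on real OCR output.

```python
from difflib import get_close_matches
from typing import List, Optional


def closest_term(token: str, vocabulary: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the vocabulary entry most similar to token, or None.

    The cutoff is a similarity ratio between 0 and 1, so it scales with
    word length rather than being a fixed number of edits.
    """
    matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical domain vocabulary; in practice it would come from the subject area.
vocabulary = ["Question", "Choice"]
corrected = closest_term("Questlon", vocabulary)
unknown = closest_term("xyz", vocabulary)
```

Because the vocabulary is supplied by the caller, this does not depend on a general-purpose spelling dictionary.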

I don't know whether I'm looking to create a parser, a lexer, or something else. This has led me down all kinds of interesting but irrelevant paths. Guidance would be greatly appreciated. Oh, also, the text is generally from specific technical domains, so general spelling tools are not so helpful.

Regarding the structure of the documents, there is no clear visual pattern -- like line breaks or indentation -- with the exception of the fact that "questions" usually begin a line. Crap on the document can cause characters to appear before the actual beginning of the line, which means that something along the lines of r'^[0-9]+' does not reliably work.

Though the "questions" always begin with an int, a period and a space, the OCR can substitute other characters or skip characters. This is not so much a problem with Tesseract or CuneiForm as with the poor quality of the paper documents.
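
To make the constraints above concrete, here is a rough sketch of a more tolerant matcher. It allows a few stray characters before the number, accepts letters that OCR tends to confuse with digits, and only accepts a candidate when its number is the next one expected in sequence. The look-alike table, the three-character garbage allowance, and the names `QUESTION_START`, `normalize_number`, and `find_questions` are all assumptions for illustration. It does not handle a dropped period or a skipped question number.

```python
import re

# Hypothetical table of letters that OCR may produce in place of digits.
OCR_DIGIT_LOOKALIKES = {"O": "0", "o": "0", "I": "1", "l": "1", "Z": "2", "S": "5"}

QUESTION_START = re.compile(
    r"""^
    .{0,3}?                      # up to three stray characters before the number
    (?P<num>[0-9OoIlZS]{1,3})    # digits, or letters that look like digits
    \s?[.,:;]                    # the period, possibly misread as other punctuation
    \s+
    (?P<text>\S.*)$              # the question text itself
    """,
    re.VERBOSE,
)


def normalize_number(raw):
    """Map look-alike letters back to digits; return None if that fails."""
    digits = "".join(OCR_DIGIT_LOOKALIKES.get(ch, ch) for ch in raw)
    return int(digits) if digits.isdigit() else None


def find_questions(lines, first_number=1):
    """Collect (number, text) pairs, accepting only the next expected number."""
    expected = first_number
    found = []
    for line in lines:
        match = QUESTION_START.match(line)
        if match is None:
            continue
        if normalize_number(match.group("num")) == expected:
            found.append((expected, match.group("text")))
            expected += 1
    return found


sample = [
    "HEADER blah blah, Page 12",
    "~ 1. Question Text...",
    "continued until now. A. Choice text...",
    "adsadsf. B. Another Choice...",
    "Z. Another Question...",
]
questions = find_questions(sample)
```

The sequence check is what keeps a line like "Z. Another Question..." from being ambiguous: it is read as question 2 only because question 1 has already been seen.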


Note: for the project in question, it was decided that having a human prep the OCR'd text was better than spending the time coding a solution. I'd still love good pointers, however.

Upvotes: 3

Views: 409

Answers (1)

Josep Valls

Reputation: 5560

From what I understand from your description, you may be trying to build a parser. Given the vague requirements and examples provided, I would suggest you start by taking a look at nltk.org. Another alternative might be gate.ac.uk.

Upvotes: 1
