jon

Reputation: 368

OCR with tesseract in R Fails to Recognize all Line Breaks

I'm trying to convert many PDF documents into text in R so that I can use string parsing and regex to extract a set of codes from them. I am using ocr() from the tesseract package; it works on many of the pages, but it misses a lot of the information I need.

I traced the problem to inconsistent line breaks in the image/PDF; the linked example image shows this.

I am trying to get the codes from the left column. The only codes that I'm able to extract successfully are the ones where the description is longer than a single line.

I've experimented with various pre-processing techniques using magick but have come up short in most cases. The only instance where I was able to get the code set was cropping the right-hand side out of the image, but unfortunately this is not an efficient solution in my case.

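library(magrittr)  # provides the %>% pipe used below

file <- magick::image_read("44F245A2-5FEE-408F-A197-756436A5CAFD.png")

file %>%
  magick::image_resize("2000x") %>%              # scale to 2000 px wide
  magick::image_convert(type = 'Grayscale') %>%  # drop colour before OCR
  tesseract::ocr() %>%
  cat()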

# or
# descriptions in this document.
# 94942C This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# | terpenes Steet gine see
# 272144 This is a description that takes on multiple lines. It can contain any combination of
# eee
# length of the description could be anywhere from 1 line to 5 lines of text.
# E76744 This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# [terpenes Steet gine see
# K77744 This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# | terrane een Steet gine seem
# 172744 This is a description that takes on multiple lines. It can contain any combination of
# Se
# length of the description could be anywhere from 1 line to 5 lines of text.
# A71744 This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# | teammates Steet gine see

Ideally I would like to be able to get all of the codes from the image in the above link. Any help would be awesome.

Upvotes: -1

Views: 1181

Answers (1)

victormeriqui

Reputation: 149

Try a different page segmentation mode (PSM). The available modes are:

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

Try PSM 4 for your case. In my experience PSM 12 recovers the most text, but not necessarily in reading order, which could be a problem if you need to match the codes with their descriptions.
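
In R, the mode can probably be set through the engine options of the tesseract package. Below is a minimal sketch, not a tested solution for your document. It assumes the engine accepts the Tesseract parameter tessedit_pageseg_mode through options, and that the codes are always six uppercase letters or digits at the start of a line (guessed from your sample output). ocr_with_psm and extract_codes are hypothetical helper names, not part of any package.

```r
# Hypothetical helper: run OCR with an explicit page segmentation mode.
# Assumes `tessedit_pageseg_mode` can be passed via the engine options.
ocr_with_psm <- function(image, psm = 4) {
  stopifnot(length(psm) == 1, psm %in% 0:13)  # valid PSM values are 0..13
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_pageseg_mode = psm)
  )
  tesseract::ocr(image, engine = engine)
}

# Hypothetical helper: pull candidate codes out of the OCR text.
# Assumption: a code is exactly six uppercase letters/digits at the start
# of a line, followed by whitespace or the end of the line.
extract_codes <- function(text) {
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  hits <- regmatches(
    lines,
    regexpr("^[A-Z0-9]{6}(?=\\s|$)", lines, perl = TRUE)
  )
  unique(hits)
}

# Usage: only runs when the package and the image are actually available.
path <- "44F245A2-5FEE-408F-A197-756436A5CAFD.png"
if (requireNamespace("tesseract", quietly = TRUE) && file.exists(path)) {
  print(extract_codes(ocr_with_psm(path, psm = 4)))
}
```

If PSM 4 still drops short rows, it may be worth looping over a few modes (4, 6, 11, 12) and comparing which one returns the most codes. With the sparse modes the codes may come back out of order, so the line-start pattern might need loosening.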

Upvotes: -1
