Reputation: 11

Use Tesseract in R to detect black writing on white label from photo with multicolored background

I'm trying to automatically detect a number on a black-on-white label within a photograph and rename the image with that number. I'm struggling with three things: optimizing the image for OCR; isolating the 5-digit number I want from the rest of the text on the label (there is only one 5-digit number in large text, but it may be repeated multiple times, and there are other numbers on the label); and auto-detecting the orientation of the image. I plan to run a for loop to detect the label in all images within a folder, and they are not all oriented the same way. The background colors will vary from the example provided, but will never be black on white. The label is not always in the same place. I haven't gotten to the code for renaming the file yet; bonus points if you do.

`
library(tesseract)
library(magick)
library(stringr)

# OCR engine restricted to a small character whitelist
white <- tesseract(options = list(editor_image_word_bb_color = 'white',
                                  stopper_smallword_size = 5,
                                  tessedit_char_whitelist = "0123456789abcde"))

label <- magick::image_read("file.jpg") %>%
  magick::image_convert('tiff') %>%
  magick::image_resize("2000x") %>%                       # scale to 2000 px wide
  magick::image_rotate(180) %>%                           # currently adjusted by hand per image
  magick::image_convert(type = 'Grayscale') %>%
  magick::image_level(black_point = 80,
                      white_point = 81) %>%               # near-binary threshold
  magick::image_reducenoise() %>%
  magick::image_trim(fuzz = 65) %>%                       # trim uniform borders
  tesseract::ocr(engine = white)

label
`

The above code doesn't reliably detect the 5-digit number, and I have to manually change the image_rotate value to find an orientation that gives a usable output (usable meaning that I can see the 5 digits in the correct order, not that the output is perfect). The label is not always in the same place, so autocropping the image to remove noise doesn't help. I would then pipe the result into string extraction, similar to below, but that needs work depending on the output of the above.

`str_extract(label, "\\b\\d{5}\\b")`
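A minimal sketch of how the two open problems in the question (picking out the 5-digit number and finding a workable orientation) might be combined. The function names `extract_label_number` and `find_number_any_rotation` are hypothetical, and the sketch assumes the label is rotated by a multiple of 90 degrees and that the wanted number is the most frequent standalone 5-digit token in the OCR text. It stops at the first rotation that yields a match, which is not guaranteed to be the correct one.

```r
# Return the most frequent standalone 5-digit number in OCR text,
# or NA if there is none. `text` may be a character vector of lines.
extract_label_number <- function(text) {
  hits <- unlist(regmatches(text, gregexpr("\\b\\d{5}\\b", text, perl = TRUE)))
  if (length(hits) == 0) return(NA_character_)
  counts <- table(hits)
  names(counts)[which.max(counts)]
}

# Try the four right-angle rotations and stop at the first one whose OCR
# text contains a 5-digit number. `rotate_fun` and `ocr_fun` default to the
# magick and tesseract functions used in the question; both can be swapped
# out, for example to pass a custom engine or a preprocessing pipeline.
find_number_any_rotation <- function(image,
                                     rotate_fun = magick::image_rotate,
                                     ocr_fun = function(img) tesseract::ocr(img)) {
  for (angle in c(0, 90, 180, 270)) {
    text <- ocr_fun(rotate_fun(image, angle))
    number <- extract_label_number(text)
    if (!is.na(number)) return(list(number = number, angle = angle))
  }
  list(number = NA_character_, angle = NA_real_)
}

# Hypothetical usage with the engine defined in the question:
# img <- magick::image_read("file.jpg")
# find_number_any_rotation(img, ocr_fun = function(i) tesseract::ocr(i, engine = white))
```

Passing the OCR step as a function keeps the rotation loop independent of the preprocessing, so the grayscale and thresholding steps can be tuned separately.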

Upvotes: 0

Answers (3)

K J

Reputation: 11747

This is for Windows, but it should be very similar on Linux/Mac.

For deskewing, see https://stackoverflow.com/a/72701494/10802527

You can simply ask Tesseract to output the results, filter them for numbers, and then use the number for REN. The remaining issue is what to do if there is no number, so you need to report that too.

So we need to capture that output (I would usually use a second redirect to result.txt) and parse the result again with the filename variable. However, as you're using R, I will leave that secondary task to you.

tesseract.exe %filename.ext% temp -l eng --psm 11 & type temp.txt |findstr /R "[0-9][0-9][0-9][0-9][0-9]"
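For the R side of this, a rough sketch of the capture, filter, rename, and report steps might look like the following. The function name `rename_by_label` and its arguments are hypothetical; it assumes JPEG files, takes the first standalone 5-digit match per image, and skips the rename when the target name already exists rather than overwriting.

```r
# For every JPEG in `folder`: run OCR, look for a 5-digit number, rename the
# file to "<number>.<ext>", and report files where nothing was found.
# `ocr_fun` takes a file path and returns the recognised text.
rename_by_label <- function(folder,
                            ocr_fun = function(path) tesseract::ocr(path),
                            pattern = "\\.jpe?g$") {
  files <- list.files(folder, pattern = pattern, full.names = TRUE,
                      ignore.case = TRUE)
  results <- data.frame(file = files,
                        number = rep(NA_character_, length(files)),
                        renamed = rep(FALSE, length(files)),
                        stringsAsFactors = FALSE)
  for (i in seq_along(files)) {
    text <- ocr_fun(files[i])
    hits <- unlist(regmatches(text, gregexpr("\\b\\d{5}\\b", text, perl = TRUE)))
    if (length(hits) == 0) next                # no number: left as NA in the report
    results$number[i] <- hits[1]
    target <- file.path(folder,
                        paste0(hits[1], ".", tools::file_ext(files[i])))
    if (!file.exists(target)) {                # never overwrite an existing file
      results$renamed[i] <- file.rename(files[i], target)
    }
  }
  results
}

# Hypothetical usage:
# report <- rename_by_label("path/to/photos")
# report[is.na(report$number), ]   # images that still need manual attention
```

Returning a data frame instead of printing gives a record of which images had no detectable number, which covers the reporting case mentioned above.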

Upvotes: 0

Emmanuel Hamel

Reputation: 2213

I was able to extract the numbers with the following code, which does not use Tesseract but instead uses the OCR in Microsoft Word:

library(pdftools)
library(RDCOMClient)
library(magick)

################################################
#### Step 1 : We convert the image to a PDF ####
################################################

path_PDF <- "D:\\im.pdf"
path_JPG <- "D:\\im.jpg"
path_Word <- "D:\\im.docx"

pdf(path_PDF, height = 12, width = 8)

im <- image_read(path_JPG)
plot(im)
dev.off()

####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE

doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)

doc$SaveAs2(path_Word)

###############################################
#### Step 3 : Convert word document to pdf ####
###############################################
wordApp[["ActiveDocument"]]$SaveAs(path_PDF, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit wordApp

########################################
#### Step 4 : Extract text from PDF ####
########################################
strsplit(pdf_text(path_PDF), "\n")
[[1]]
 [1] "7/434?."                                                                                 
 [2] ""                                                                                        
 [3] ""                                                                                        
 [4] ""                                                                                        
 [5] ""                                                                                        
 [6] "                 Bay CI                        Block 2752 Rank              2752      21"
 [7] "                           Block Middle Dock                     RETRIEVE"               
 [8] "                                                                 DATE"                   
 [9] "                Deployed        6/21/2022"                                               
[10] ""                                                                                        
[11] ""                                                                                        
[12] ""                                                                                        
[13] ""                                                                                        
[14] "                                                        21996"                           
[15] "                etrieved"                                                                
[16] ""                                                                                        
[17] ""                                                                                        
[18] ""                                                                                        
[19] "                PVC                                                                2199t"

Upvotes: 0

Emmanuel Hamel

Reputation: 2213

I was able to extract the 5-digit number with the following code using tesseract:

library(tesseract)
library(magick)
path_PDF <- "D:\\stackoverflow_im.pdf"
path_JPG <- "D:\\stackoverflow_im.jpg"

# Plot the image onto a PDF page
pdf(path_PDF, height = 24, width = 12)
im <- image_read(path_JPG)
plot(im)
dev.off()

# Read the PDF back as an image and run OCR on it
im <- image_read_pdf(path_PDF)
strsplit(ocr(im), "\n")

[1] "r oS ."                                                                                                                    
[2] "Sod La . - a x"                                                                                                            
[3] "ee A: ° : G"                                                                                                               
[4] "= “ng - . ie]"                                                                                                             
[5] "iad eee Oe : o"                                                                                                            
[6] ". APi ; ."                                                                                                                 
[7] "; fe ig Sa —— = = * = ="                                                                                                   
[8] "——— eee ."                                                                                                                 
[9] "ce A"                                                                                                                      
[10] "y - - ™ ~"                                                                                                                 
[11] ". ~~ . ae ee ee ‘\\"                                                                                                       
[12] "J ec, —— . ss in ——_—— ———— Y"                                                                                             
[13] "y 7 ———_* , : ia: OO"                                                                                                      
[14] ". y / , a Bian Kae Ri 2 : . . ~ :"                                                                                         
[15] "7 ee - . ce"                                                                                                               
[16] "’ _ ~ rf"                                                                                                                  
[17] "} = 72 : ; : a Te ns ——  aiaiaieae ee *s : t : eH"                                                                         
[18] "= nk BLES 2 1"                                                                                                             
[19] "= Bay/C] Block} 755 | *™ ay Mol ="                                                                                         
[20] ", | : 2752"                                                                                                                
[21] "| \\ —f nd e : >f RIFE \\"                                                                                                 
[22] "| = Block| Middle Dock RETRIEV"                                                                                            
[23] ". = 94/9022"                                                                                                               
[24] "i) im a. | Cl 2752| 2199€"                                                                                                 
[25] "ti | 21996 — ;"                                                                                                            
[26] ". VN ="                                                                                                                    
[27] "me = QQ"                                                                                                                   
[28] "—v = eetrieved ‘: 4"                                                                                                       
[29] "; a 5 BRetrievec CI 2752"

Upvotes: 0
