Reputation: 11
I'm trying to automatically detect a number on a black-on-white label in a photograph and rename the image file with that number. I'm struggling with three things: optimizing the image for OCR; isolating the 5-digit number I want from the rest of the text on the label (there is only one 5-digit number in large text, but it may be repeated several times, and there are other numbers on the label); and auto-detecting the orientation of the image. I plan to run a for loop over all images in a folder, and they are not all oriented the same way. The background colors will vary from the example provided, but the background will never be black on white. The label is not always in the same place. I haven't gotten to the code for renaming the file yet; bonus points if you do.
```
library(tesseract)
library(magick)
library(stringr)

white <- tesseract(options = list(editor_image_word_bb_color = 'white',
                                  stopper_smallword_size = 5,
                                  tessedit_char_whitelist = "0123456789abcde"))

label <- magick::image_read("file.jpg") %>%
  magick::image_convert('tiff') %>%
  magick::image_resize("2000x") %>%
  magick::image_rotate(180) %>%
  magick::image_convert(type = 'Grayscale') %>%
  magick::image_level(black_point = 80,
                      white_point = 81) %>%
  magick::image_reducenoise() %>%
  magick::image_trim(fuzz = 65) %>%
  tesseract::ocr(engine = white)
label
```
The code above doesn't reliably detect the 5-digit number, and I have to change the image_rotate value by hand to find an orientation that gives usable output (usable meaning I can see the 5 digits in the correct order, not that the output is perfect). Because the label is not always in the same place, auto-cropping the image to remove noise doesn't help. I would then pipe the result into a string match similar to the one below, but that needs work depending on the output of the code above.
`str_extract("\\b\\d{5}\\b")`
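One way to avoid hand-tuning `image_rotate` might be to OCR each of the four right-angle rotations and keep whichever gives the most confident 5-digit word. The sketch below is only that, a sketch: it assumes the photos are rotated by multiples of 90 degrees, and that the `confidence` column returned by `tesseract::ocr_data()` is a usable proxy for "this is the right way up". `best_five_digit_word()` and `ocr_best_rotation()` are hypothetical helper names, not part of magick or tesseract.

```r
# Pick the most confident word that is exactly five digits.
# Returns NULL when no word qualifies.
best_five_digit_word <- function(words, confidence) {
  is_hit <- grepl("^[0-9]{5}$", words)
  if (!any(is_hit)) return(NULL)
  i <- which(is_hit)[which.max(confidence[is_hit])]
  list(number = words[i], confidence = confidence[i])
}

# Try each candidate rotation and keep the one whose best 5-digit word
# has the highest OCR confidence. Assumes the magick and tesseract
# packages are installed; they are only touched when this is called.
ocr_best_rotation <- function(path, angles = c(0, 90, 180, 270)) {
  img <- magick::image_convert(magick::image_read(path), type = "Grayscale")
  best <- list(angle = NA_real_, number = NA_character_, confidence = -Inf)
  for (angle in angles) {
    res <- tesseract::ocr_data(magick::image_rotate(img, angle))
    hit <- best_five_digit_word(res$word, res$confidence)
    if (!is.null(hit) && hit$confidence > best$confidence) {
      best <- c(list(angle = angle), hit)
    }
  }
  best
}
```

Calling `ocr_best_rotation("file.jpg")` would return a list with `angle`, `number` and `confidence`; `number` stays `NA` when no rotation produced a 5-digit word, which is the case to report rather than rename.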
Upvotes: 0
Views: 224
Reputation: 11747
This is for Windows, but it should be very similar on Linux/Mac.
For deskewing, see https://stackoverflow.com/a/72701494/10802527
You can simply ask tesseract to output the results, filter them for numbers, and then use the number with REN. The issue is what to do when no number is found, so you need to report that too.
So we need to capture that output (I would usually use a second redirect to result.txt) and parse the result together with the filename variable. However, as you're using R, I will leave that secondary task to you.
tesseract.exe %filename.ext% temp -l eng --psm 11 & type temp.txt |findstr /R "[0-9][0-9][0-9][0-9][0-9]"
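For the R side of that rename step, something along these lines could work. It is a minimal base R sketch, not a tested pipeline: `rename_with_number()` and `rename_folder()` are hypothetical names, and `read_number` stands in for whatever function you end up using to turn an image path into the 5-digit string (or `NA` when nothing is found). It reports the files with no number instead of renaming them, and refuses to overwrite an existing target.

```r
# Rename one image to "<number>.<ext>" in the same folder.
# Returns TRUE on success, FALSE (with a message) otherwise.
rename_with_number <- function(path, number) {
  if (is.na(number)) {
    message("No 5-digit number found in ", path)
    return(FALSE)
  }
  ext <- tools::file_ext(path)
  new_name <- if (nzchar(ext)) paste0(number, ".", ext) else number
  target <- file.path(dirname(path), new_name)
  if (file.exists(target)) {
    message("Target already exists, skipping: ", target)
    return(FALSE)
  }
  file.rename(path, target)
}

# Apply the rename to every JPEG in a folder.
# `read_number` is a placeholder: function(path) -> character or NA.
rename_folder <- function(folder, read_number) {
  files <- list.files(folder, pattern = "\\.jpe?g$",
                      ignore.case = TRUE, full.names = TRUE)
  vapply(files, function(f) rename_with_number(f, read_number(f)),
         logical(1))
}
```

The named logical vector returned by `rename_folder()` doubles as the report: any `FALSE` entry is a file that still needs a manual look.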
Upvotes: 0
Reputation: 2213
I have been able to extract the numbers with the following code, which does not use Tesseract but instead uses the OCR built into Microsoft Word:
library(pdftools)
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_PDF <- "D:\\im.pdf"
path_JPG <- "D:\\im.jpg"
path_Word <- "D:\\im.docx"
pdf(path_PDF, height = 12, width = 8)
im <- image_read(path_JPG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
###############################################
#### Step 3 : Convert word document to pdf ####
###############################################
wordApp[["ActiveDocument"]]$SaveAs(path_PDF, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit wordApp
########################################
#### Step 4 : Extract text from PDF ####
########################################
strsplit(pdf_text(path_PDF), "\n")
[[1]]
[1] "7/434?."
[2] ""
[3] ""
[4] ""
[5] ""
[6] " Bay CI Block 2752 Rank 2752 21"
[7] " Block Middle Dock RETRIEVE"
[8] " DATE"
[9] " Deployed 6/21/2022"
[10] ""
[11] ""
[12] ""
[13] ""
[14] " 21996"
[15] " etrieved"
[16] ""
[17] ""
[18] ""
[19] " PVC 2199t"
Upvotes: 0
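To get from output like the above to a single number, one option is to collect every standalone 5-digit token and keep the most frequent one, since the question says the label number may be repeated. This is a base R sketch under that assumption; `extract_label_number()` is a hypothetical helper name, and the sample lines are copied from the output shown in this answer.

```r
# Return the most frequent standalone 5-digit token in the OCR lines,
# or NA when there is none. Ties resolve to the first token in sorted order.
extract_label_number <- function(lines) {
  hits <- unlist(regmatches(lines,
                            gregexpr("\\b\\d{5}\\b", lines, perl = TRUE)))
  if (length(hits) == 0) return(NA_character_)
  counts <- table(hits)
  names(counts)[which.max(counts)]
}

ocr_lines <- c("Bay CI Block 2752 Rank 2752 21",
               "Deployed 6/21/2022",
               "21996",
               "PVC 2199t")
extract_label_number(ocr_lines)
```

The 4-digit block numbers, the date, and the misread `2199t` are all ignored because none of them is exactly five digits between word boundaries.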
Reputation: 2213
I have been able to extract the 5-digit number with the following code using tesseract:
library(tesseract)
library(magick)
path_PDF <- "D:\\stackoverflow_im.pdf"
path_JPG <- "D:\\stackoverflow_im.jpg"
pdf(path_PDF, height = 24, width = 12)
im <- image_read(path_JPG)
plot(im)
dev.off()
im <- image_read_pdf(path_PDF)
strsplit(ocr(im), "\n")
[1] "r oS ."
[2] "Sod La . - a x"
[3] "ee A: ° : G"
[4] "= “ng - . ie]"
[5] "iad eee Oe : o"
[6] ". APi ; ."
[7] "; fe ig Sa —— = = * = ="
[8] "——— eee ."
[9] "ce A"
[10] "y - - ™ ~"
[11] ". ~~ . ae ee ee ‘\\"
[12] "J ec, —— . ss in ——_—— ———— Y"
[13] "y 7 ———_* , : ia: OO"
[14] ". y / , a Bian Kae Ri 2 : . . ~ :"
[15] "7 ee - . ce"
[16] "’ _ ~ rf"
[17] "} = 72 : ; : a Te ns —— aiaiaieae ee *s : t : eH"
[18] "= nk BLES 2 1"
[19] "= Bay/C] Block} 755 | *™ ay Mol ="
[20] ", | : 2752"
[21] "| \\ —f nd e : >f RIFE \\"
[22] "| = Block| Middle Dock RETRIEV"
[23] ". = 94/9022"
[24] "i) im a. | Cl 2752| 2199€"
[25] "ti | 21996 — ;"
[26] ". VN ="
[27] "me = QQ"
[28] "—v = eetrieved ‘: 4"
[29] "; a 5 BRetrievec CI 2752"
Upvotes: 0