Ami

Reputation: 319

Tesseract does not recognize some of the text in a PNG

I have to pull data from a PDF hosted at a URL. The PDF is a scanned image rather than text, so I convert the page to PNG and run OCR on it with the tesseract package. A few of the lines are not recognized, and some digits are dropped.

The code:

library(rvest)
library(dplyr)
library(stringr)   # provides str_replace() and fixed()
library(pdftools)
library(tesseract)

url <- "https://www.hindustancopper.com/Page/PriceCircular"
links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the first link in the list and take its href attribute
  html_nodes("#viewTable li:nth-child(1) a") %>%
  html_attr("href") %>%
  # strip the leading "../.." (matched literally, not as a regex)
  str_replace(fixed("../.."), "")
str(links)

# combine the base url with the relative link to the pdf
base_url <- "https://www.hindustancopper.com"
event_url <- paste0(base_url, links)
event_url

# the pdf is a scanned copy with no text layer, so render page 1
# to png with pdftools and run OCR on it with tesseract
pdf_convert(event_url,
            pages = 1,
            dpi = 850,
            filenames = "page1.png")
# what does the data look like
text <- ocr("page1.png")
cat(text)

The actual output reads the list of products and their prices as:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567 
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.

The expected output should be:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc

I have tried changing the value of the dpi argument several times, but that did not help much. Thanks in advance!

Upvotes: 1

Views: 532

Answers (1)

us2018

Reputation: 643

I am using Ubuntu 18.04 and tesseract 5.0.0-alpha-647-g4a00 for the commands below.

I downloaded one of the sample PDFs referred to in your code:

https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf

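Then I converted it to PNG with the command below. pdftoppm treats its second argument as an output root name, so passing report together with -singlefile writes the first page to report.png:

pdftoppm -png -singlefile 0-637189269505122500-AnnualReport.pdf report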

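Then, using GIMP, I rotated the document so that the text is level.

Then I ran tesseract on the image. Here --psm 6 tells it to treat the page as a single uniform block of text, and the whitelist restricts recognition to letters, digits, and a few punctuation characters: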

tesseract report.png stdout -l eng --oem 3 --psm 6 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:.-/ "

Here is the result:

HINDUSTAN COPPER LIMITED
A GOVT. OF INDIA ENTERPRISE
kK
Registered Head Office
Tamra Bhavan
1 Ashutosh Chowdhury Avenue
Kolkata - 700019
Ref: HCL/HO/MKTG/Cu-P/ 2019-2020
Date : 02-MAR-20
Sub: Basic Price of Cathodes and CC Rods for the month of MAR 2020.
The Basic Price of Copper Cathodes and CC Copper Rods for the month of MAR 2020 are as follows:
Basic Price Ex-Works /
Ex.Godown basis Rs. / MT
CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
COPPER CATHODE CUT 437856
CONTINUOUS CAST COPPER WIRE ROD 8 MM 440078
CONTINUOUS CAST COPPER WIRE ROD 19.6 MM 444546
CONTINUOUS CAST COPPER WIRE ROD 12.5 MM 441567
Note : Monthly LME CSP Avg. : 5686.45 Monthly Avg. Exchange Rate : 71.59
The price ruling on the date of delivery will be applicable. irrespective of the date of making financial arrangements i.e.
advance payment/opening of letter of credit. GST other statutory levies will be extra as applicable.
For purchase against usance Letter of Credit the interest rate chargeable shall be 10 per annum for the credit
period up to 90/60/30 days.
Customers may note that the price and interest rate is subject to change without prior notice. The price and interest rate
ruling on the date of delivery will be applicable irrespective of the date of their making financial arrangements. All bank
charges of negotiating bank will be borne by us.
ADD YAS
Zl Bl rTeri68
S Parashar
DGM Commercial
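
If only the product and price rows are needed, the OCR text can be filtered afterwards. The following is a minimal sketch, not part of the workflow above: extract_prices is a hypothetical helper name, and it assumes a price row is written entirely in capitals and ends in an integer of at least five digits, as in the result shown here.

```shell
#!/bin/sh
# Hypothetical helper: read OCR text on stdin and print "product<TAB>price"
# for every line that looks like a price row.
# Assumption: a price row starts with a capital letter, contains only
# capitals, digits, dots and spaces, and ends in an integer of 5+ digits.
extract_prices() {
  awk '
    /^[A-Z][A-Z0-9. ]* [0-9][0-9][0-9][0-9][0-9]+[ \t]*$/ {
      sub(/[ \t]+$/, "")          # drop trailing whitespace
      price = $NF                 # the last field is the price
      sub(/[ \t]+[0-9]+$/, "")    # what remains is the product name
      printf "%s\t%s\n", $0, price
    }
  '
}
```

It can be fed from the tesseract command above by appending | extract_prices to it. Lines such as the address or the note below the table are skipped because they contain lowercase letters or do not end in a long integer. A price shorter than expected in the output (for example five digits where the others have six) is a hint that a digit was dropped during recognition.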

Upvotes: 2
