Reputation: 170
I want to convert an image into a data frame. I am able to read the image and get a character object.
library(tesseract)
eng1 <- tesseract("eng")
text1 <- tesseract::ocr("image.png", engine = eng1)
cat(text1)
class(text1) #character
head(text1)
[1] "* teamiD ~ yearlD ~~ NumsP_ TGS oplayerID Gx GS Wx > Lx ERAX\n\n1 ANA 1997 11162 finlecnot 12 23 ca 8 452\n2 ANA 1997 11162 langsma0t 162 2 ae 8 452\n3 ANA 1997 11162 perismant 162 8 ae 8 452\n4 ANA 1997 11162 cieksja01 162 32 ae 8 452\n5 ANA 1997 11162 hasegshot 162 7 ae 8 452\n6 ANA 1997 11162 sprindeot 162 8 ae 8 452\n7 ANA 1997 11162, mayda02 162 2 ae 8 452\n8 ANA 1997 1162 ilkeot 12 2 ae 8 452\n9 ANA 1997 11162 grosskeot 162 3 ae 8 452\n10 ANA 1997 11162 gubiemao1 162 2 ae 8 452\n11 ANA 1997 11162 watsoalot 12 Be ae 8 452\n12 ANA 1998 2162 sparksto 162 2 8 7 449\n13, ANA 1998 9 162 cicksja0t 162 18 8 7 449\n14. ANA 1998 9162 alivaomot 12 26 85 7 449\n15. ANA 1998 9162 finlecnot 12 Be 85 7 449\n16 ANA 1998 9162 washbja0t 162 1\" 8 7 449\n47 ANA 1998 9162 watsoalot 12 4 85 7 449\n18 ANA 1998 9162 medowja0t 162 14 8 7 449\n"
I need the data shown in this image as a data frame.
Upvotes: 0
Views: 865
Reputation: 2296
Your ocr is not a faithful representation of your image, and you risk garbage in/garbage out, but text1 can be cleaned step by step until it fits in a data.frame:
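# Drop the punctuation the OCR introduced (commas, quotes, tildes, ...)
text1_a <- gsub('[[:punct:]]', '', text1)
# Insert a space after "11", "91" and "21" to split the fused NumSP/TGS values
text1_b <- gsub('(11)', '\\1 ', text1_a)
text1_c <- gsub('(91{1})', '\\1 ', text1_b)
text1_d <- gsub('(21{1})', '\\1 ', text1_c)
# The OCR header row has one field fewer than the data rows, so read.table() reads it as a header; col.names then supplies clean names
read.table(text=text1_d, col.names=c('teamID', 'yearID', 'NumSP', 'TGS','playerID', 'G.x', 'GS', 'W.x', 'L.x', 'ERA.x'))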
teamID yearID NumSP TGS playerID G.x GS W.x L.x ERA.x
1 ANA 1997 11 162 finlecnot 12 23 ca 8 452
2 ANA 1997 11 162 langsma0t 162 2 ae 8 452
3 ANA 1997 11 162 perismant 162 8 ae 8 452
4 ANA 1997 11 162 cieksja01 162 32 ae 8 452
5 ANA 1997 11 162 hasegshot 162 7 ae 8 452
6 ANA 1997 11 162 sprindeot 162 8 ae 8 452
7 ANA 1997 11 162 mayda02 162 2 ae 8 452
8 ANA 1997 11 62 ilkeot 12 2 ae 8 452
9 ANA 1997 11 162 grosskeot 162 3 ae 8 452
10 ANA 1997 11 162 gubiemao1 162 2 ae 8 452
11 ANA 1997 11 162 watsoalot 12 Be ae 8 452
12 ANA 1998 21 62 sparksto 162 2 8 7 449
13 ANA 1998 9 162 cicksja0t 162 18 8 7 449
14 ANA 1998 91 62 alivaomot 12 26 85 7 449
15 ANA 1998 91 62 finlecnot 12 Be 85 7 449
16 ANA 1998 91 62 washbja0t 162 1 8 7 449
47 ANA 1998 91 62 watsoalot 12 4 85 7 449
18 ANA 1998 91 62 medowja0t 162 14 8 7 449
The main challenge here is to arrive at a consistent number of columns per row. Much of the ocr garbage remains, with a little more added in (for example, 9162 is split into 91 and 62 rather than 9 and 162), but the result is a data.frame.
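To see which rows break the column count before calling read.table, something like the following sketch may help. It uses two rows copied from the raw OCR output in the question; the names raw_rows, n_fields and bad_rows are only illustrative, and the expected count of 11 assumes one row label plus the ten named columns.

```r
# Two rows copied from the raw OCR output in the question
raw_rows <- c(
  "12 ANA 1998 2162 sparksto 162 2 8 7 449",
  "13, ANA 1998 9 162 cicksja0t 162 18 8 7 449"
)

# Count the whitespace-separated fields in each row
n_fields <- lengths(strsplit(trimws(raw_rows), "[[:space:]]+"))

# Assumption: a well-formed row has 11 fields (row label + 10 columns)
bad_rows <- which(n_fields != 11)

print(n_fields)
print(bad_rows)
```

Rows flagged this way are the ones where two values were fused (here 2162), so they are the rows a regex fix has to target.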
Just using the settings described in weird data, I get the following improvements in ocr after image cleaning with magick, here working from a download of your embedded image above:
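library(magick)    # image_read(), image_resize(), image_convert(), ...
library(magrittr)  # %>% pipe
team_img <- image_read("team_data_WrCDj.png")
# Upscale, convert to grayscale, trim the border, then OCR the cleaned PNG
team_mgk <- team_img %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr()
cat(team_mgk)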
~ teamID yearlD NumSP TGS playerID G.x GS W.x Lx ERA.x
1 ANA 1997 11 162 finlechO1 162 25 eB 78 4.52
2 ANA 1997 11 162 langsma01 162 9 fond 78 4.52
3 ANA 1997 11 162 perisma0t 162 8 & 78 4.52
4 ANA 1997 11 162 dicksja01 162 32 & 78 4.52
5 ANA 1997 1 162 hasegsh01 162 7 & 78 4.52
6 ANA 1997 11 162 sprinded1 162 28 & 78 4.52
7 ANA 1997 11 162 maydad2 162 2 & 78 4.52
8 ANA 1997 1 162 hillkeO1 162 12 & 78 4.52
9 ANA 1997 11 162 grosske01 162 3 & 78 4.52
10 ANA 1997 11 162 gubicmad1 162 2 4 78 4.52
11. ANA 1997 11 162 watsoal01 162 is & 78 4.52
12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49
13, ANA 1998 9 162 dicksja01 162 18 65 7 4.49
14 ANA 1998 9 162 olivaom01 162 26 65 7 4.49
15 ANA 1998 9 162 finlechO1 162 34 65 7 4.49
16 ANA 1998 9 162 washbja01 162 11 65 7 4.49
17 ANA 1998 9 162 watsoal01 162 14 65 7 4.49
18 ANA 1998 9 162 mcdowja01 162 14 65 77 4.49
So, there are still some GS, W.x, L.x problems, but things are improving. I think you'll become much more of an expert at magick through this process. And with a better image, none of the above regex approach is necessary.
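As a rough illustration of that last point, rows that already have a consistent field count can go straight into read.table. This is only a sketch: the two rows are copied from the improved output above, the names clean_rows, cols and team_df are illustrative, and the values themselves would still need checking against the image.

```r
# Two rows copied from the improved OCR output above (11 fields each)
clean_rows <- c(
  "12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49",
  "14 ANA 1998 9 162 olivaom01 162 26 65 7 4.49"
)

cols <- c("teamID", "yearID", "NumSP", "TGS", "playerID",
          "G.x", "GS", "W.x", "L.x", "ERA.x")

# The first field is the row label, so name it and use it as row names
team_df <- read.table(text = clean_rows, header = FALSE,
                      col.names = c("row", cols), row.names = "row",
                      stringsAsFactors = FALSE)

str(team_df)
```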
Upvotes: 1