Convert an image into a data frame in R

Question

I want to convert an image into a data frame. I am able to read the image with OCR, but the result is a single character object.

library(tesseract)
eng1 <- tesseract("eng")
text1 <- tesseract::ocr("image.png", engine = eng1)
cat(text1)

class(text1) #character

head(text1)
[1] "* teamiD ~ yearlD ~~ NumsP_ TGS oplayerID Gx GS Wx > Lx ERAX

1 ANA 1997 11162 finlecnot 12 23 ca 8 452
2 ANA 1997 11162 langsma0t 162 2 ae 8 452
3 ANA 1997 11162 perismant 162 8 ae 8 452
4 ANA 1997 11162 cieksja01 162 32 ae 8 452
5 ANA 1997 11162 hasegshot 162 7 ae 8 452
6 ANA 1997 11162 sprindeot 162 8 ae 8 452
7 ANA 1997 11162, mayda02 162 2 ae 8 452
8 ANA 1997 1162 ilkeot 12 2 ae 8 452
9 ANA 1997 11162 grosskeot 162 3 ae 8 452
10 ANA 1997 11162 gubiemao1 162 2 ae 8 452
11 ANA 1997 11162 watsoalot 12 Be ae 8 452
12 ANA 1998 2162 sparksto 162 2 8 7 449
13, ANA 1998 9 162 cicksja0t 162 18 8 7 449
14. ANA 1998 9162 alivaomot 12 26 85 7 449
15. ANA 1998 9162 finlecnot 12 Be 85 7 449
16 ANA 1998 9162 washbja0t 162 1\" 8 7 449
47 ANA 1998 9162 watsoalot 12 4 85 7 449
18 ANA 1998 9162 medowja0t 162 14 8 7 449
"

I need the data shown in this image as a data frame.

Chris · Accepted Answer

Your OCR output is not a faithful representation of your image, so you risk garbage in, garbage out. Still, text1 can be cleaned step by step until read.table can parse it into a data.frame:

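# Strip all punctuation (stray *, ~, >, commas, periods, quotes)
text1_a <- gsub('[[:punct:]]', '', text1)
# Re-insert the space the OCR dropped between NumSP and TGS (e.g. "11162" -> "11 162")
text1_b <- gsub('(11)', '\\1 ', text1_a)
text1_c <- gsub('(91{1})', '\\1 ', text1_b)
text1_d <- gsub('(21{1})', '\\1 ', text1_c)
read.table(text=text1_d, col.names=c('teamID', 'yearID', 'NumSP', 'TGS','playerID', 'G.x', 'GS', 'W.x', 'L.x', 'ERA.x'))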
   teamID yearID NumSP TGS  playerID G.x GS W.x L.x ERA.x
1     ANA   1997    11 162 finlecnot  12 23  ca   8   452
2     ANA   1997    11 162 langsma0t 162  2  ae   8   452
3     ANA   1997    11 162 perismant 162  8  ae   8   452
4     ANA   1997    11 162 cieksja01 162 32  ae   8   452
5     ANA   1997    11 162 hasegshot 162  7  ae   8   452
6     ANA   1997    11 162 sprindeot 162  8  ae   8   452
7     ANA   1997    11 162   mayda02 162  2  ae   8   452
8     ANA   1997    11  62    ilkeot  12  2  ae   8   452
9     ANA   1997    11 162 grosskeot 162  3  ae   8   452
10    ANA   1997    11 162 gubiemao1 162  2  ae   8   452
11    ANA   1997    11 162 watsoalot  12 Be  ae   8   452
12    ANA   1998    21  62  sparksto 162  2   8   7   449
13    ANA   1998     9 162 cicksja0t 162 18   8   7   449
14    ANA   1998    91  62 alivaomot  12 26  85   7   449
15    ANA   1998    91  62 finlecnot  12 Be  85   7   449
16    ANA   1998    91  62 washbja0t 162  1   8   7   449
47    ANA   1998    91  62 watsoalot  12  4  85   7   449
18    ANA   1998    91  62 medowja0t 162 14   8   7   449

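The main challenge here is to arrive at a consistent number of columns in every row. Much of the OCR garbage remains, and the substitutions add a little more (for example, TGS becomes 62 in rows where the split landed in the wrong place), but the result is a data.frame.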
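One way to see which rows would break read.table is to count the fields on each line before parsing. The following is a minimal base R sketch, not part of the answer's original code: the three sample lines and the names ocr_lines, n_fields, expected and bad_rows are made up for illustration, and the last line has a field dropped on purpose.

```r
# Hypothetical sample: three cleaned OCR lines; the last one is missing a field
ocr_lines <- c(
  "1 ANA 1997 11 162 finlecnot 12 23 ca 8 452",
  "2 ANA 1997 11 162 langsma0t 162 2 ae 8 452",
  "12 ANA 1998 9 162 sparksto 162 2 8 7"
)

# count.fields() reports the number of whitespace-separated fields per line
con <- textConnection(ocr_lines)
n_fields <- count.fields(con)
close(con)

# Assume the most common field count is the correct width and flag the rest
expected <- as.integer(names(which.max(table(n_fields))))
bad_rows <- which(n_fields != expected)

print(n_fields)
print(bad_rows)
```

Lines listed in bad_rows are the ones to fix, by hand or with a targeted gsub, before calling read.table.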

Using only the settings described in "weird data" to clean the image with magick first, I get the following improvement in the OCR. Here I just downloaded the image embedded in your question:

library(magick)     # image_read(), image_resize(), ...; also provides %>%
library(tesseract)

team_img <- image_read("team_data_WrCDj.png")

team_mgk <- team_img %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr()
cat(team_mgk)
~ teamID yearlD NumSP TGS playerID G.x GS W.x Lx ERA.x

1 ANA 1997 11 162 finlechO1 162 25 eB 78 4.52
2 ANA 1997 11 162 langsma01 162 9 fond 78 4.52
3 ANA 1997 11 162 perisma0t 162 8 & 78 4.52
4 ANA 1997 11 162 dicksja01 162 32 & 78 4.52
5 ANA 1997 1 162 hasegsh01 162 7 & 78 4.52
6 ANA 1997 11 162 sprinded1 162 28 & 78 4.52
7 ANA 1997 11 162 maydad2 162 2 & 78 4.52
8 ANA 1997 1 162 hillkeO1 162 12 & 78 4.52
9 ANA 1997 11 162 grosske01 162 3 & 78 4.52
10 ANA 1997 11 162 gubicmad1 162 2 4 78 4.52
11. ANA 1997 11 162 watsoal01 162 is & 78 4.52
12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49
13, ANA 1998 9 162 dicksja01 162 18 65 7 4.49
14 ANA 1998 9 162 olivaom01 162 26 65 7 4.49
15 ANA 1998 9 162 finlechO1 162 34 65 7 4.49
16 ANA 1998 9 162 washbja01 162 11 65 7 4.49
17 ANA 1998 9 162 watsoal01 162 14 65 7 4.49
18 ANA 1998 9 162 mcdowja01 162 14 65 77 4.49

So there are still some problems in the GS, W.x and L.x columns, but things are improving. I think you'll become much more of an expert at magick through this process, and with a good enough image none of the regex cleanup above is necessary.
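To locate the cells that still need attention, one option is to read everything as character and then coerce the columns that should be numeric, so that OCR noise shows up as NA. This is only a sketch under assumptions: the three sample rows are copied from the output above, and the names ocr_text, cols, rowID, num_cols and needs_review are introduced here for illustration. It catches non-numeric garbage such as "eB", but not a plausible-looking wrong digit (the W.x value 4 in row 10 passes unnoticed).

```r
# Hypothetical sample: three rows copied from the improved OCR output
ocr_text <- "1 ANA 1997 11 162 finlechO1 162 25 eB 78 4.52
10 ANA 1997 11 162 gubicmad1 162 2 4 78 4.52
12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49"

cols <- c("rowID", "teamID", "yearID", "NumSP", "TGS", "playerID",
          "G.x", "GS", "W.x", "L.x", "ERA.x")

# Read every column as character so OCR noise cannot change column types
df <- read.table(text = ocr_text, col.names = cols,
                 colClasses = "character", stringsAsFactors = FALSE)

# Coerce the columns that should be numeric; non-numeric cells become NA
num_cols <- c("yearID", "NumSP", "TGS", "G.x", "GS", "W.x", "L.x", "ERA.x")
df[num_cols] <- lapply(df[num_cols],
                       function(x) suppressWarnings(as.numeric(x)))

# Rows with at least one NA need a manual check against the image
needs_review <- df$rowID[!complete.cases(df[num_cols])]
print(needs_review)
```

The rows named in needs_review can then be compared with the original image and corrected by hand.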
