thekingmaker

Reputation: 23

Filling the gaps left in Chinese characters by line removal, for OCR

Image

Hello friends, I am having a hard time running OCR on the image above because of the gaps left by line removal. Could anyone kindly guide me on how to fill the gaps in the Chinese characters using ImageMagick?

Upvotes: 1

Views: 102

Answers (2)

Mark Setchell

Reputation: 207863

Cool question! There are many ways of approaching this but unfortunately I can't tell which ones work! So I'll give you some code and you can experiment by changing it around.

For the moment, I tried simply removing any rows of pixels that have white pixels in them, but you could look at the rows above and below, or do something else.

#!/bin/bash -xv

# Debug image: threshold at 80% so that only the near-white pixels stay white
convert chinese.gif -colorspace gray -threshold 80% DEBUG-white-lines.png

# Develop that idea: squash each row to a single pixel, threshold again,
# dump the result as text and collect the row numbers of the white rows in an array
wl=( $(convert chinese.gif -colorspace gray -threshold 80% -resize 1x\! -threshold 20% txt: | awk -F '[,:]' '/FFFFFF/{print $2}') )

# White lines are:
echo "${wl[@]}"

# Build an array of "chop" arguments to apply in one go, rather than applying one-at-a-time and saving/re-loading
# As we chop each line, the remaining lines move up, changing their offset by one line - UGHH. Apply a correction!
chop=()
correction=0
for line in "${wl[@]}" ; do
   ((y=line-correction))
   chop+=( -chop "0x1+0+$y" )
   ((correction=correction+1))
done
echo "${chop[@]}"

convert chinese.gif "${chop[@]}" result.png

Here's the image DEBUG-white-lines.png:

(image: DEBUG-white-lines.png)

The white lines are identified as:

44 74 134 164 194 254 284 314 374 404

The final command run is:

convert chinese.gif -chop 0x1+0+44 -chop 0x1+0+73 -chop 0x1+0+132 -chop 0x1+0+161 -chop 0x1+0+190 -chop 0x1+0+249 -chop 0x1+0+278 -chop 0x1+0+307 -chop 0x1+0+366 -chop 0x1+0+395 result.png

(image: result.png)
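To make the offset correction easier to check, here is a minimal sketch that only does the arithmetic and prints the resulting arguments, without calling ImageMagick. The function name build_chop_args is a hypothetical helper, not part of the script above, and it assumes the row numbers are passed in ascending order.

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not part of the original script):
# print the -chop arguments for row numbers given in ascending order.
build_chop_args() {
    local correction=0 y row
    local -a args=()
    for row in "$@"; do
        # Each row already chopped moves every row below it up by one,
        # so subtract the number of rows removed so far.
        y=$((row - correction))
        args+=(-chop "0x1+0+${y}")
        correction=$((correction + 1))
    done
    # Join the arguments with single spaces and print them on one line.
    printf '%s\n' "${args[*]}"
}

# The row numbers reported above; prints the same argument list as the
# final command shown above.
build_chop_args 44 74 134 164 194 254 284 314 374 404
```

Chopping from the bottom row upwards would avoid the correction altogether, since removing a row only shifts the rows below it.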

Upvotes: 1

Ghoul Fool

Reputation: 6967

If I understand this correctly, you want a way to remove the white lines and still get the image through OCR?

The best way would be to do it by eye and connect the dots, so to speak, so that the last pixels of the characters line up.

A programmatic way would be to remove the white line and then duplicate the line above (or below) and shift it into place.
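The programmatic idea could be sketched with ImageMagick roughly as follows. This is an untested sketch, not a verified recipe: the helper name build_fill_args and the output name filled.png are hypothetical, the row numbers are the ones reported in the other answer, and every gap is assumed to be a single row at y > 0. For each gap row y it clones the current image, crops out row y-1, and composites that strip back at row y.

```shell
#!/bin/bash
# Hypothetical helper: for each gap row y, print ImageMagick arguments that
# copy row y-1 over row y. The image height never changes, so no offset
# correction is needed here.
build_fill_args() {
    local y
    local -a args=()
    for y in "$@"; do
        args+=( '(' -clone 0 -crop "0x1+0+$((y - 1))" +repage ')'
                -geometry "+0+${y}" -composite )
    done
    printf '%s\n' "${args[*]}"
}

# Only run the conversion when ImageMagick and the input file are present.
if command -v convert >/dev/null 2>&1 && [ -f chinese.gif ]; then
    # Word splitting of the unquoted substitution is intentional: each
    # printed word must reach convert as a separate argument.
    convert chinese.gif $(build_fill_args 44 74 134 164 194 254 284 314 374 404) filled.png
fi
```

Copying the row below instead would mean cropping at y+1. Whether either choice helps the OCR would have to be checked against the actual image.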

(image: the OCR input with the gaps filled by hand)

康 家 月 而 视 , 喝 道
" 你 想 做 什 么 !"
秦 微 微 一 笑 , 轻 声 道
不 知 道 看 着 些 亲 死 眼 前 ,
前 辈 会 不 会 有 痛 的 感 觉 。"
说 , 伸 手 一 指 , 一 位 少 妇
身 形 一 顿 , 小 出 现 了 一 个 血 洞
倒 地 身 广 。
康 家 相 又 惊 又 , 痛 声 道

I don't read Chinese, but this is what the OCR output was machine-translated as:

Kang Jia month and watch, drink
"What do you want to do !"
Qin Weiwei smiled, softly
I don't know. look at some dead eyes. ,
Predecessors will not feel pain ."
And said, stretch out a finger , a young woman.
In The Shape of a meal, a small blood hole appeared
Down to the ground wide.
The Kang family was shocked and sore

Upvotes: 0
