Reputation: 41
I've made a Python script that takes a PDF of phrases and extracts them into an Anki deck. The script worked great with non-Semitic languages, but when someone asked me to make a similar deck in Arabic I ran into a problem. Arabic is written from right to left, but the sentences I extract come out written from left to right. There must be something extra the extraction phase needs in order to work with Semitic languages; I just don't know what it is.
Example:
The text that I got: sentence = "AR.(ةناشطع ♀) ناشطع نينكلو (ةعئاج تسل ♀) ،اعئاج تسل"
I used PyPDF2 to extract the text and tried arabic-reshaper 2.1.4 and python-bidi to fix this, but to no avail. I also tried reversing the string in various ways, but that also reverses punctuation such as "(". Any ideas?
Upvotes: 2
Views: 1295
Reputation: 1
import pdfplumber
file = 'sample_page.pdf'
pdf = pdfplumber.open(file)
page = pdf.pages[0]
text = page.extract_text(line_dir_render="ttb", char_dir_render="rtl")
print(text[:110])
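In pdfplumber versions that accept the line_dir_render and char_dir_render arguments, this should give correctly ordered text without having to reverse the string manually in code.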
Upvotes: 0
Reputation: 624
I've had some success extracting Arabic text from (born-digital) PDFs using pdfplumber. By "some success" I mean that it was a huge pain in the... neck, and didn't end up being accurate enough for my purposes. The pain part was that the extracted text came out backwards and had a space inserted next to every diacritic. Both problems were fixable; some code is below.
The accuracy problem was that I was using a PDF of an Arabic novel set in a decorative font where some of the letters are stacked on top of each other. pdfplumber was mostly able to extract which letters were there, but not their order. (Not surprising: this is tough for human students of Arabic as well.) If your source uses a plain font, you might get better results.
The text in the sample below should read: في رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق الخد الخشبي الضخم علي هيئة قبضة يد نصف مضمومة. الهدوء
import pdfplumber
file = 'sample_page.pdf'
pdf = pdfplumber.open(file)
page = pdf.pages[0]
text = page.extract_text()
print(text[:110])
output:
دّ لخا قوف ترّ قتسا يتلا ةيّ ساحنلا بابلا ةقّ دم تنيّ بت جهنلا سأر في لاإ مٌ يّ مخ ءودلها .ةمومضم فصن دي ةض
^ This is backwards, and there are spaces next to all the diacritics
# Reverse text with bidi
from bidi import algorithm
text_rev = algorithm.get_display(text)
print(text_rev[:110])
output:
يف رأس النهج تب ّينت مد ّقة الباب النحاس ّية التي استق ّرت فوق اخل ّد
اخلشب ّي الضخم عىل هيئة قبضة يد نصف مضم
^ Not backwards anymore, but still the diacritic problem
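A plain string reversal is tempting at this step, but it also flips paired punctuation such as parentheses. Here is a rough stdlib-only sketch of the difference, assuming the line is purely right-to-left. The helper names are made up for illustration, and python-bidi does considerably more than this (for example, a plain reversal would also reverse Latin words and numbers):

```python
# Hypothetical helpers for illustration only; this is not how python-bidi works internally.

# Map each bracket to its mirror image so pairs still open and close correctly
# after the character order is reversed.
MIRROR = str.maketrans("()[]{}<>«»", ")(][}{><»«")


def naive_reverse(line: str) -> str:
    """Reverse the characters; brackets end up facing the wrong way."""
    return line[::-1]


def mirrored_reverse(line: str) -> str:
    """Reverse the characters, then swap each bracket for its mirror."""
    return line[::-1].translate(MIRROR)


if __name__ == "__main__":
    visual = "ج (با)"  # a short line as it might come out of the PDF
    print(naive_reverse(visual))
    print(mirrored_reverse(visual))
```

This only holds up for lines with no left-to-right runs, so treat it as a way to see the problem rather than a replacement for the bidi step.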
# Strip the most common diacritic (shadda); in real use you would need to handle all of them
shadda = chr(0x0651)  # chr() in Python 3; unichr() only exists in Python 2
text_rev_dediac = text_rev.replace(" "+shadda, '')
print(text_rev_dediac[:110])
output:
يف رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق اخلد
اخلشبي الضخم عىل هيئة قبضة يد نصف مضمومة. اهلدوء
^ This is right, except where the stacked letters are in the wrong order: for example, the first word is supposed to be في (fy 'in') but instead it's يف (yf). You can see that the period (after the word مضمومة) is still in the correct place, though. So this is pretty successful, and might be 100% accurate with an easier font.
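To cover more than the shadda, one option is a regular expression over the Arabic combining-mark range. This is a minimal sketch, assuming the stray space always sits directly before the mark as it does in the output above. The function name and the keep_marks flag are made up for illustration, and the character range is my own choice rather than anything pdfplumber defines:

```python
import re

# Arabic combining marks: tanween, short vowels, shadda and sukun (U+064B-U+0652),
# the additional marks up to U+065F, and the superscript alef (U+0670).
_SPACED_MARK = re.compile(" ([\u064B-\u065F\u0670])")


def fix_spaced_diacritics(text: str, keep_marks: bool = False) -> str:
    """Remove the space inserted before each diacritic.

    With keep_marks=False the diacritic is dropped along with the space
    (same effect as the shadda replace above); with keep_marks=True only
    the space is removed and the diacritic stays attached to its letter.
    """
    return _SPACED_MARK.sub(r"\1" if keep_marks else "", text)


if __name__ == "__main__":
    sample = "تب" + " \u0651" + "ينت"  # a word with a stray space before the shadda
    print(fix_spaced_diacritics(sample))
    print(fix_spaced_diacritics(sample, keep_marks=True))
```

Keeping the marks is probably the better default for flashcards aimed at learners, while dropping them makes the text easier to search and compare.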
Good luck!
Upvotes: 1