Reputation: 37
Hello everyone, and thank you in advance for your help.
I am trying to scrape data from a PDF.
This is the PDF data:
What I want to do is extract the data from it like this:
I tried to do it with tabula, but it gave me this:
I also tried a regular expression, but got nothing.
Can you please help me?
import tabula
import pandas as pd

df = (
    pd.concat(
        tabula.read_pdf(
            "/content/drive/MyDrive/Stage/word.pdf", pages="all", pandas_options={"header": None}))
    .squeeze().str.extract(r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)")
    .stack(dropna=False).str.strip().unstack()  # .str.strip(), not .strip(): a Series has no strip() method
    .set_axis(["word", "type", "comment", "suffix"], axis=1)
    [["word", "type"]]  # keep only the columns from the expected output
)
df.to_excel("table.xlsx", index=False)  # write the result to a spreadsheet
print(df)
Upvotes: -1
Views: 278
Reputation: 37857
You can try something like this with tabula-py & pandas :
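import pandas as pd
import tabula

df = (
    pd.concat(
        # read_pdf returns a list of DataFrames (one per detected table); header=None keeps the first row as data
        tabula.read_pdf(
            "file.pdf", pages="all", pandas_options={"header": None}))
    .squeeze()  # assumes a single column, which squeeze() turns into a Series
    .str.extract(r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)")
    .stack(dropna=False).str.strip().unstack()  # strip whitespace from every captured cell
    .set_axis(["word", "type", "comment", "suffix"], axis=1)
    #[["word", "type"]] #uncomment this line to match your expected output
)
#df.to_excel("table.xlsx", index=False) #uncomment this line to make a spreadsheet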
Output :
print(df)
           word                 type       comment  suffix
 0      abandon                 verb    STOP DOING      C1
 1     abnormal            adjective           NaN      C1
 2       aboard  adverb, preposition           NaN      C1
 3     abortion                 noun           NaN      C1
 4  absolutely!                  NaN           NaN      C1
 5       absorb                 verb      REMEMBER      C1
 6        abuse                 noun  WRONG ACTION      C1
 7   accelerate                 verb        HAPPEN      C1
 8   acceptable            adjective       ALLOWED      C1
 9   acceptance                 noun           NaN      C1
10     accepted            adjective           NaN      C1
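If you want to see what the pattern captures without going through the PDF, here is a minimal sketch using only the standard `re` module. The sample lines are made up for illustration (I'm assuming each raw line looks roughly like `1) word type COMMENT C1`; the text tabula actually returns may differ), and `split_line` is just a helper name I picked, not part of any library. `Series.str.extract` returns the groups of the first match in each string, so `re.search` should behave the same way here.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)")

# Hypothetical raw lines; the real text extracted from the PDF may look different.
sample_lines = [
    "1) abandon verb STOP DOING C1",
    "2) abnormal adjective C1",
    "3) aboard adverb, preposition C1",
    "4) absolutely! C1",
]


def split_line(line):
    """Return (word, type, comment, suffix), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Optional groups that did not participate come back as None;
    # the others may carry trailing whitespace, so strip them.
    return tuple(g.strip() if g is not None else None for g in match.groups())


rows = [split_line(line) for line in sample_lines]
for row in rows:
    print(row)
```

The lowercase group and the uppercase group are both optional, which is why a line with no part of speech or no comment still matches and leaves those fields empty (they show up as NaN in the DataFrame).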
PDF used :
Upvotes: 3