Scrape data from PDF with Python but not from a table or a normal te

Question

Hello guys, and thank you in advance for helping me.

So basically, I am trying to scrape data from a PDF.

This is the PDF data:

What I want to do is extract the data from it like this:

I tried to do it with tabula, but it gave me this:

I also tried regular expressions, but got nothing.

Can you please help me?

import tabula
import pandas as pd


df = (
    pd.concat(
        tabula.read_pdf(
            "/content/drive/MyDrive/Stage/word.pdf",
            pages="all",
            pandas_options={"header": None},
        )
    )
    # Collapse the single-column frame to a Series, then split each line into fields
    .squeeze()
    .str.extract(r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)")
    # Strip stray whitespace from every captured field
    .stack(dropna=False).str.strip().unstack()
    .set_axis(["word", "type", "comment", "suffix"], axis=1)
    [["word", "type"]]  # keeps only the columns from the expected output; drop this line to keep all four
)

df.to_excel("table.xlsx", index=False)  # drop this line if you don't need a spreadsheet
print(df)
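
If the pattern returns nothing on your data, it may help to test it on plain strings first, without tabula or pandas in the way. Below is a minimal sketch using only the standard `re` module. The sample lines are hypothetical (the real PDF text is not shown here, so its layout is an assumption), and `parse_line`, `FIELDS` and `sample_lines` are names made up for this illustration.

```python
import re

# Same pattern as in the snippet above.
PATTERN = re.compile(r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)")
FIELDS = ["word", "type", "comment", "suffix"]


def parse_line(line):
    """Return a dict of stripped fields, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Optional groups that did not take part in the match come back as None.
    return {
        name: (value.strip() if value is not None else None)
        for name, value in zip(FIELDS, match.groups())
    }


# Hypothetical lines; the text tabula extracts from the real PDF may differ.
sample_lines = [
    "1) abandon v, n ACADEMIC B2",
    "2) ability n A2",
    "a line without the expected layout",
]

for line in sample_lines:
    print(repr(line), "->", parse_line(line))
```

A line that prints `None` does not fit the pattern at all, which is usually a sign that the extracted text has a different layout than expected (for example, no closing parenthesis before the word).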

