Reputation: 271
I'd like to extract the text from a PDF file such as this one: https://www.lyxoretf.nl/pdfDocuments/Factsheets/RFACT_FR0010377028_EN_20190131_NLD.pdf?pfdrid_c=false&uid=4cc6aef9-9e75-46d7-9416-65cd7b2b5dd6&download=null
import io
import requests
from PyPDF2 import PdfFileReader
url = 'https://www.lyxoretf.nl/pdfDocuments/Factsheets/RFACT_FR0010377028_EN_20190131_NLD.pdf?pfdrid_c=false&uid=4cc6aef9-9e75-46d7-9416-65cd7b2b5dd6&download=null'
r = requests.get(url)
f = io.BytesIO(r.content)
reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')
Unfortunately, the code provided in related links doesn't return the text in the file.
Is there a way to extract the text from these types of files?
Upvotes: 0
Views: 77
Reputation: 136665
pypdf has improved a lot in 2022. Give it another shot:
from io import BytesIO
import urllib.request
import pypdf
url = "https://www.amundietf.nl/download/f510d5bd-9e87-4ddd-b3af-dc62dc6fca3e/MonthlyFactsheet_2053760_13166_NLD_ENG_ETF_INSTITUTIONNEL_20221031.pdf"
data = urllib.request.urlopen(url).read()
# creating a pdf reader object
reader = pypdf.PdfReader(BytesIO(data))
# printing number of pages in pdf file
print(len(reader.pages))
# creating a page object
page = reader.pages[0]
# extracting text from page
print(page.extract_text())
Upvotes: 0
Reputation: 4100
import fitz  # PyMuPDF: pip install PyMuPDF
# Path to a local copy of the PDF on your machine
path = r'\Factsheets_RFACT_FR0010377028_EN_20190131_NLD.pdf'
text = ""
doc = fitz.open(path)
for page in doc:
    # get_text() is the current name; older PyMuPDF releases called it getText()
    text += page.get_text()
Upvotes: 1