No text is shown when using PdfFileReader

Question

Suppose I'd like to extract the text from a PDF file such as this one: https://www.lyxoretf.nl/pdfDocuments/Factsheets/RFACT_FR0010377028_EN_20190131_NLD.pdf?pfdrid_c=false&uid=4cc6aef9-9e75-46d7-9416-65cd7b2b5dd6&download=null

import io
import requests
from PyPDF2 import PdfFileReader

url = 'https://www.lyxoretf.nl/pdfDocuments/Factsheets/RFACT_FR0010377028_EN_20190131_NLD.pdf?pfdrid_c=false&uid=4cc6aef9-9e75-46d7-9416-65cd7b2b5dd6&download=null'

r = requests.get(url)
f = io.BytesIO(r.content)

reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')

Unfortunately, the code provided in related links doesn't return the text in the file.

Is there a way to extract the text from these types of files?

Rahul Agarwal · Accepted Answer

import fitz  # PyMuPDF: pip install PyMuPDF

# Path to a local copy of the PDF; download the file to your machine first
path = r'\Factsheets_RFACT_FR0010377028_EN_20190131_NLD.pdf'

text = ""
doc = fitz.open(path)
for page in doc:
    # get_text() is the current name; older PyMuPDF releases call it getText()
    text += page.get_text()
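
If saving the file to disk first is inconvenient, the bytes downloaded in the question can probably be handed to PyMuPDF directly. The following is a minimal sketch, assuming a recent PyMuPDF release (where the page method is named get_text); the helper names extract_text_from_bytes and non_empty_lines are hypothetical and not part of any library.

```python
def extract_text_from_bytes(pdf_bytes):
    """Return the text of every page of a PDF held in memory."""
    # Imported here so the line-splitting helper below works without PyMuPDF.
    import fitz  # pip install PyMuPDF

    # stream= takes the raw bytes; filetype tells PyMuPDF how to parse them.
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        return "".join(page.get_text() for page in doc)
    finally:
        doc.close()


def non_empty_lines(text):
    """Split extracted text into stripped lines, dropping blank ones."""
    return [line.strip() for line in text.split("\n") if line.strip()]
```

With the r = requests.get(url) response from the question, this would be called as non_empty_lines(extract_text_from_bytes(r.content)).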
