Reputation: 165
This question is probably a duplicate, but none of the answers in similar questions helped me. I'm looking for a simple way to extract text from a pdf file into any other type of file or structure which will let me use it.
The text I want to extract appears on pages 78-79.
At the end of the process, I want to write each cell of the table on its own line in a .txt file. For example, I want to turn the first row of the table from this:
to this:
0x00
Channel standby
CH_7
CH_6
CH_5
CH_4
CH_3
CH_2
CH_1
CH_0
0x00
RW
I'm using Visual Studio 2017, but I can also work in PyCharm instead.
I've tried all the options suggested in this question and here, but I'm having problems installing the required libraries on Windows 10. I'm also not sure whether those libraries are still in use and supported. I'd appreciate it if anyone could point me to some up-to-date material on this subject or to a relevant library.
Thank you.
Upvotes: 0
Views: 100
Reputation: 10809
Here's something using PyMuPDF (pip install pymupdf).
In this example, get_document_bytes simply makes a request to the PDF resource at the URL you provided (using the third-party requests module) and returns the PDF bytes. We use the bytes in main to create a fitz.Document instance via the stream parameter. You could also just download the PDF file manually and provide a filename instead of a stream of bytes, but I didn't feel like doing that. We grab a specific page from the document and print all the text on that page:
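def get_document_bytes():
    import requests

    url = "https://www.mouser.co.il/datasheet/2/609/AD7768-7768-4-1502035.pdf"
    # Some servers reject requests that lack a browser-like user agent.
    headers = {
        "user-agent": "Mozilla/5.0"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # raise on HTTP errors (4xx/5xx)
    return response.content


def main():
    import fitz  # PyMuPDF

    desired_page = 78  # 1-based page number, as printed in the datasheet
    doc = fitz.Document(stream=get_document_bytes(), filetype="pdf")
    # Pages are 0-based in PyMuPDF. Older releases spelled these methods
    # loadPage() and getText(); current releases use the snake_case names.
    page = doc.load_page(desired_page - 1)
    print(page.get_text())
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())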
Output:
AD7768/AD7768-4
Data Sheet
Rev. B | Page 78 of 105
AD7768 REGISTER MAP DETAILS (SPI CONTROL)
AD7768 REGISTER MAP
See Table 63 and the AD7768-4 Register Map Details (SPI Control) section for the AD7768-4 register map and register functions.
Table 37. Detailed AD7768 Register Map
Reg.
Name
Bit 7
Bit 6
Bit 5
Bit 4
Bit 3
Bit 2
Bit 1
Bit 0
Reset RW
0x00
Channel standby
CH_7
CH_6
CH_5
CH_4
CH_3
CH_2
CH_1
CH_0
0x00
RW
...
I realize you want the text from two pages, not just one - and you also don't want all the text from these pages, just the stuff that's in the table. This is just to get you started - I may tinker around with this a bit more, and update my post later.
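As a possible next step, here is a rough sketch of how the table cells could be separated from the rest of the page text and written one per line to a .txt file. It rests on assumptions I haven't verified against both pages: that the last header cell comes out as its own "Reset RW" line (as in the output above), and that everything after it on the page is table content. The names table_cells, write_cells and extract_pages are made up for illustration; only load_page and get_text are PyMuPDF calls.

```python
from pathlib import Path


def table_cells(page_text, header_end="Reset RW"):
    """Return the lines that follow the table header on one page.

    Assumption: the last header cell appears on its own line as
    `header_end`, and everything after it is table content.
    """
    lines = [line.strip() for line in page_text.splitlines()]
    lines = [line for line in lines if line]  # drop blank lines
    if header_end not in lines:
        return []  # no table header found on this page
    start = lines.index(header_end) + 1
    return lines[start:]


def write_cells(cells, path):
    """Write one cell per line to a UTF-8 text file."""
    Path(path).write_text("\n".join(cells) + "\n", encoding="utf-8")


def extract_pages(doc, first_page, last_page):
    """Collect table cells from an inclusive range of 1-based page numbers.

    `doc` is expected to behave like a fitz.Document.
    """
    cells = []
    for number in range(first_page, last_page + 1):
        page = doc.load_page(number - 1)  # PyMuPDF pages are 0-based
        cells.extend(table_cells(page.get_text()))
    return cells
```

With a locally saved copy of the datasheet (the filenames here are placeholders), something like write_cells(extract_pages(fitz.open("datasheet.pdf"), 78, 79), "registers.txt") would tie the pieces together. If page 79 repeats the header differently or the text comes out in another order, the header_end marker would need adjusting.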
Upvotes: 1