Reputation: 1190
I would like to get the radio-button / checkbox information from a PDF document. I had a look at pdfplumber and PyPDF2 but was not able to find a solution with these modules.
I can parse the text using this code, but for the radio buttons I only get the text and no information about which button (or checkbox) is selected.
import pdfplumber
import os
import sys
if __name__ == '__main__':
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    fn = os.path.join(path, "input.pdf")
    pdf = pdfplumber.open(fn)
    page = pdf.pages[0]
    text = page.extract_text()
I have also uploaded an example file here: https://easyupload.io/8y8k2v
Is there any way to get this information from the PDF file using Python?
Upvotes: 0
Views: 2408
Reputation: 1190
I think I found a solution using pdfplumber. It is probably not elegant, but I can check whether the radio buttons are selected or not.
Generally:
I read all chars and all curves for every page
then I sort all elements by x and y (to get the chars and curves in the same order as in the PDF)
then I concatenate the chars and add a separator (" / ") when the distance between two chars is larger than within a word
I check the pts information of the curves to find out whether the radio button is selected or not
the final lines and the yes/no information are stored in a list, line by line, for further processing
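The sorting and concatenation steps can be tried without a PDF. The following is only a minimal sketch under assumptions: the helper name join_chars, its gap and line_drop parameters, and the sample values are invented for illustration; the dicts merely mimic the "text", "x0" and "y0" keys that pdfplumber provides on each entry of page.chars.

```python
# Sketch of the sort + concatenate steps on hand-made data (no PDF needed).
# join_chars, gap and line_drop are hypothetical names, not pdfplumber API.

def join_chars(chars, gap=20, line_drop=2):
    """Concatenate chars in reading order, inserting " / " on large jumps."""
    # Top to bottom first (y0 is measured from the bottom of the page),
    # then left to right within the same y0.
    ordered = sorted(chars, key=lambda c: (-c["y0"], c["x0"]))
    text = ""
    prev = None
    for c in ordered:
        # A large horizontal gap or a drop to a lower line starts a new chunk.
        if prev is not None and (
            c["x0"] - prev["x0"] > gap or c["y0"] - prev["y0"] < -line_drop
        ):
            text += " / "
        text += c["text"]
        prev = c
    return text


# Invented sample: "Yes" on one line, "No" on the line below, shuffled.
sample = [
    {"text": "o", "x0": 56.0, "y0": 700.0},
    {"text": "N", "x0": 50.0, "y0": 700.0},
    {"text": "Y", "x0": 50.0, "y0": 720.0},
    {"text": "e", "x0": 56.0, "y0": 720.0},
    {"text": "s", "x0": 62.0, "y0": 720.0},
]
print(join_chars(sample))  # prints: Yes / No
```

Sorting once by the tuple (-y0, x0) gives the same order as sorting by x first and then doing a stable sort by y in reverse. The thresholds 20 and 2 are the ones that worked for the example file and will likely need tuning for other layouts.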
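import pdfplumber
import os
import sys
# Look for input.pdf in the same folder as this script.
path = os.path.abspath(os.path.dirname(sys.argv[0]))
fn = os.path.join(path, "input.pdf")
pdf = pdfplumber.open(fn)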
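finalContent = []
for idx, page in enumerate(pdf.pages, start=1):
    print(f"Reading page {idx}")
    # Collect every char and every curve together with its position.
    contList = []
    for e in page.chars:
        tmpRow = ["char", e["text"], e["x0"], e["y0"]]
        contList.append(tmpRow)
    for e in page.curves:
        tmpRow = ["curve", e["pts"], e["x0"], e["y0"]]
        contList.append(tmpRow)
    # Two stable sorts: left to right, then top to bottom
    # (y0 is measured from the bottom of the page, hence reverse=True).
    contList.sort(key=lambda x: x[2])
    contList.sort(key=lambda x: x[3], reverse=True)
    workContent = []
    workText = ""
    workDistCharX = None
    workDistCharY = None
    for e in contList:
        if e[0] == "char":
            # Add a separator on a large horizontal gap or on a new line.
            if workDistCharX is not None and \
                    (e[2] - workDistCharX > 20 or e[3] - workDistCharY < -2):
                workText += " / "
            workText += e[1]
            workDistCharX = e[2]
            workDistCharY = e[3]
            continue
        if e[0] == "curve":
            # Flush the text collected so far before recording the curve.
            if workText != "":
                workContent.append(workText)
                workText = ""
            # Heuristic taken from the example file: a curve whose first
            # point has x < 100 is treated as a selected button.
            if e[1][0][0] < 100:
                tmpVal = "SELECT-YES"
            else:
                tmpVal = "SELECT-NO"
            workContent.append(f"CURVE {tmpVal}, None, None")
    # Keep any text that follows the last curve on the page.
    if workText != "":
        workContent.append(workText)
    finalContent.extend(workContent)
    # Page content as a single string, e.g. for printing.
    pageText = "\n".join(workContent)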
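For further processing, the flat list can be turned into (label, selected) pairs. This is only a sketch under a loud assumption: each "CURVE ..." marker is immediately followed by the text of the option it belongs to (button drawn to the left of its label). The helper pair_buttons and the demo list are hypothetical and not taken from the example file.

```python
# Sketch: pair each curve marker with the text entry that follows it.
# pair_buttons and demo are hypothetical; the assumption that the label
# comes right after its marker may not hold for every layout.

def pair_buttons(content):
    """Return a list of (label, selected) tuples from the flat content list."""
    pairs = []
    pending = None  # selection state of the most recent unmatched marker
    for item in content:
        if item.startswith("CURVE "):
            pending = "SELECT-YES" in item
        elif pending is not None:
            pairs.append((item, pending))
            pending = None
        # Text with no marker in front of it (e.g. a heading) is skipped.
    return pairs


demo = [
    "Question 1",
    "CURVE SELECT-YES, None, None",
    "Option A",
    "CURVE SELECT-NO, None, None",
    "Option B",
]
print(pair_buttons(demo))  # prints: [('Option A', True), ('Option B', False)]
```

If one button is drawn with several curves, several markers appear in a row and only the last one before the label is used here, so the pairing rule may need adjusting for such files.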
Upvotes: 0