Rapid1898

Reputation: 1190

How to extract radiobutton / checkbox information with python from a pdf-file?

I would like to get the radio button / checkbox information from a PDF document. I had a look at pdfplumber and PyPDF2, but was not able to find a solution with these modules.

I can parse the text using the code below, but for the radio buttons I only get the label text and no information about which button (or checkbox) is selected.

import pdfplumber
import os
import sys

if __name__ == '__main__':
  path = os.path.abspath(os.path.dirname(sys.argv[0])) 
  fn = os.path.join(path, "input.pdf")
  pdf = pdfplumber.open(fn)
  page = pdf.pages[0]
  text = page.extract_text()


I have also uploaded an example file here: https://easyupload.io/8y8k2v

Is there any way to get this information from the pdf-file using python?

Upvotes: 0

Views: 2408

Answers (1)

Rapid1898

Reputation: 1190

I think I found a solution using pdfplumber. It is probably not elegant, but I can check whether the radio buttons are selected or not.

Generally:

  • I read all chars and all curves for all pages

  • then I sort all elements by x and y (to get the chars and curves in the same order as in the PDF)

  • then I concatenate the chars and insert a separator when the distance between two chars is larger than within a word

  • I check the pts information of the curves to determine whether the radio button is selected or not

  • the final lines and the yes/no information are stored line by line in a list for further processing

    import os
    import sys

    import pdfplumber

    # Look for input.pdf next to this script (same as in the question)
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    fn = os.path.join(path, "input.pdf")
    finalContent = []

    with pdfplumber.open(fn) as pdf:
        for idx, page in enumerate(pdf.pages, start=1):
            print(f"Reading page {idx}")

            # Collect chars and curves as [type, payload, x0, y0]
            contList = []
            for e in page.chars:
                contList.append(["char", e["text"], e["x0"], e["y0"]])
            for e in page.curves:
                contList.append(["curve", e["pts"], e["x0"], e["y0"]])

            # Two stable sorts: top-to-bottom (y0 descending), then left-to-right
            contList.sort(key=lambda x: x[2])
            contList.sort(key=lambda x: x[3], reverse=True)

            workContent = []
            workText = ""
            workDistCharX = None
            workDistCharY = None
            for e in contList:
                if e[0] == "char":
                    # Separator on a large horizontal gap or a jump to a lower line
                    if workDistCharX is not None and \
                       (e[2] - workDistCharX > 20 or e[3] - workDistCharY < -2):
                        workText += " / "
                    workText += e[1]
                    workDistCharX = e[2]
                    workDistCharY = e[3]
                    continue
                if e[0] == "curve":
                    # A curve ends the current text run
                    if workText != "":
                        workContent.append(workText)
                        workText = ""

                    # Heuristic for this input file: x of the first curve point
                    if e[1][0][0] < 100:
                        tmpVal = "SELECT-YES"
                    else:
                        tmpVal = "SELECT-NO"

                    workContent.append(f"CURVE {tmpVal}, None, None")

            # Keep the text that follows the last curve on the page
            if workText != "":
                workContent.append(workText)

            finalContent.extend(workContent)
            pageText = "\n".join(workContent)
    
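The merging step can also be tried without a PDF. The following is only a minimal sketch, not part of the original solution: `merge_elements`, its parameters and the sample data are hypothetical names, and the input is assumed to be plain dicts with the same keys the code above reads from pdfplumber (`text`, `pts`, `x0`, `y0`). The selection test is passed in as a function, because the `< 100` check above only fits this particular file. Unlike the code above, the sketch forgets the previous char after each curve, so a label that follows a button does not start with a separator.

```python
from typing import Callable, Dict, List


def merge_elements(
    chars: List[Dict],
    curves: List[Dict],
    is_selected: Callable[[Dict], bool],
    x_gap: float = 20.0,
    y_gap: float = 2.0,
    sep: str = " / ",
) -> List[str]:
    """Merge chars and curves into text runs and SELECT-YES/NO markers."""
    tagged = [("char", c) for c in chars] + [("curve", c) for c in curves]
    # Reading order: top-to-bottom (y0 descending), then left-to-right (x0 ascending)
    tagged.sort(key=lambda t: (-t[1]["y0"], t[1]["x0"]))

    lines: List[str] = []
    text = ""
    prev = None  # last char seen in the current text run
    for kind, e in tagged:
        if kind == "char":
            # Separator on a large horizontal gap or a jump to a lower line
            if prev is not None and (
                e["x0"] - prev["x0"] > x_gap or e["y0"] - prev["y0"] < -y_gap
            ):
                text += sep
            text += e["text"]
            prev = e
        else:
            # A curve closes the current text run and adds its own marker
            if text:
                lines.append(text)
                text = ""
            prev = None
            lines.append("SELECT-YES" if is_selected(e) else "SELECT-NO")
    if text:
        lines.append(text)  # text after the last curve
    return lines


# Made-up sample: two buttons on one line, each followed by its label
chars = [
    {"text": "Y", "x0": 50, "y0": 700},
    {"text": "e", "x0": 56, "y0": 700},
    {"text": "s", "x0": 62, "y0": 700},
    {"text": "N", "x0": 150, "y0": 700},
    {"text": "o", "x0": 156, "y0": 700},
]
curves = [
    {"pts": [(40, 700)], "x0": 40, "y0": 700},
    {"pts": [(140, 700)], "x0": 140, "y0": 700},
]
# Stand-in predicate that mirrors the threshold used in the code above
print(merge_elements(chars, curves, lambda c: c["pts"][0][0] < 100))
# → ['SELECT-YES', 'Yes', 'SELECT-NO', 'No']
```

For a real file, `page.chars` and `page.curves` could be passed in directly, with a predicate that matches how the selected button is drawn in that document.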

Upvotes: 0
