Lalit Joshi

Reputation: 137

Extract Text from a word document

I am trying to scrape data from a Word document available at: https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx

I need to extract the Name, Address, City, State, and Email ID. So far I can extract the email using the code below.

    import docx
    
    content = docx.Document('HE Distributors.docx')
    
    # collect the indices of the paragraphs that mention an email
    location = []
    for i, para in enumerate(content.paragraphs):
        if 'Email' in para.text:
            location.append(i)

    for i in location:
        print(content.paragraphs[i].text)

I tried the steps described in "How to read data from .docx file in python pandas?"

I need to convert this into a DataFrame with all the columns mentioned above, but I am still running into issues.

Upvotes: 0

Views: 1881

Answers (1)

Driftr95

Reputation: 4710

There are some inconsistencies in the document: phone numbers are prefixed with Tel: in some places, Tel.: in others, and even Te: once; one of the emails sits on the last line for its distributor without the Email: prefix; and the State isn't always on the last line. Still, most of the data can be extracted with regex and/or splits.
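As a rough illustration of how a loose prefix pattern can absorb those variants, here is a minimal sketch. The sample lines, the `classify` helper, and the exact patterns are made up for this example and are not taken from the document; a pattern this loose could also misfire on an address line that happens to start with "Te" and contain a colon.

```python
import re

# Loose prefixes: "Tel:", "Tel.:" and "Te:" all match TEL.
TEL = re.compile(r"^Te[^:]*:\s*")
MOB = re.compile(r"^Mob[^:]*:\s*")
# Deliberately simple email pattern; it is not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def classify(line):
    """Return a (kind, value) pair for one line of a distributor block."""
    if TEL.match(line):
        return "tel", TEL.sub("", line).strip()
    if MOB.match(line):
        return "mobile", MOB.sub("", line).strip()
    found = EMAIL.search(line)
    if found:
        # Works with or without the "Email:" prefix.
        return "email", found.group(0)
    return "address", line.strip()

# Hypothetical sample lines imitating the inconsistencies described above.
for sample in ["Tel: 011-23456789", "Tel.: 022-98765432", "Te: 033-11112222",
               "Email: info@example.com", "sales@example.org"]:
    print(classify(sample))
```

Searching for the address itself, rather than relying on the `Email:` prefix, is what lets the prefix-less email line be picked up.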

The distributors are separated by empty lines, and the names are in a different color, so I defined this function to get the font color of any paragraph from its XML:

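from bs4 import BeautifulSoup  # the 'xml' parser also needs lxml installed

def getParaColor(para):
  # value of the first <w:color> in the paragraph's xml, or '' if there is none
  try:
    return BeautifulSoup(
        para.paragraph_format.element.xml, 'xml'
    ).find('color').get('w:val')
  except Exception:
    return ''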

The try...except hasn't been necessary yet, but just in case...

(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
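To show why the XML can contain text that a run-based read misses, here is a small standard-library sketch. It assumes the missing email sat inside a hyperlink element, which is a guess about this document and not something confirmed above; the paragraph XML is a made-up fragment, and `direct_run_text` and `all_text` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up paragraph: one ordinary run, plus one run nested in a hyperlink.
PARA_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Email: </w:t></w:r>
  <w:hyperlink>
    <w:r><w:t>info@example.com</w:t></w:r>
  </w:hyperlink>
</w:p>
"""

def direct_run_text(p):
    """Text from runs that are direct children of the paragraph only."""
    return "".join(
        t.text or ""
        for r in p.findall(f"{{{W}}}r")
        for t in r.findall(f"{{{W}}}t")
    )

def all_text(p):
    """Text from every w:t element anywhere inside the paragraph."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))

para = ET.fromstring(PARA_XML)
print(repr(direct_run_text(para)))  # the nested run is skipped
print(repr(all_text(para)))         # the nested run is included
```

Comparing the two lengths is the same kind of check that `splitParas` below does when it prefers the XML text if it is longer than `.text`.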

Then, you can process the paragraphs from docx.Document with a function like:

import re

def splitParas(paras):
  ptc = [(
      p.text, getParaColor(p), p.paragraph_format.element.xml
  ) for p in paras]
  curSectn = 'UNKNOWN'
  splitBlox = [{}]

  for pt, pc, px in ptc:
    # double-check for missing text
    xmlText = BeautifulSoup(px, 'xml').text
    xmlText = ' '.join([s for s in xmlText.split() if s != ''])
    if len(xmlText) > len(pt): pt = xmlText

    # initiate
    if not pt:
      if splitBlox[-1] != {}:
        splitBlox.append({})
      continue
    if pc == '20752E':
      curSectn = pt.strip()
      continue
    if splitBlox[-1] == {}:
      splitBlox[-1]['section'] = curSectn
      splitBlox[-1]['raw'] = [] 
      splitBlox[-1]['Name'] = []
      splitBlox[-1]['address_raw'] = []

    # collect
    splitBlox[-1]['raw'].append(pt)   
    if pc == 'D12229':
      splitBlox[-1]['Name'].append(pt)
    elif re.search("^Te.*:.*", pt):
      splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
    elif re.search("^Mob.*:.*", pt):
      splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
    elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
      splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
    else:
      splitBlox[-1]['address_raw'].append(pt)
  
  # some cleanup
  if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
  for i in range(len(splitBlox)):
    addrsParas = splitBlox[i]['address_raw'] # for later

    # join lists into strings
    splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
    for k in ['raw', 'address_raw']:
      splitBlox[i][k] = '\n'.join(splitBlox[i][k])

    # search address for City, State and PostCode
    apLast = addrsParas[-1].split(',')[-1] if addrsParas else ''  # no address lines -> no State
    maybeCity = [ap for ap in addrsParas if '–' in ap]
    if apLast and '–' not in apLast:
      splitBlox[i]['State'] = apLast.strip()
    if maybeCity:
      maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
      maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
      splitBlox[i]['City'] = maybeCity.strip()
      splitBlox[i]['PostCode'] = maybePIN.strip()
    
    # add mobile to tel
    if 'mobile_raw' in splitBlox[i]:
      if 'tel_raw' not in splitBlox[i]:
        splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
      else:
        splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
      del splitBlox[i]['mobile_raw']

    # split tel [as needed]
    if 'tel_raw' in splitBlox[i]:
      tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')] 
      telNum = []

      for tel in tel_i:
        if '/' in tel:
          tns = [tn.strip() for tn in tel.split('/')]
          tel1 = tns[0]
          telNum.append(tel1)
          for tn in tns[1:]:
            # reuse the leading digits of the first number for shortened alternates
            telNum.append(tel1[:-1*len(tn)]+tn)
        else:
          telNum.append(tel)
      
      splitBlox[i]['Tel_1'] = telNum[0]
      splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
  
  return splitBlox
  • Since I was getting the font color anyway, I added another column called "section" to hold East/West/etc. I added "PostCode" too, since it seems to sit right after "City", on the other side of the "–".
  • Since "raw" is saved, any other value can at least be double-checked manually.
  • The function combines "Mobile" into "Tel" even though they're extracted with separate regexes.
    • I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out wrong if they were separated with '/'.
    • Also, "Tel" is either a string or a list of strings, depending on how many numbers there were in "tel_raw".
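The '/' handling and the mixed string/list shape of "Tel" can be sketched in isolation like this. `expand_slashed` and `as_list` are hypothetical helpers written for illustration, and the phone numbers are invented; the suffix-replacement idea mirrors what the function above does, with an added guard for empty pieces.

```python
def expand_slashed(number):
    """Expand '23456789/90' into full numbers by replacing the trailing digits."""
    parts = [p.strip() for p in number.split('/')]
    first = parts[0]
    expanded = [first]
    for tail in parts[1:]:
        if not tail:
            continue  # ignore empty pieces such as a trailing '/'
        # Keep the leading part of the first number, swap in the new ending.
        expanded.append(first[:-len(tail)] + tail)
    return expanded

def as_list(tel):
    """Normalize a 'Tel' value (a string or a list of strings) to a list."""
    return [tel] if isinstance(tel, str) else list(tel)

print(expand_slashed("23456789/90"))
print(as_list("9800000000"))
```

This only works when the part after '/' really is a replacement ending for the first number, which is why anything beyond "Tel_1" is worth checking against "tel_raw".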

After this, you can view the result as a DataFrame with:

import docx
import pandas

content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City', 
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]

Upvotes: 1
