Alicia Pliego
Alicia Pliego

Reputation: 181

How can I use the .findall() function for a excel file iterating through all rows of a column?

I have a big excel sheet with information about different companies altogether in a single cell for each company and my goal is to separate this into different columns following patterns to scrape the info from the first column. The original data looks like this:

enter image description here

My goal is to achieve a dataframe like this:

enter image description here

I have created the following code to use the patterns Mr., Affiliation:, E-mail:, and Mobile because they are repeated in every single row the same way. However, I don't know how to use the findall() function to scrape all the info I want from each row of the desired column.

import openpyxl
import re
import sys  
import pandas as pd
reload(sys)  
sys.setdefaultencoding('utf8')
wb = openpyxl.load_workbook('/Users/ap/info1.xlsx')
ws = wb.get_sheet_by_name('Companies')
w={'Name': [],'Affiliation': [], 'Email':[]}
for row in ws.iter_rows('C{}:C{}'.format(ws.min_row,ws.max_row)):
    for cells in row:
        a=re.findall(r'Mr.(.*?)Affiliation:',aa, re.DOTALL)
        a1="".join(a).replace('\n',' ')
        b=re.findall(r'Affiliation:(.*?)E-mail',aa,re.DOTALL)
        b1="".join(b).replace('\n',' ')
        c=re.findall(r'E-mail(.*?)Mobile',aa,re.DOTALL)
        c1="".join(c).replace('\n',' ')
        w['Name'].append(q1)
        w['Affiliation'].append(r1)
        w['Email'].append(s1)
        print cell.value
        
df=pd.DataFrame(data=w)
df.to_excel(r'/Users/ap/info2.xlsx')  

Upvotes: 1

Views: 430

Answers (1)

Anna Nevison
Anna Nevison

Reputation: 2759

I would go with this, which just replaces the 'E-mail:...' with a delimiter and then splits and assigns to the right column

df['Name'] = np.nan
df['Affiliation'] = np.nan
df['Email'] = np.nan
df['Mobile'] = np.nan

for i in range(0, len(df)):
    full_value = df['Companies'].loc[i]
    full_value = full_value.replace('Affiliation:', ';').replace('E-mail:', ';').replace('Mobile:', ';')
    full_value = full_value.split(';')
    df['Name'].loc[i] = full_value[0]
    df['Affiliation'].loc[i] = full_value[1]
    df['Email'].loc[i] = full_value[2]
    df['Mobile'].loc[i] = full_value[3]

del df['Companies']
print(df)

Upvotes: 1

Related Questions