How can I use the .findall() function for a excel file iterating through all rows of a column?

Question

I have a big excel sheet with information about different companies altogether in a single cell for each company and my goal is to separate this into different columns following patterns to scrape the info from the first column. The original data looks like this:

My goal is to achieve a dataframe like this:

I have created the following code to use the patterns Mr., Affiliation:, E-mail:, and Mobile because they are repeated in every single row the same way. However, I don't know how to use the findall() function to scrape all the info I want from each row of the desired column.

import openpyxl
import re
import sys  
import pandas as pd
reload(sys)  
sys.setdefaultencoding('utf8')
wb = openpyxl.load_workbook('/Users/ap/info1.xlsx')
ws = wb.get_sheet_by_name('Companies')
w={'Name': [],'Affiliation': [], 'Email':[]}
for row in ws.iter_rows('C{}:C{}'.format(ws.min_row,ws.max_row)):
    for cells in row:
        a=re.findall(r'Mr.(.*?)Affiliation:',aa, re.DOTALL)
        a1="".join(a).replace('
',' ')
        b=re.findall(r'Affiliation:(.*?)E-mail',aa,re.DOTALL)
        b1="".join(b).replace('
',' ')
        c=re.findall(r'E-mail(.*?)Mobile',aa,re.DOTALL)
        c1="".join(c).replace('
',' ')
        w['Name'].append(q1)
        w['Affiliation'].append(r1)
        w['Email'].append(s1)
        print cell.value
        
df=pd.DataFrame(data=w)
df.to_excel(r'/Users/ap/info2.xlsx')

Anna Nevison · Accepted Answer

I would go with this, which just replaces the 'E-mail:...' with a delimiter and then splits and assigns to the right column

df['Name'] = np.nan
df['Affiliation'] = np.nan
df['Email'] = np.nan
df['Mobile'] = np.nan

for i in range(0, len(df)):
    full_value = df['Companies'].loc[i]
    full_value = full_value.replace('Affiliation:', ';').replace('E-mail:', ';').replace('Mobile:', ';')
    full_value = full_value.split(';')
    df['Name'].loc[i] = full_value[0]
    df['Affiliation'].loc[i] = full_value[1]
    df['Email'].loc[i] = full_value[2]
    df['Mobile'].loc[i] = full_value[3]

del df['Companies']
print(df)

How can I use the .findall() function for a excel file iterating through all rows of a column?

Answers (1)

Related Questions