Reputation: 181
I have a big excel sheet with information about different companies altogether in a single cell for each company and my goal is to separate this into different columns following patterns to scrape the info from the first column. The original data looks like this:
My goal is to achieve a dataframe like this:
I have created the following code to use the patterns Mr., Affiliation:, E-mail:, and Mobile because they are repeated in every single row the same way. However, I don't know how to use the findall() function to scrape all the info I want from each row of the desired column.
import openpyxl
import re
import sys
import pandas as pd
reload(sys)
sys.setdefaultencoding('utf8')
wb = openpyxl.load_workbook('/Users/ap/info1.xlsx')
ws = wb.get_sheet_by_name('Companies')
w={'Name': [],'Affiliation': [], 'Email':[]}
for row in ws.iter_rows('C{}:C{}'.format(ws.min_row,ws.max_row)):
for cells in row:
a=re.findall(r'Mr.(.*?)Affiliation:',aa, re.DOTALL)
a1="".join(a).replace('\n',' ')
b=re.findall(r'Affiliation:(.*?)E-mail',aa,re.DOTALL)
b1="".join(b).replace('\n',' ')
c=re.findall(r'E-mail(.*?)Mobile',aa,re.DOTALL)
c1="".join(c).replace('\n',' ')
w['Name'].append(q1)
w['Affiliation'].append(r1)
w['Email'].append(s1)
print cell.value
df=pd.DataFrame(data=w)
df.to_excel(r'/Users/ap/info2.xlsx')
Upvotes: 1
Views: 430
Reputation: 2759
I would go with this, which just replaces the 'E-mail:...' with a delimiter and then splits and assigns to the right column
df['Name'] = np.nan
df['Affiliation'] = np.nan
df['Email'] = np.nan
df['Mobile'] = np.nan
for i in range(0, len(df)):
full_value = df['Companies'].loc[i]
full_value = full_value.replace('Affiliation:', ';').replace('E-mail:', ';').replace('Mobile:', ';')
full_value = full_value.split(';')
df['Name'].loc[i] = full_value[0]
df['Affiliation'].loc[i] = full_value[1]
df['Email'].loc[i] = full_value[2]
df['Mobile'].loc[i] = full_value[3]
del df['Companies']
print(df)
Upvotes: 1