Reputation: 3156
I have a dataframe (df), originally from an Excel file, and the first 9 rows look like this:
   Control    Recd_Date/Due_Date   Action                    Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/           M WEBB
1  NaN        2000-02-29 00:00:00  NaN                       DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/            G KAN
3  NaN        2000-01-03 00:00:00  OC/ORA/ORO/PNC/           PALM POST
4  NaN        NaN                  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN        NaN                  DO/FLA-CB/                NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN        2000-02-03 00:00:00  DO/CHI-CB/                BERNSTEIN LIEBHARD &
8  NaN        NaN                  NaN                       LONDON LLP
I want to transform this dataframe (e.g. first 9 rows) to this:
   Control    Recd_Date/Due_Date                       Action                                                           Signature/Requester
0  2000-1703  2000-01-31 00:00:00,2000-02-29 00:00:00  OC/OER/OPA/PMS/                                                  M WEBB,DATA CORP
1  2000-1776  2000-01-02 00:00:00,2000-01-03 00:00:00  OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/  G KAN,PALM POST
2  2000-1983  2000-02-02 00:00:00,2000-02-03 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/                               M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
So basically:
Can anyone help me out, please? This is the code I'm trying to get to work:
for i, row in df.iterrows():
    if pd.isnull(df.ix[i]['Control_#']):
        df.ix[i-1]['Recd_Date/Due_Date'] = str(df.ix[i-1]['Recd_Date/Due_Date'])+'/'+str(df.ix[i]['Recd_Date/Due_Date'])
        df.ix[i-1]['Subject'] = str(df.ix[i-1]['Subject'])+' '+str(df.ix[i]['Subject'])
        if str(df.ix[i-1]['Action_Office'])[-1] == '-':
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+str(df.ix[i]['Action_Office'])
        else:
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+','+str(df.ix[i]['Action_Office'])
        if pd.isnull(df.ix[i-1]['Signature/Requester']):
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+str(df.ix[i]['Signature/Requester'])
        elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&':
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+' '+str(df.ix[i]['Signature/Requester'])
        else:
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+','+str(df.ix[i]['Signature/Requester'])
        df.drop(df.index[i])
Why doesn't the drop() work? I am trying to drop the current row (if its ['Control_#'] is null) so that the next row (whose ['Control_#'] is also null) can be appended to the previous row (whose ['Control_#'] is NOT null), iteratively.
Much appreciated!!
Upvotes: 2
Views: 10668
Reputation: 4051
I think you need to group the rows together and then join up the column values. The tricky part is finding a way to group together the rows in the way you want. Here is my solution...
Since your groups depend on the order of your rows, I used a static variable on a function to assign every row to a specific group:
def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # a non-null value is the signal to switch group
    return rolling_group.group
rolling_group.group = 0  # static variable
This function is applied to the Control series to give each row a group label, which is then used to split up the dataframe so that the rows of each group can be merged:
#groups = df.groupby(df['Control'].apply(rolling_group),as_index=False)
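The same group labels can also be built without a stateful function. This is a sketch of an alternative, not the approach used in this answer: `notna()` marks the rows where a new record starts, and `cumsum()` turns those marks into a running group number. The `demo` frame and the `group` label name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame in the question.
demo = pd.DataFrame({
    "Control": ["2000-1703", np.nan, "2000-1776", np.nan, np.nan],
    "Action": ["OC/OER/", np.nan, "OC/ORA/", "OC/ORO/", "DO/FLA-CB/"],
})

# True where a new record starts; the running sum numbers the groups 1, 2, ...
group_key = demo["Control"].notna().cumsum().rename("group")

# Within each group, concatenate the non-null values of every column.
merged = demo.groupby(group_key).agg(lambda col: "".join(col.dropna()))
print(merged)
```

Unlike the static-variable version, this key does not need to be reset before it is computed a second time.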
That is really the only tricky part. After that, you can merge the rows by applying a function to each group that produces your desired output:
import re
from io import StringIO

import pandas as pd

def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # a non-null value is the signal to switch group
    return rolling_group.group
rolling_group.group = 0  # static variable

def joinFunc(g, column):
    col = g[column]
    joiner = "/" if column == "Action" else ","
    # str(each) converts every value to a string before joining
    s = joiner.join([str(each) for each in col if pd.notnull(each)])
    # the regexes below clean up the joined string
    s = re.sub("(?<=&)" + joiner, " ", s)  # after "&", the joiner becomes " "
    s = re.sub("(?<=-)" + joiner, "", s)   # after "-", the joiner becomes ""
    s = re.sub(joiner * 2, joiner, s)      # fixes the double-joiner condition
    return s

if __name__ == "__main__":
    data = """   Control    Recd_Date/Due_Date   Action                    Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/           M WEBB
1  NaN        2000-02-29 00:00:00  NaN                       DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/            G KAN
3  NaN        2000-01-03 00:00:00  OC/ORA/ORO/PNC/           PALM POST
4  NaN        NaN                  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN        NaN                  DO/FLA-CB/                NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN        2000-02-03 00:00:00  DO/CHI-CB/                BERNSTEIN LIEBHARD &
8  NaN        NaN                  NaN                       LONDON LLP"""
    # columns are separated by two or more whitespace characters
    df = pd.read_csv(StringIO(data), sep=r"\s\s+", engine="python")
    groups = df.groupby(df["Control"].apply(rolling_group), as_index=False)
    groupFunct = lambda g: pd.Series([joinFunc(g, col) for col in g.columns], index=g.columns)
    print(groups.apply(groupFunct))
output
Control Recd_Date/Due_Date \
0 2000-1703 2000-01-31 00:00:00,2000-02-29 00:00:00
1 2000-1776 2000-01-02 00:00:00,2000-01-03 00:00:00
2 2000-1983 2000-02-02 00:00:00,2000-02-03 00:00:00
Action \
0 OC/OER/OPA/PMS/
1 OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-...
2 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/
Signature/Requester
0 M WEBB,DATA CORP
1 G KAN,PALM POST
2 M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
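As for why `drop()` in the question's loop appeared to do nothing: `DataFrame.drop` returns a new frame by default and leaves the original untouched, so a bare `df.drop(...)` call throws its result away. A minimal sketch on made-up data (the `demo` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one continuation row (null Control).
demo = pd.DataFrame({"Control": ["2000-1703", np.nan, "2000-1776"]})

# drop() returns a new frame; without keeping the result, demo is unchanged.
demo.drop(demo.index[1])
rows_after_bare_drop = len(demo)

# Keeping the result, or filtering with a boolean mask, removes the row.
kept = demo.drop(demo.index[1])
filtered = demo[demo["Control"].notna()]
print(rows_after_bare_drop, len(kept), len(filtered))
```

Chained assignments such as `df.ix[i-1][col] = ...` may have the same kind of problem, since they can write to a temporary copy instead of `df`. Note also that `.ix` was removed in pandas 1.0.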
Upvotes: 6