Reputation: 13
I'm relatively new to python and pandas (and coding more generally).
I've been trying to figure out the best way to concatenate specific rows in a pd dataframe. I have a df that looks something like
d = {'sheet_index':[1,2,2,3,3,3,4,5],
'Type':['Note','Note','Note','Note','Note','Note','Note','Note'],
'Strings_To_Concat':['string1','string2','string3','string4','string5','string6','string7','string8']}
df = pd.DataFrame(data=d)
df
sheet_index Type Strings_To_Concat
0 1 Note string1
1 2 Note string2
2 2 Note string3
3 3 Note string4
4 3 Note string5
5 3 Note string6
6 4 Note string7
7 5 Note string8
I want to concatenate the strings in the 'Strings_To_Concat' column if they have the same sheet_index value and are of the Type 'Note' while creating null values in duplicate sheet_index values and not doing anything if the sheet_index isn't duplicated. The result should look something like the below:
sheet_index Type Strings_To_Concat
0 1 Note string1
1 2 Note null
2 2 Note string3 string2
3 3 Note null
4 3 Note null
5 3 Note string6 string5 string4
6 4 Note string7
7 5 Note string8
The problem I'm running into is really just an algorithmic one as I've said I'm relatively new to coding. I've tried looping through the pd to identify which rows are designated as Type 'Note' since there are several types in my dataset and from there if the sheet index is the same as the next one concatenate the next one with the current one.
for i in range(0,len(df)-1):
if(df.iloc[i,1] == 'Note' and df.iloc[i,0] != df.iloc[i+1,0]):
df.loc[i,2] = str(df.iloc[i,2])
elif(df.iloc[i,1] == 'Note' and df.iloc[i,0] == df.iloc[i+1,0]):
df.loc[i+1,2] = str(df.iloc[i+1,2]) + ' ' + str(df.iloc[i,2])
However, this is producing a strange result. Its creating a 4th column with the header '2' and giving me:
sheet_index Type Strings_To_Concat 2
0 1 Note string1 string1
1 2 Note string2 null
2 2 Note string3 string3
3 3 Note string4 null
4 3 Note string5 string5 string4
5 3 Note string6 string6
6 4 Note string7 string7
7 5 Note string8 null
Can someone help me explain why its giving me a new column and the loop isn't behaving as I am suspecting it should? Is there a more elegant solution using pandas methods to doing what I'm trying to achieve other than just using basic for loops?
Any help is very much appreciated
Upvotes: 1
Views: 46
Reputation: 71687
Solution:
m1 = df['Type'].eq('Note')
m2 = df['sheet_index'].duplicated(keep='last')
df['col'] = df[m1].groupby('sheet_index')['Strings_To_Concat'].transform(', '.join)
df['col'] = df['col'].fillna(df['Strings_To_Concat']).mask(m1 & m2)
How it works?
Type=Note
, then group by sheet_index
and transform Strings_to_Concat
using ','.join
and assign this result to new col
col
using the values from Strings_To_Concat
so that we can preserve the other values where Type!=Note
mask
the values which are duplicated
and have Type=Note
Result
sheet_index Type Strings_To_Concat col
0 1 Note string1 string1
1 2 Note string2 NaN
2 2 Note string3 string2, string3
3 3 Note string4 NaN
4 3 Note string5 NaN
5 3 Note string6 string4, string5, string6
6 4 Note string7 string7
7 5 Note string8 string8
Upvotes: 1