Reputation: 119
I am quite new to pandas, hence, I need help from you experts out there!
I'm quite confusing on concatenation the data from a multiple row.
#copy selected row and column. Set specific column into a appropriate data type
filep2 = pd.read_csv(r'/Users/syafiq/Downloads/RoutingPractice01/my_raw.csv')
#set column date/time and idd datatype
filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')
filep2['IDD']=filep2['IDD'].astype(str)
#select the specific column
df = pd.DataFrame(filep2, columns=['Account User Name','Sender','Body','IDD','Received Date/Time'])
df2= df.loc[df['IDD'].notnull(),['Account User Name','Sender','Body','IDD','Received Date/Time']]
df = pd.DataFrame(df2)
#create new column as Concat
concat =''
df['Concat']=concat
print(df2)
My csv file "my_raw.csv" contains thousands thousands of row with 15 columns.
While the body column content a message. The IDD column contain an assign key to that specific body message.
For example in my csv file, as follows;
Sender Body UDH Date/Time
ABC Hello CD30010101 01/01/20 1:57
ABC John CD30010102 01/01/20 1:58
XYZ Please Wait for your turn 3300020201 01/01/20 17:57
XYZ While waiting for our staff 3300020202 01/01/20 17:58
XYZ To Complete his task 3300020203 01/01/20 17:59
ABC Your Parcel is Here 1100DCB001 03/01/20 11:57
As you can see above, I want to concatenate the body into a single line and copy into a new column (which I already name it as a Concat). While selecting the date of the last the message were received.
For an example: ABC with UDH CD30010101 and CD30010102, the body message need to group together and copy it into a new column. While copy the last date/time received which is 01/01/20 1:58 into a new column as well.
Desired Output:
Concat Date/Time
(ROW 1) Hello John 01/01/20 1:58
(ROW 2) Please wait for your turn while waiting
for our staff to complete his task 01/01/20 17:59
I've been trying it for days, still no dice, keep hitting the brick wall.
Need your guidance and expertise!!
Appreciate and thank you very much!
Upvotes: 0
Views: 150
Reputation: 119
I figure it out, by using .str[:-1] in UDH column while selecting the last date of the received date/time.
groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
df = groups.agg({'Body':''.join, 'Received Date/Time':max}).reset_index()
df = df.sort_values('Received Date/Time')
pd.options.display.width = 200
print(df.sort_values('Received Date/Time'))
Hope it helps the others out there who trying to implement the same thing
Upvotes: 0
Reputation: 1420
Welcome to Stack Overflow.
Edited based on your file actual column name and desired column name:
I just realize you want to group them by Sender
as well as UHG
that starts with the same characters (how many? You didn't specify, so I just stick to 5 chars in my example below).
The following should achieve what you require:
def concat_series_to_string(series):
return ' '.join(series)
df['Received Date/Time'] = pd.to_datetime(df['Received Date/Time'])
# Update according to your actual session identification method.
df['UDH_session'] = df['UDH'].str[:5]
df_concat = df\
.groupby(['Sender','UDH_session'])\
.agg({'Body':[concat_series_to_string],'Received Date/Time':['last']})\
df_concat.columns = ['Concat','Received Date/Time (last)']
df_concat = df_concat.sort_values('Received Date/Time (last)')
Upvotes: 3
Reputation: 148910
I will assume that you only want to (almost) concat consecutive lines with same sender. In fact the required operation is probably more a ' '.join
to ensure a space between consecutive lines.
You can identify groups on consecutive lines with same sender with (df['Sender'] != df['Sender'].shift()).cumsum()
.
In the end, you can do:
resul = df.groupby((df['Sender'] != df['Sender'].shift()).cumsum()
).agg({'Sender': 'first', 'Body': ' '.join, 'Date/Time': max}
).rename_axis(None)
It gives:
Sender Body Date/Time
1 ABC Hello John 01/01/20 1:58
2 XYZ Please Wait for your turn While waiting for ou... 01/01/20 17:59
3 ABC Your Parcel is Here 03/01/20 11:57
Upvotes: 0
Reputation: 8521
Can you try the following:
df = pd.read_csv(<csv file name>)
df_concat = pd.DataFrame(columns=['Concat', 'Date/Time'])
df_concat['Concat'] = df.groupby('Sender')['Body'].apply(' '.join)
df_concat['Date/Time'] = df.groupby('Sender')['Date/Time'].last()
Upvotes: 0