hula-hula
hula-hula

Reputation: 119

new column in Pandas python based on condition

I am quite new to pandas, hence, I need help from you experts out there!

I'm quite confusing on concatenation the data from a multiple row.

#copy selected row and column. Set specific column into a appropriate data type
filep2 = pd.read_csv(r'/Users/syafiq/Downloads/RoutingPractice01/my_raw.csv')
#set column date/time and idd datatype
filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')
filep2['IDD']=filep2['IDD'].astype(str)
#select the specific column
df = pd.DataFrame(filep2, columns=['Account User Name','Sender','Body','IDD','Received Date/Time'])
df2= df.loc[df['IDD'].notnull(),['Account User Name','Sender','Body','IDD','Received Date/Time']]
df = pd.DataFrame(df2)
#create new column as Concat
concat =''
df['Concat']=concat
print(df2)

My csv file "my_raw.csv" contains thousands thousands of row with 15 columns.

While the body column content a message. The IDD column contain an assign key to that specific body message.

For example in my csv file, as follows;

Sender          Body                      UDH              Date/Time
ABC     Hello                          CD30010101       01/01/20 1:57
ABC     John                           CD30010102       01/01/20 1:58
XYZ     Please Wait for your turn      3300020201       01/01/20 17:57
XYZ     While waiting for our staff    3300020202       01/01/20 17:58
XYZ     To Complete his task           3300020203       01/01/20 17:59
ABC     Your Parcel is Here            1100DCB001       03/01/20 11:57

As you can see above, I want to concatenate the body into a single line and copy into a new column (which I already name it as a Concat). While selecting the date of the last the message were received.

For an example: ABC with UDH CD30010101 and CD30010102, the body message need to group together and copy it into a new column. While copy the last date/time received which is 01/01/20 1:58 into a new column as well.

Desired Output:

                 Concat                                 Date/Time 
(ROW 1) Hello John                                    01/01/20 1:58
(ROW 2) Please wait for your turn while waiting
        for our staff to complete his task            01/01/20 17:59

I've been trying it for days, still no dice, keep hitting the brick wall.

Need your guidance and expertise!!

Appreciate and thank you very much!

Upvotes: 0

Views: 150

Answers (4)

hula-hula
hula-hula

Reputation: 119

I figure it out, by using .str[:-1] in UDH column while selecting the last date of the received date/time.

groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
df = groups.agg({'Body':''.join, 'Received Date/Time':max}).reset_index()
df = df.sort_values('Received Date/Time')
pd.options.display.width = 200
print(df.sort_values('Received Date/Time'))

Hope it helps the others out there who trying to implement the same thing

Upvotes: 0

Toukenize
Toukenize

Reputation: 1420

Welcome to Stack Overflow.

Edited based on your file actual column name and desired column name:

I just realize you want to group them by Sender as well as UHG that starts with the same characters (how many? You didn't specify, so I just stick to 5 chars in my example below).

The following should achieve what you require:

def concat_series_to_string(series):
    return ' '.join(series)

df['Received Date/Time'] = pd.to_datetime(df['Received Date/Time'])

# Update according to your actual session identification method.
df['UDH_session'] = df['UDH'].str[:5] 

df_concat = df\
    .groupby(['Sender','UDH_session'])\
    .agg({'Body':[concat_series_to_string],'Received Date/Time':['last']})\

df_concat.columns = ['Concat','Received Date/Time (last)']
df_concat = df_concat.sort_values('Received Date/Time (last)')

which gives you: enter image description here

Upvotes: 3

Serge Ballesta
Serge Ballesta

Reputation: 148910

I will assume that you only want to (almost) concat consecutive lines with same sender. In fact the required operation is probably more a ' '.join to ensure a space between consecutive lines.

You can identify groups on consecutive lines with same sender with (df['Sender'] != df['Sender'].shift()).cumsum().

In the end, you can do:

resul = df.groupby((df['Sender'] != df['Sender'].shift()).cumsum()
                    ).agg({'Sender': 'first', 'Body': ' '.join, 'Date/Time': max}
                         ).rename_axis(None)

It gives:

  Sender                                               Body       Date/Time
1    ABC                                         Hello John   01/01/20 1:58
2    XYZ  Please Wait for your turn While waiting for ou...  01/01/20 17:59
3    ABC                                Your Parcel is Here  03/01/20 11:57

Upvotes: 0

Jeril
Jeril

Reputation: 8521

Can you try the following:

df = pd.read_csv(<csv file name>)
df_concat = pd.DataFrame(columns=['Concat', 'Date/Time'])
df_concat['Concat'] = df.groupby('Sender')['Body'].apply(' '.join)
df_concat['Date/Time'] = df.groupby('Sender')['Date/Time'].last()

Upvotes: 0

Related Questions