Reputation: 119
I am quite new to panda (just a few days getting my hands on it), although I am still in progress to learn and to explore on using the Pandas. I have a big size of csv file consist of hundred thousands of rows. My objective is to concatenate multiple row into a single row based on multiple column. On top of that, by referring the date/time as well which later need to be included. Below illustrate my csv file.
Body UDH Original Sender ID Received Date/Time
Hi John, Can You ABC0010101 GGQMS 01/02/2001 01:03:19
Wait A moment? ABC0010102 GGQMS 01/02/2001 01:03:20
Whats is 050004000111 112233445566 01/03/2001 11:16:01
Carrine Doing 050004000112 112233445566 01/03/2001 11:16:01
Over There? 050004000113 112233445566 01/03/2001 11:16:02
Where is CD10F1011 zwerty 01/03/2001 15:22:10
Your Homework? CD10F1012 zwerty 01/03/2001 15:22:11
Order for Pizza AACCDD55001 112233445566 01/04/2001 19:20:21
Now for cheap $. AACCDD55002 112233445566 01/04/2001 19:20:22
John, you know G0500781 GGQMS 01/04/2001 10:21:21
Where can I get it? G0500782 GGQMS 01/04/2001 10:21:21
As you can see above is my csv file. The UDH here act as the primary key, as per number of characters (first until second last) is where we can identify the body is belongs to. Another part is the Received Date/Time where second part of body received a 1 second late or maybe more than 1 second.
I've managed to concatenation the body, however, certain body consist of third part which I didn't manage to concat the body altogether.
Below is my current codes:
def problem3():
filep2 = pd.read_csv(r'/Users/John/Downloads/Practice1/my_r.csv')
#data cleaning
filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')
filep2['UDH']=filep2['UDH'].astype(object)
filep2['Original Sender ID']=filep2['Original Sender ID'].astype(object)
filep2['Account User Name']=filep2['Account User Name'].astype(object)
filep2['Body']=filep2['Body'].astype(str)
filep2['UDH']=filep2['UDH'].str.strip()
df = pd.DataFrame(filep2)
#Filter null row in UDH column
df=df[df['UDH'].notnull()]
df=df.sort_values(by ='UDH')
df['Body'] = df.apply(multiple_condition, axis=1)
df.to_csv(r'/Users/John/Downloads/Practice1/my_c.csv', index=False, header=True)
def multiple_condition (df):
if (df['UDH'].str.len() == 8):
df=df.groupby(df[['UDH'].str[:7],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
return df
elif (df['UDH'].str.len() == 9):
df= df.groupby(df[['UDH'].str[:8],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
return df
elif (df['UDH'].str.len() == 10):
df= df.groupby(df[['UDH'].str[:9],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
return df
elif (df['UDH'].str.len() == 11):
df=df.groupby(df[['UDH'].str[:10],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
return df
elif (df['UDH'].str.len() == 12):
df=df.groupby(df[['UDH'].str[:11],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
return df
Above code gave the error as stated as the subject of this topic/ticket. Error message stated as follows;
UPDATED ERROR MESSAGE
Traceback (most recent call last):
File "<ipython-input-85-8ca58b5f49ad>", line 1, in <module>
runfile('/Users/syafiq/Downloads/RoutingPractice01.py', wdir='/Users/syafiq/Downloads')
File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/John/Downloads/RoutingPractice01.py", line 79, in <module>
problem3()
File "/Users/John/Downloads/RoutingPractice01.py", line 35, in problem3
filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')
File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2980, in __getitem__
indexer = self.columns.get_loc(key)
File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Received Date/Time'
Below is the desired output:
Body Original Sender ID Received Date/Time
Hi John, Can You Wait A Moment? GGQMS 01/02/2001 01:03:20
What Is carbine doing over there? 112233445566 01/03/2001 11:16:02
Where is your homework? zwerty 01/03/2001 15:22:11
Order for Pizza Now for cheap $ 112233445566 01/04/2001 19:20:22
John, you know where can I get it? GGQMS 01/04/2001 10:21:21
Noted: I've try multiple way on to obtain the desired output above, still couldn't resolve/error occurred. I've tried countless time with different approach, still no dice, keep hitting the brick wall. The UDH is the Identifier for group the body
I'm still new to Pandas and haven't get my hands dirty with Python in a while. I do appreciate if somebody could highlight where did I do wrong. And as well would be much grateful with your help on obtaining my desired output.
Thank you very much and very appreciated! :)
Upvotes: 1
Views: 339
Reputation: 142631
I can get (more or less) expected result without apply()
but using directly groupby()
groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
df2 = groups.agg({'Body':' '.join, 'Received Date/Time':max}).reset_index()
I use io.StringIO()
only to simulate file.
text = ''' Body UDH Original Sender ID Received Date/Time
Hi John, Can You ABC0010101 GGQMS 01/02/2001 01:03:19
Wait A moment? ABC0010102 GGQMS 01/02/2001 01:03:20
Whats is 050004000111 112233445566 01/03/2001 11:16:01
Carrine Doing 050004000112 112233445566 01/03/2001 11:16:01
Over There? 050004000113 112233445566 01/03/2001 11:16:02
Where is CD10F1011 zwerty 01/03/2001 15:22:10
Your Homework? CD10F1012 zwerty 01/03/2001 15:22:11
Order for Pizza AACCDD55001 112233445566 01/04/2001 19:20:21
Now for cheap $. AACCDD55002 112233445566 01/04/2001 19:20:22
John, you know G0500781 GGQMS 01/04/2001 10:21:21
Where can I get it? G0500782 GGQMS 01/04/2001 10:21:21'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text), sep='\s{2,}')
#df['Received Date/Time'] = df['Received Date/Time'].astype('datetime64[ns]')
#df['UDH'] = df['UDH'].astype(object)
#df['Original Sender ID'] = df['Original Sender ID'].astype(object)
#df['Account User Name'] = df['Account User Name'].astype(object)
#df['Body'] = df['Body'].astype(str)
#df['UDH'] = df['UDH'].str.strip()
#Filter null row in UDH column
#df = df[df['UDH'].notnull()]
#df = df.sort_values(by ='UDH')
#groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
#for name, data in groups:
#print(name)
# data['Received Date/Time'] = data['Received Date/Time'].min()
#print(data)
groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
df2 = groups.agg({'Body':' '.join, 'Received Date/Time':max}).reset_index()
#groups = df.groupby([df['UDH'].str[:-1]])
#df2 = groups.agg({'Body':' '.join, 'Received Date/Time':max, 'Original Sender ID':min}).reset_index()
df2 = df2.sort_values('Received Date/Time')
pd.options.display.width = 200
print(df2)
Result
UDH Original Sender ID Body Received Date/Time
2 ABC001010 GGQMS Hi John, Can You Wait A moment? 01/02/2001 01:03:20
0 05000400011 112233445566 Whats is Carrine Doing Over There? 01/03/2001 11:16:02
3 CD10F101 zwerty Where is Your Homework? 01/03/2001 15:22:11
4 G050078 GGQMS John, you know Where can I get it? 01/04/2001 10:21:21
1 AACCDD5500 112233445566 Order for Pizza Now for cheap $. 01/04/2001 19:20:22
Upvotes: 1