Reputation: 1815
I want to get rows of a most recent day which is in ascending order of time way.
I get dataframe as follows:
label uId adId operTime siteId slotId contentId netType
0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
Cause I get about 100 million rows in this csv file, it is impossible to load all this into my PC memory.
So I want to get rows of a most recent day in ascending order of time way when reading this csv files.
For examples, if the most recent day is on 2019-04-04, it will output as follows:
#this not a real data, just for examples.
label uId adId operTime siteId slotId contentId netType
0 0 u147336431 3887 2019-04-04 00:08:42.315 1 54 2427 2
1 0 u146933269 1462 2019-04-04 01:06:16.417 30 36 1343 6
2 0 u139536523 2084 2019-04-04 02:08:58.079 15 23 1536 7
3 0 u106663472 1460 2019-04-04 03:21:13.050 32 45 1352 2
4 0 u121642861 2295 2019-04-04 04:36:08.653 3 33 3267 4
Could anyone help me?
Thanks in advence.
Upvotes: 1
Views: 53
Reputation: 3331
Like @anky_91 mentionned, you can use the sort_values function. Here is a short example of how it works:
df = pd.DataFrame( {'Symbol':['A','A','A'] ,
'Date':['02/20/2015','01/15/2016','08/21/2015']})
df.sort_values(by='Date')
Out :
Date Symbol
2 08/21/2015 A
0 02/20/2015 A
1 01/15/2016 A
Upvotes: 0
Reputation: 1120
I'm assuming you can't read the entire file into memory, and the file is in a random order. You can read the file in chunks and iterate through the chunks.
# read 50,000 lines of the file at a time
reader = pd.read_csv(
'csv_file.csv',
parse_dates=True,
chunksize=5e5,
header=0
)
recent_day=pd.datetime(2019,4,4)
next_day=recent_day + pd.Timedelta(days=1)
df_list=[]
for chunk in reader:
#check if any rows match the date range
date_rows = chunk.loc[
(chunk['operTime'] >= recent_day]) &\
(chunk['operTime'] < next_day)
]
#append dataframe of matching rows to the list
if date_rows.empty:
pass
else:
df_list.append(date_rows)
final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
Upvotes: 1
Reputation: 104
Seconding what anky_91 said, sort_values() will be helpful here.
import pandas as pd
df = pd.read_csv('file.csv')
# >>> df
# label uId adId operTime siteId slotId contentId netType
# 0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
sub_df = df[(df['operTime']>'2019-03-31') & (df['operTime']<'2019-04-01')]
# >>> sub_df
# label uId adId operTime siteId slotId contentId netType
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
final_df = sub_df.sort_values(by=['operTime'])
# >>> final_df
# label uId adId operTime siteId slotId contentId netType
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
I think you could also use a datetimeindex here; that might be necessary if the file is sufficiently large.
Upvotes: 0