Bowen Peng

Reputation: 1815

How to get the rows of the most recent day, in ascending order of time, when reading a CSV file?

I want to get the rows of the most recent day, sorted in ascending time order.
My dataframe looks like this:

    label      uId          adId    operTime                    siteId  slotId  contentId   netType
0   0          u147333631   3887    2019-03-30 15:01:55.617     10      30      2137        1
1   0          u146930169   1462    2019-03-31 09:51:15.275     3       32      1373        1
2   0          u139816523   2084    2019-03-27 08:10:41.769     10      30      2336        1
3   0          u106546472   1460    2019-03-31 08:51:41.085     3       32      1371        4
4   0          u106642861   2295    2019-03-27 22:58:03.679     3       32      2567        4

Because the CSV file has about 100 million rows, it is impossible to load all of it into my PC's memory.
So I want to extract the rows of the most recent day, in ascending time order, while reading the CSV file.
For example, if the most recent day is 2019-04-04, the output would be as follows:

# this is not real data, just an example
    label      uId          adId    operTime                    siteId  slotId  contentId   netType
0   0          u147336431   3887    2019-04-04 00:08:42.315     1       54      2427        2
1   0          u146933269   1462    2019-04-04 01:06:16.417     30      36      1343        6
2   0          u139536523   2084    2019-04-04 02:08:58.079     15      23      1536        7
3   0          u106663472   1460    2019-04-04 03:21:13.050     32      45      1352        2
4   0          u121642861   2295    2019-04-04 04:36:08.653     3       33      3267        4

Could anyone help me?
Thanks in advance.

Upvotes: 1

Views: 53

Answers (3)

vlemaistre

Reputation: 3331

Like @anky_91 mentioned, you can use the sort_values function. Note that you should convert the column to datetime first; otherwise the dates are compared as plain strings and the sort is not chronological. Here is a short example of how it works:

import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '08/21/2015']})
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by='Date')

Out :

  Symbol       Date
0      A 2015-02-20
2      A 2015-08-21
1      A 2016-01-15

Upvotes: 0

iamchoosinganame

Reputation: 1120

I'm assuming you can't read the entire file into memory, and the file is in a random order. You can read the file in chunks and iterate through the chunks.

import pandas as pd

# read 500,000 rows of the file at a time
reader = pd.read_csv(
    'csv_file.csv',
    parse_dates=['operTime'],
    chunksize=500_000,
    header=0
)

recent_day = pd.Timestamp(2019, 4, 4)
next_day = recent_day + pd.Timedelta(days=1)
df_list = []

for chunk in reader:
    # keep only the rows whose operTime falls on the target day
    date_rows = chunk.loc[
        (chunk['operTime'] >= recent_day) &
        (chunk['operTime'] < next_day)
    ]
    # append the matching rows (if any) to the list
    if not date_rows.empty:
        df_list.append(date_rows)

final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
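Since the most recent day isn't known in advance, one option is a cheap first pass that scans only the operTime column to find it, then a second pass like the one above to filter. A minimal sketch of the first pass, using a tiny in-memory CSV (with made-up values) standing in for the real file:

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for the real 100-million-row file
csv_data = io.StringIO(
    "label,uId,adId,operTime\n"
    "0,u147333631,3887,2019-03-30 15:01:55.617\n"
    "0,u146930169,1462,2019-03-31 09:51:15.275\n"
    "0,u139816523,2084,2019-03-27 08:10:41.769\n"
)

# first pass: scan only the operTime column, chunk by chunk,
# keeping the latest timestamp seen so far
max_time = None
for chunk in pd.read_csv(csv_data, usecols=['operTime'],
                         parse_dates=['operTime'], chunksize=2):
    chunk_max = chunk['operTime'].max()
    if max_time is None or chunk_max > max_time:
        max_time = chunk_max

recent_day = max_time.normalize()  # midnight of the most recent day
next_day = recent_day + pd.Timedelta(days=1)
```

Reading a single column keeps the first pass light on memory even for a very large file.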

Upvotes: 1

Eli

Reputation: 104

Seconding what anky_91 said, sort_values() will be helpful here.

import pandas as pd

df = pd.read_csv('file.csv')

# >>> df
#    label         uId  adId                 operTime  siteId  slotId  contentId  netType
# 0      0  u147333631  3887  2019-03-30 15:01:55.617      10      30       2137        1
# 1      0  u146930169  1462  2019-03-31 09:51:15.275       3      32       1373        1
# 2      0  u139816523  2084  2019-03-27 08:10:41.769      10      30       2336        1
# 3      0  u106546472  1460  2019-03-31 08:51:41.085       3      32       1371        4
# 4      0  u106642861  2295  2019-03-27 22:58:03.679       3      32       2567        4

sub_df = df[(df['operTime']>'2019-03-31') & (df['operTime']<'2019-04-01')]

# >>> sub_df
#    label         uId  adId                 operTime  siteId  slotId  contentId  netType
# 1      0  u146930169  1462  2019-03-31 09:51:15.275       3      32       1373        1
# 3      0  u106546472  1460  2019-03-31 08:51:41.085       3      32       1371        4

final_df = sub_df.sort_values(by=['operTime'])

# >>> final_df
#    label         uId  adId                 operTime  siteId  slotId  contentId  netType
# 3      0  u106546472  1460  2019-03-31 08:51:41.085       3      32       1371        4
# 1      0  u146930169  1462  2019-03-31 09:51:15.275       3      32       1373        1

I think you could also use a DatetimeIndex here; that might be necessary if the file is sufficiently large.
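A minimal sketch of the DatetimeIndex idea, using a small in-memory sample of the data: with operTime as the index, a partial date string selects a whole day and sort_index() orders it by time.

```python
import io
import pandas as pd

# small sample of the data, with operTime as the index column
csv_data = io.StringIO(
    "label,uId,operTime\n"
    "0,u147333631,2019-03-30 15:01:55.617\n"
    "0,u146930169,2019-03-31 09:51:15.275\n"
    "0,u106546472,2019-03-31 08:51:41.085\n"
)

df = pd.read_csv(csv_data, parse_dates=['operTime'], index_col='operTime')

# with a DatetimeIndex, the partial date string '2019-03-31' selects
# every row on that day, and sort_index() puts them in ascending time order
day = df.loc['2019-03-31'].sort_index()
```

After this, `day` holds only the 2019-03-31 rows, earliest first.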

Upvotes: 0
