Chris T.

Reputation: 1799

Chronologically sorting date time (YYYY-MM-DD) series with precision down to the level of day

I have a pandas Series of dates (originally extracted as strings) that I would like to sort chronologically. The values have already been converted to the YYYY-MM-DD format, like this:

0     1993-03-25
1     1985-06-18
2     1971-07-08
3     1975-09-27
4     2000-02-06
5     1979-07-06
6     1978-05-18
7     1989-10-24
8     1989-10-24
9     1971-04-10
10    1985-05-11
11    2011-04-09
12    1998-08-01
13    1972-01-26
14    1990-05-24

Note: this is only a small fraction of the data, shown for illustration.

I want to sort them in chronological order down to the day (year, then month, then day) and rank them by their indices in the original series, so that the new index is on the left and the original index of each date, ordered by chronological rank, is on the right:

0     10
1     7
2     1
3     3
4     12
5     5
6     4
7     8
8     8
9     0
10    6
11    13
12    11
13    2
14    9

However, note that some dates are tied. For example, df[7] and df[8] fall on the same day, so both get rank 8.

I have tried .rank(method='dense').sub(1).astype(int) and .sort_values(kind='mergesort') to order the series by year, month, and day, but I can't get rid of the ties.

Is there a better approach that resolves the tied ranks and gives the output I want?

Thank you.

New Edit

I used the following code to generate df. The .txt file contains a large number of unstructured text strings, from which I extracted the date elements using re.findall(r' ').

import pandas as pd
import re  
import datetime

#load text string
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

# extract datetimes from different datetime patterns, the extracted datetime elements are in string format contained in list [] object

df['date'] = df.str.findall(r'\b....\b')

# manually replace some irregular patterns/expressions
df['date'].iloc[...] = ['10/21/79']
df['date'].iloc[...] = ['7/11/2000']
            ...
df['date'].drop('date', inplace=True)

# convert list object in each cell to string
df['date'] = df['date'].apply(lambda x: ', '.join(x))

# convert to datetime format and check for NaT cell.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(x, errors='coerce'))


   
Each cell of the resulting series is a Timestamp displayed in the form YYYY-MM-DD.

Upvotes: 0

Views: 3054

Answers (1)

Cory Madden

Reputation: 5193

You can add a column that holds the dates as datetime objects and then sort by that column.

In [103]: df = pd.read_csv('t.csv', header=0, sep=r'\s+', index_col='id')

In [105]: df['date2'] = df.date.astype('datetime64[ns]')

In [106]: df.sort_values('date2')

Out[106]: 
          date      date2
id                       
9   1971-04-10 1971-04-10
2   1971-07-08 1971-07-08
13  1972-01-26 1972-01-26
3   1975-09-27 1975-09-27
6   1978-05-18 1978-05-18
5   1979-07-06 1979-07-06
10  1985-05-11 1985-05-11
1   1985-06-18 1985-06-18
7   1989-10-24 1989-10-24
8   1989-10-24 1989-10-24
14  1990-05-24 1990-05-24
0   1993-03-25 1993-03-25
12  1998-08-01 1998-08-01
4   2000-02-06 2000-02-06
11  2011-04-09 2011-04-09
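
To reproduce this without t.csv, here is a minimal sketch that builds the 15 sample dates from the question inline. The variable names and the choice of kind='mergesort' (a stable sort, so tied dates keep their original relative order) are my assumptions, not part of the session above.

```python
import pandas as pd

# The 15 sample dates from the question, in their original order.
dates = [
    "1993-03-25", "1985-06-18", "1971-07-08", "1975-09-27", "2000-02-06",
    "1979-07-06", "1978-05-18", "1989-10-24", "1989-10-24", "1971-04-10",
    "1985-05-11", "2011-04-09", "1998-08-01", "1972-01-26", "1990-05-24",
]
df = pd.DataFrame({"date": dates})

# Parse the strings into real datetimes so the sort is chronological.
df["date2"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# mergesort is stable: rows 7 and 8 (both 1989-10-24) stay in that order.
sorted_df = df.sort_values("date2", kind="mergesort")

# Original indices listed in chronological order.
order = sorted_df.index.tolist()
print(order)
```

The printed list should match the id column of the sorted output above, with 7 ahead of 8 for the tied date.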

And if you want to add the ranking column:

In [112]: df['sorting'] = df.sort_values('date2').index

In [113]: df.sorting
Out[113]: 
id
0      9
1      2
2     13
3      3
4      6
5      5
6     10
7      1
8      7
9      8
10    14
11     0
12    12
13     4
14    11
Name: sorting, dtype: int64
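
Note that the column above lists the original indices in chronological order, which is an ordering rather than a per-row rank. If what you want is the chronological rank of each row, with ties broken by original position instead of sharing a number, Series.rank(method='first') is one option. The sketch below assumes the 15 sample dates from the question and compares it with the method='dense' attempt; the variable names are mine.

```python
import pandas as pd

# The 15 sample dates from the question, parsed to datetime64.
dates = pd.to_datetime(pd.Series([
    "1993-03-25", "1985-06-18", "1971-07-08", "1975-09-27", "2000-02-06",
    "1979-07-06", "1978-05-18", "1989-10-24", "1989-10-24", "1971-04-10",
    "1985-05-11", "2011-04-09", "1998-08-01", "1972-01-26", "1990-05-24",
]))

# 'dense' gives tied dates the same rank (what the question observed).
dense = dates.rank(method="dense").sub(1).astype(int)

# 'first' breaks ties by order of appearance, so every rank is unique.
first = dates.rank(method="first").sub(1).astype(int)

print(dense[[7, 8]].tolist())  # both rows share one rank
print(first[[7, 8]].tolist())  # consecutive, distinct ranks
```

With method='first', rows 7 and 8 get ranks 8 and 9, and every later date shifts up by one compared with the dense ranking.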

Since your csv doesn't actually have a header row like I added, do this:

In [132]: df = pd.read_csv('t.csv', header=None, sep=r'\s+', index_col=0)
In [133]: df[2] = df[1].astype('datetime64[ns]')
In [134]: df[3] = df.sort_values(2).index
In [135]: df[3]
Out[135]: 
0
0      9
1      2
2     13
3      3
4      6
5      5
6     10
7      1
8      7
9      8
10    14
11     0
12    12
13     4
14    11
Name: 3, dtype: int64

OK, assuming they're already Timestamp objects, as produced by the last line of the code you provided, you can just sort them as they are:

In [194]: df = pd.read_csv('dates.txt', sep=r'\s+', index_col=0)

In [195]: df['date'] = pd.to_datetime(df['date'], errors='coerce')

In [196]: df['sorting'] = df['date'].sort_values().index

In [197]: df
Out[197]: 
         date  sorting
id                    
0  1993-03-25        9
1  1985-06-18        2
2  1971-07-08       13
3  1975-09-27        3
4  2000-02-06        6
5  1979-07-06        5
6  1978-05-18       10
7  1989-10-24        1
8  1989-10-24        7
9  1971-04-10        8
10 1985-05-11       14
11 2011-04-09        0
12 1998-08-01       12
13 1972-01-26        4
14 1990-05-24       11
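
One thing to watch for with errors='coerce': strings that cannot be parsed become NaT, and sort_values places missing values at the end by default (na_position='last'). The sketch below uses made-up strings, not data from dates.txt, to show where such rows land and how to leave them out.

```python
import pandas as pd

# Hypothetical raw strings; "not a date" stands in for any unparseable entry.
raw = pd.Series(["1993-03-25", "not a date", "1971-07-08", "1985-06-18"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

# NaT rows are placed last by default (na_position='last').
order = parsed.sort_values(kind="mergesort").index.tolist()
print(order)

# Drop NaT first if unparseable rows should not be ordered at all.
order_valid = parsed.dropna().sort_values(kind="mergesort").index.tolist()
print(order_valid)
```

Checking parsed.isna().sum() before sorting is a quick way to see whether any rows were coerced.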

Upvotes: 1
