S P
S P

Reputation: 67

changing relative times to actual dates in a pandas dataframe

I have currently a dataframe I created by scraping google news headlines. One of my columns is "Time", which refers to time of publication of an article.

Unfortunately, for recent articles, google news uses a "relative" date, e.g., 6 hours ago, or 1 day ago instead of Nov 1, 2017.

I really want to convert these relative dates to be consistent with the other entries (so they also say Nov 12, 2017, for example), but I have no idea where to even start on this.

My thoughts are to maybe make a variable which represents todays date, and then do some kind of search through the dataframe for stuff which doesn't match my format, and then to subtract those relative times with the current date. I would also have to make some sort of filter for stuff which has "hours ago" and just have those equal the current date.

I don't really want a solution but rather a general idea of what to read to try to solve this. Am I supposed to try using numpy?

Example of some rows:

     Publication    Time    Headline
0   The San Diego Union-Tribune     6 hours ago     I am not opposed to new therapeutic modalities...
1   Devon Live  13 hours ago    If you're looking for a bargain this Christmas...
15  ABS-CBN News    1 day ago   Now, Thirdy has a chance to do something that ...
26  New York Times  Nov 2, 2017     Shepherds lead their sheep through the centre ...

Upvotes: 2

Views: 1004

Answers (2)

andrew_reece
andrew_reece

Reputation: 21274

Your approach should work. Use Pandas Timedelta to subtract relative dates from the current date.

For example, given your sample data as:

Publication;Time;Headline
The San Diego Union-Tribune;6 hours ago;I am not opposed to new therapeutic modalities
Devon Live;13 hours ago;If you're looking for a bargain this Christmas
ABS-CBN News;1 day ago;Now, Thirdy has a chance to do something that
New York Times;Nov 2, 2017;Shepherds lead their sheep through the centre

Read in the data from the clipboard (although you could just as easily substitute with read_csv() or some other file format):

import pandas as pd
from datetime import datetime

df = pd.read_clipboard(sep=";")

For the dates that are already in date format, Pandas is smart enough to convert them with to_datetime():

absolute_date = pd.to_datetime(df.Time, errors="coerce")

absolute_date
0          NaT
1          NaT
2          NaT
3   2017-11-02
Name: Time, dtype: datetime64[ns]

For the relative dates, once we drop the "ago" part, they're basically in the right format to convert with pd.Timedelta:

relative_date = (datetime.today() - 
                 df.Time.str.extract("(.*) ago", expand=False).apply(pd.Timedelta))

relative_date
0   2017-11-11 17:05:54.143548
1   2017-11-11 10:05:54.143548
2   2017-11-10 23:05:54.143548
3                          NaT
Name: Time, dtype: datetime64[ns]

Now fill in the respective NaN values from each set, absolute and relative (updated to use combine_first(), via Jezrael's answer):

date = relative_date.combine_first(absolute_date)

relative_date
0   2017-11-11 17:06:29.658925
1   2017-11-11 10:06:29.658925
2   2017-11-10 23:06:29.658925
3   2017-11-02 00:00:00.000000
Name: Time, dtype: datetime64[ns]

Finally, pull out just the date from the datetime:

date.dt.date
0    2017-11-11
1    2017-11-11
2    2017-11-10
3    2017-11-02
Name: Time, dtype: object

Upvotes: 2

jezrael
jezrael

Reputation: 862711

You can use to_datetime with to_timedelta first and then use combine_first with floor:

#create dates
dates = pd.to_datetime(df['Time'], errors='coerce')
#create times
times = pd.to_timedelta(df['Time'].str.extract('(.*)\s+ago', expand=False))
#combine final datetimes
df['Time'] = (pd.datetime.now() - times).combine_first(dates).dt.floor('D')

print (df)
                   Publication       Time  \
0  The San Diego Union-Tribune 2017-11-12   
1                   Devon Live 2017-11-11   
2                 ABS-CBN News 2017-11-11   
3               New York Times 2017-11-02   

                                         Headline  
0  I am not opposed to new therapeutic modalities  
1  If you're looking for a bargain this Christmas  
2   Now, Thirdy has a chance to do something that  
3   Shepherds lead their sheep through the centre  

print (df['Time'])
0   2017-11-12
1   2017-11-11
2   2017-11-11
3   2017-11-02
Name: Time, dtype: datetime64[ns]

Upvotes: 3

Related Questions