Reputation: 67
I have currently a dataframe I created by scraping google news headlines. One of my columns is "Time", which refers to time of publication of an article.
Unfortunately, for recent articles, google news uses a "relative" date, e.g., 6 hours ago, or 1 day ago instead of Nov 1, 2017.
I really want to convert these relative dates to be consistent with the other entries (so they also say Nov 12, 2017, for example), but I have no idea where to even start on this.
My thoughts are to maybe make a variable which represents todays date, and then do some kind of search through the dataframe for stuff which doesn't match my format, and then to subtract those relative times with the current date. I would also have to make some sort of filter for stuff which has "hours ago" and just have those equal the current date.
I don't really want a solution but rather a general idea of what to read to try to solve this. Am I supposed to try using numpy?
Example of some rows:
Publication Time Headline
0 The San Diego Union-Tribune 6 hours ago I am not opposed to new therapeutic modalities...
1 Devon Live 13 hours ago If you're looking for a bargain this Christmas...
15 ABS-CBN News 1 day ago Now, Thirdy has a chance to do something that ...
26 New York Times Nov 2, 2017 Shepherds lead their sheep through the centre ...
Upvotes: 2
Views: 1004
Reputation: 21274
Your approach should work. Use Pandas Timedelta
to subtract relative dates from the current date.
For example, given your sample data as:
Publication;Time;Headline
The San Diego Union-Tribune;6 hours ago;I am not opposed to new therapeutic modalities
Devon Live;13 hours ago;If you're looking for a bargain this Christmas
ABS-CBN News;1 day ago;Now, Thirdy has a chance to do something that
New York Times;Nov 2, 2017;Shepherds lead their sheep through the centre
Read in the data from the clipboard (although you could just as easily substitute with read_csv()
or some other file format):
import pandas as pd
from datetime import datetime
df = pd.read_clipboard(sep=";")
For the dates that are already in date format, Pandas is smart enough to convert them with to_datetime()
:
absolute_date = pd.to_datetime(df.Time, errors="coerce")
absolute_date
0 NaT
1 NaT
2 NaT
3 2017-11-02
Name: Time, dtype: datetime64[ns]
For the relative dates, once we drop the "ago" part, they're basically in the right format to convert with pd.Timedelta
:
relative_date = (datetime.today() -
df.Time.str.extract("(.*) ago", expand=False).apply(pd.Timedelta))
relative_date
0 2017-11-11 17:05:54.143548
1 2017-11-11 10:05:54.143548
2 2017-11-10 23:05:54.143548
3 NaT
Name: Time, dtype: datetime64[ns]
Now fill in the respective NaN
values from each set, absolute and relative (updated to use combine_first()
, via Jezrael's answer):
date = relative_date.combine_first(absolute_date)
relative_date
0 2017-11-11 17:06:29.658925
1 2017-11-11 10:06:29.658925
2 2017-11-10 23:06:29.658925
3 2017-11-02 00:00:00.000000
Name: Time, dtype: datetime64[ns]
Finally, pull out just the date from the datetime:
date.dt.date
0 2017-11-11
1 2017-11-11
2 2017-11-10
3 2017-11-02
Name: Time, dtype: object
Upvotes: 2
Reputation: 862711
You can use to_datetime
with to_timedelta
first and then use combine_first
with floor
:
#create dates
dates = pd.to_datetime(df['Time'], errors='coerce')
#create times
times = pd.to_timedelta(df['Time'].str.extract('(.*)\s+ago', expand=False))
#combine final datetimes
df['Time'] = (pd.datetime.now() - times).combine_first(dates).dt.floor('D')
print (df)
Publication Time \
0 The San Diego Union-Tribune 2017-11-12
1 Devon Live 2017-11-11
2 ABS-CBN News 2017-11-11
3 New York Times 2017-11-02
Headline
0 I am not opposed to new therapeutic modalities
1 If you're looking for a bargain this Christmas
2 Now, Thirdy has a chance to do something that
3 Shepherds lead their sheep through the centre
print (df['Time'])
0 2017-11-12
1 2017-11-11
2 2017-11-11
3 2017-11-02
Name: Time, dtype: datetime64[ns]
Upvotes: 3