pierre_j
pierre_j

Reputation: 983

How to strip part of a string in a pandas dataframe in-place to have the result understood as a timestamp with timezone?

1. The problem

I have data in a csv file with timestamps. Here is an example of the format of these timestamps:

'Fri Aug 18 2017 00:00:00 GMT+0200 (heure d%u2019été d%u2019Europe centrale)'

Pandas does not seem to be able to process such a string formatting. Using the to_datetime function to check if pandas is able to process it, I have the following error:

In [3]: pd.to_datetime('Fri Aug 18 2017 00:00:00 GMT+0200 (heure d%u2019été d%u2019Europe centrale)')
Out[3]: raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', 'Fri Aug 18 2017 00:00:00 GMT+0200 (heure d%u2019été d%u2019Europe centrale)')

2. Logic identified

But if I remove the blabla, it appears to work:

In [4]: pd.to_datetime('Fri Aug 18 2017 00:00:00 GMT+0200')
Out[4]: Timestamp('2017-08-18 00:00:00-0200', tz='pytz.FixedOffset(-120)')

So the solution to have pandas understanding correctly this column as timestamps with time zones is 1st to strip the blabla:

' (heure d%u2019été d%u2019Europe centrale)'

Then to apply the to_datetime function.

3. Implementation

So, I tried to use the following lines of code, and it just is not working.

Dataframe is:

,date
0,Fri Aug 18 2017 00:00:00 GMT+0200 (heure d%u2019été d%u2019Europe centrale)
1,Sat Aug 19 2017 00:00:00 GMT+0200 (heure d%u2019été d%u2019Europe centrale)

Dataframe is stored as 'GC' I tried:

for ts in GC[1][:]:
    ts.replace(' (heure d%u2019été d%u2019Europe centrale)')
    ts.to_datetime()

I have this output (not very clear to me - I am using spyder, this may be not very helping)

File "pandas\_libs\hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 1

Please, what would be the correct approach? I thank you in advance for your help.

Have a good day,

Bests,

Pierrot

Upvotes: 1

Views: 67

Answers (1)

jezrael
jezrael

Reputation: 863166

Use Series.str.replace with column date and regex=False for avoid read () like regex:

s = df['date'].str.replace(' (heure d%u2019été d%u2019Europe centrale)', '', regex=False)
df['date'] = pd.to_datetime(s)

Or you can split values by \) with escape ) because regex special character:

df['date'] = pd.to_datetime(df['date'].str.split(' \(').str[0])

Upvotes: 1

Related Questions