Reputation: 123
See the following minimal example:
import datetime

# this works
datetime.datetime(1973, 1, 23, 0).timestamp()

# this raises OSError: [Errno 22] Invalid argument
datetime.datetime(1953, 1, 23, 0).timestamp()
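A possible workaround, assuming UTC semantics are acceptable: attaching an explicit `tzinfo` makes `timestamp()` compute the offset arithmetically instead of calling the C runtime's `localtime`, which is the call that fails on Windows for pre-epoch values:

```python
import datetime

# an aware datetime never touches time.localtime(), so pre-epoch
# values work even on Windows; note this interprets the wall time
# as UTC rather than as local time
dt = datetime.datetime(1953, 1, 23, tzinfo=datetime.timezone.utc)
print(dt.timestamp())  # -534556800.0
```

If local-time semantics are required, the naive value would first need to be localized by other means.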
When I convert a Pandas DataFrame containing pre-epoch datetime64[ns] dates to an Apache Spark DataFrame, I get a series of warnings about Exception ignored in: 'pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc'
(full stack trace below), and the pre-epoch dates are silently changed to the epoch. Why does this happen, and how do I prevent it?
Environment: Windows 10, Python 3.7.6, pyspark 2.4.5, pandas 1.0.1
# imports
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession

# set up spark
spark = SparkSession.builder.getOrCreate()

# create a dataframe with one post-epoch and one pre-epoch date
df = pd.DataFrame({'Dates': [datetime(2019, 3, 29), datetime(1953, 2, 20)]})

# data types
df.dtypes
"""
Result:
Dates datetime64[ns]
dtype: object
"""
# try to convert to a Spark DataFrame
sparkdf = spark.createDataFrame(df)
Exception ignored in: 'pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc'
Traceback (most recent call last):
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 144, in fromutc
return f(self, dt)
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 258, in fromutc
dt_wall = self._fromutc(dt)
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 222, in _fromutc
dtoff = dt.utcoffset()
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 222, in utcoffset
if self._isdst(dt):
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 291, in _isdst
dstval = self._naive_is_dst(dt)
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 260, in _naive_is_dst
return time.localtime(timestamp + time.timezone).tm_isdst
OSError: [Errno 22] Invalid argument
sparkdf.show()
+-------------------+
| Dates|
+-------------------+
|2019-03-29 00:00:00|
|1970-01-01 00:00:00|
+-------------------+
sparkdf.printSchema()
root
|-- Dates: timestamp (nullable = true)
Upvotes: 3
Views: 1281
Reputation: 11
This is not an answer, but it may shed light on the situation. Even dates approaching 1970-01-01 00:00 give me the Errno 22. (I wish to resolve some very early epoch times, for data being logged before NTP time is available.) The earliest date that works as expected is somewhere around 1970-01-01 17:00 local time:
>>> import datetime
>>> a3 = datetime.datetime(1970,1,1,17,59)
>>> a1 = datetime.datetime(1970,1,1,17,0)
>>> a0 = datetime.datetime(1970,1,1,16,59)
>>> a3.timestamp()
89940.0
>>> a1.timestamp()
86400.0
>>> a0.timestamp()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>>
My computer is in the MDT time zone, i.e. GMT-6 (MST with daylight saving time).
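The cutoff observed above lines up with timestamp 86400, i.e. 1970-01-02 00:00 UTC: 17:00 at UTC-7 (MST) is exactly that instant. A small sketch (the fixed offset is an assumption for illustration; the real MST/MDT rules come from the OS):

```python
from datetime import datetime, timedelta, timezone

# fixed UTC-7 offset standing in for MST; deliberately ignores DST
MST = timezone(timedelta(hours=-7))

# 17:00 MST on 1970-01-01 is 1970-01-02 00:00 UTC, i.e. timestamp 86400;
# aware datetimes are converted arithmetically, with no time.localtime call
print(datetime(1970, 1, 1, 17, 0, tzinfo=MST).timestamp())  # 86400.0
```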
Python newbie and first-time answer writer! Windows 10, Python 3.7 (64-bit); Anaconda installed but not in use here.
Upvotes: 1