Jacob

Reputation: 123

Error When Converting Pandas DataFrame with Dates to Spark DataFrame

EDIT: The issue appears to be in the standard datetime library's conversion of datetimes to timestamps for pre-epoch datetimes on Windows.

See the following minimal example:

import datetime

#this works
datetime.datetime(1973,1,23,0).timestamp()

#this produces OSError: [Errno 22] Invalid argument
datetime.datetime(1953,1,23,0).timestamp()
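If I attach an explicit tzinfo, the same pre-epoch date converts fine, presumably because timestamp() then does the subtraction from the epoch in pure Python instead of going through the OS local-time routines (that is my reading of the behaviour, shown here as a sketch rather than a fix for the Spark conversion itself):

import datetime

#the same pre-epoch date works when the datetime is timezone-aware, since
#timestamp() then computes (self - epoch) directly instead of asking the OS
datetime.datetime(1953,1,23,0, tzinfo=datetime.timezone.utc).timestamp()
#returns -534556800.0, no OSError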

Issue

When I convert a Pandas DataFrame with datetime64[ns] dates that are pre-epoch to an Apache Spark DataFrame, I get a series of warnings of the form Exception ignored in: 'pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc' (full stack trace below), and the pre-epoch dates are changed to the epoch. Why does this happen, and how do I prevent it?

Software Versions

Windows 10
Python 3.7.6
pyspark 2.4.5
pandas 1.0.1

Code to Reproduce

#imports
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession

#set up spark
spark = SparkSession.builder.getOrCreate()

#create dataframe
df = pd.DataFrame({'Dates': [datetime(2019,3,29), datetime(1953,2,20)]})

#data types
df.dtypes

"""
Result:
Dates    datetime64[ns]
dtype: object
"""

#try to convert to spark
sparkdf = spark.createDataFrame(df)

Stack Trace

Exception ignored in: 'pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc'
Traceback (most recent call last):
  File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 144, in fromutc
    return f(self, dt)
  File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 258, in fromutc
    dt_wall = self._fromutc(dt)
  File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 222, in _fromutc
    dtoff = dt.utcoffset()
  File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 222, in utcoffset
    if self._isdst(dt):
  File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 291, in _isdst
    dstval = self._naive_is_dst(dt)
  File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 260, in _naive_is_dst
    return time.localtime(timestamp + time.timezone).tm_isdst
OSError: [Errno 22] Invalid argument
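
The last frame points at time.localtime() being called with a negative value. As far as I can tell (an assumption about this environment rather than a documented guarantee), the Windows C runtime rejects negative arguments outright, which would explain why only the pre-epoch row is affected:

import time

#fine on any platform
time.localtime(100)

#raises OSError: [Errno 22] Invalid argument on Windows
time.localtime(-100)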

Resulting Dataframe

sparkdf.show()
+-------------------+
|              Dates|
+-------------------+
|2019-03-29 00:00:00|
|1970-01-01 00:00:00|
+-------------------+

Data Types

sparkdf.printSchema()
root
 |-- Dates: timestamp (nullable = true)
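
One idea for sidestepping the failing local-timezone path, reusing df and spark from the code above (an untested sketch, so the behaviour of tz_localize and the session-timezone setting here is an assumption, not a confirmed fix): hand Spark timezone-aware values so the pre-epoch dates never have to be resolved against the Windows local timezone.

#untested sketch: make the pandas column timezone-aware before conversion
df['Dates'] = df['Dates'].dt.tz_localize('UTC')

#assumption: keeping the Spark session in UTC avoids another local-time round trip
spark.conf.set("spark.sql.session.timeZone", "UTC")

sparkdf = spark.createDataFrame(df)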

Upvotes: 3

Views: 1281

Answers (1)

AndreasH

Reputation: 11

This is not an answer, but it may shed light on the situation. Even dates approaching 1970-01-01 00:00 give me Errno 22. (I wish to resolve some very early epoch times, for data being logged before NTP time is available.) The earliest datetime that works as expected is somewhere around 1970-01-01 17:00:

>>> import datetime
>>> a3 = datetime.datetime(1970,1,1,17,59)
>>> a1 = datetime.datetime(1970,1,1,17,0)
>>> a0 = datetime.datetime(1970,1,1,16,59)
>>> a3.timestamp()
89940.0
>>> a1.timestamp()
86400.0
>>> a0.timestamp()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>>

My computer is in the MDT timezone, GMT-6 (MST with daylight saving time).
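
If I am reading CPython's handling of naive datetimes correctly (a guess on my part, not verified against the source), timestamp() probes time.localtime() 24 hours before the candidate POSIX timestamp to detect DST folds, and Windows rejects negative arguments. With my UTC-7 winter offset that puts the cutoff exactly at 1970-01-01 17:00 local, whose probe lands on 0:

>>> from datetime import datetime
>>> epoch = datetime(1970, 1, 1)
>>> offset = 7 * 3600   # MST is UTC-7 (assumed for this machine)
>>> probe = lambda d: (d - epoch).total_seconds() + offset - 24 * 3600
>>> probe(datetime(1970, 1, 1, 17, 0))   # probe is 0, localtime accepts it
0.0
>>> probe(datetime(1970, 1, 1, 16, 59))  # probe is negative, localtime fails
-60.0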

Python newbie and first-time answer writer! Windows 10, Python 3.7 (64-bit). Anaconda installed but not in use here.

Upvotes: 1
