ca9163d9

Reputation: 29159

IndexError: list index out of range when manually creating a spark data frame?

I'm trying to manually create a Spark dataframe with one column, DT, and a single row containing the date 2020-1-1:

DT
=======
2020-01-01

However, I get an IndexError: list index out of range:

spark = SparkSession.builder\
        .master(f'spark://{IP}:7077')\
        .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
        .appName('g data')\
        .getOrCreate()

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

Traceback:

 in brand_tagging_since_until(spark, since, until)
---> 81         dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

/usr/local/bin/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/bin/spark/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    419             if isinstance(schema, (list, tuple)):
    420                 for i, name in enumerate(schema):
--> 421                     struct.fields[i].name = name
    422                     struct.names[i] = name
    423             schema = struct

Upvotes: 0

Views: 2407

Answers (2)

mck

Reputation: 42332

A more straightforward way to create the dataframe without relying on pandas:

import pyspark.sql.functions as F

dates = spark.createDataFrame([['2020-01-01']], ['DT']) \
             .withColumn('DT', F.col('DT').cast('timestamp'))

dates.show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

Upvotes: 1

Nick Becker

Reputation: 4214

There are two issues here, though only one is surfaced in your example. Your immediate issue is that the constructor expects a trailing comma after the value: `(val)` is just `val` wrapped in parentheses, while `(val,)` is a one-element tuple. But adding the comma alone will silently fail, as the constructor doesn't know what to do with a pandas Timestamp object.
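The trailing-comma point is plain Python, independent of Spark: parentheses alone don't create a tuple.

```python
val = "2020-01-01"

# (val) is just val wrapped in grouping parentheses -- still a string
print(type((val)))   # <class 'str'>

# The trailing comma is what makes a one-element tuple
print(type((val,)))  # <class 'tuple'>
```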

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val,)],
    schema=["DT"]
).show()
+---+
| DT|
+---+
| []|
+---+

You'll want to convert this to a raw Python datetime object beforehand if you want to use the constructor like this.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val.to_pydatetime(),)],
    schema=["DT"]
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

With that said, it's not clear to me where this is most cleanly documented. If you're curious, you can see this requirement in the Spark codebase, or in the source code docs.

If you pass a pandas DataFrame to the constructor, this is handled under the hood.

df = pd.DataFrame({"DT": [val]})
spark.createDataFrame(
    data=df
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

Upvotes: 2
