Reputation: 29159
I'm trying to manually create a Spark dataframe with one column, DT, and one row containing the date 2020-01-01:
DT
=======
2020-01-01
However, it fails with a list index out of range error. Here is my code:
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder\
    .master(f'spark://{IP}:7077')\
    .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
    .appName('g data')\
    .getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])
Traceback:
in brand_tagging_since_until(spark, since, until)
---> 81     dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

/usr/local/bin/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/bin/spark/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    419         if isinstance(schema, (list, tuple)):
    420             for i, name in enumerate(schema):
--> 421                 struct.fields[i].name = name
    422                 struct.names[i] = name
    423             schema = struct

IndexError: list index out of range
Upvotes: 0
Views: 2407
Reputation: 42332
A more straightforward way to create the dataframe without relying on pandas:
import pyspark.sql.functions as F
dates = spark.createDataFrame([['2020-01-01']], ['DT']) \
    .withColumn('DT', F.col('DT').cast('timestamp'))
dates.show()
+-------------------+
| DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
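If you want a true date column rather than a timestamp, a small variant of the same idea (my sketch, not part of the original answer) casts to date instead:

dates = spark.createDataFrame([['2020-01-01']], ['DT']) \
    .withColumn('DT', F.col('DT').cast('date'))  # DateType instead of TimestampType
dates.show()

which should print the date without the time component:

+----------+
|        DT|
+----------+
|2020-01-01|
+----------+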
Upvotes: 1
Reputation: 4214
There are two issues here, though only one surfaces in your example. The immediate issue is that the constructor expects a , after the value, so that each row is a tuple. But just adding the comma naively will still fail, silently this time, because the constructor doesn't know what to do with a pandas Timestamp object.
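To see why the comma matters: in Python, parentheses alone don't create a one-element tuple; only the trailing comma does. A quick check (my addition) makes this concrete:

import pandas as pd

val = pd.to_datetime('2020-1-1')
print(type((val)))   # <class 'pandas._libs.tslibs.timestamps.Timestamp'> -- the parens are a no-op
print(type((val,)))  # <class 'tuple'> -- the trailing comma makes it a row

So without the comma, createDataFrame receives a bare Timestamp where it expects a row, which is why the assignment to struct.fields[i].name in your traceback raises the IndexError. With the comma in place, here is the silent failure: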
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("timestamp").getOrCreate()
val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val,)],
    schema=["DT"]
).show()
+---+
| DT|
+---+
| []|
+---+
You'll want to convert this to a raw Python datetime object beforehand if you want to use the constructor like this.
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("timestamp").getOrCreate()
val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val.to_pydatetime(),)],
    schema=["DT"]
).show()
+-------------------+
| DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
With that said, it's not clear to me where this is most cleanly documented. If you're curious, you can see this requirement in the Spark codebase, or in the source code docs.
If you pass a pandas DataFrame to the constructor, this is handled under the hood.
df = pd.DataFrame({"DT": [val]})
spark.createDataFrame(
    data=df
).show()
+-------------------+
| DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
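One more option, as a minimal sketch beyond the examples above: build the value with the standard-library datetime module from the start, and no conversion is needed at all. A datetime.date maps directly to Spark's DateType:

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timestamp").getOrCreate()

# datetime.date -> DateType; datetime.datetime -> TimestampType
spark.createDataFrame(
    data=[(datetime.date(2020, 1, 1),)],
    schema=["DT"]
).printSchema()

root
 |-- DT: date (nullable = true)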
Upvotes: 2