Reputation: 681
I just switched from Pandas to PySpark DataFrames and found that printing out the same column from a PySpark DataFrame gives wrong values. Here's an example. Using Pandas:
import pandas as pd

df_pandas = pd.read_csv("crime.csv", low_memory=False)
print(df_pandas["CRIMEID"].head(5))
Output:
1321797
1344185
1181882
1182632
1195867
Whereas using PySpark dataframe:
df_spark = (sqlContext.read.format('csv')
            .options(header='true', inferSchema='true')
            .load('crime.csv'))
df_spark.select("CRIMEID").show(5)
Output:
+-------+
|CRIMEID|
+-------+
|1321797|
| null|
| null|
|1344185|
| null|
+-------+
I haven't dropped any null rows either. Could somebody explain why that happens? I would really appreciate some help.
Upvotes: 0
Views: 1406
Reputation: 994
Here's what is happening:
Spark is a distributed framework: the data is split into partitions and spread across the cluster, so the order in which rows come back is not guaranteed to match the order of the file. Pandas' head(5) shows the first five rows exactly as they appear in the CSV, whereas Spark's show(5) simply displays five rows from whichever partition happens to be read first; the nulls you see are presumably rows that occur further down the file.
To sum up, Spark is not giving you wrong values; it is returning the records in an arbitrary order, which is different from what you are getting from Pandas.
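As a minimal sketch of how you could check this yourself (assuming a SparkSession named spark, which replaces sqlContext in newer Spark versions, and the df_pandas DataFrame from the question):

from pyspark.sql import functions as F

df_spark = (spark.read.format('csv')
            .options(header='true', inferSchema='true')
            .load('crime.csv'))

# show() gives no ordering guarantee, so impose an explicit sort first
df_spark.select('CRIMEID').orderBy(F.col('CRIMEID').asc_nulls_last()).show(5)

# If Spark is only reordering rows, the null counts from both engines should agree
spark_nulls = df_spark.filter(F.col('CRIMEID').isNull()).count()
pandas_nulls = int(df_pandas['CRIMEID'].isna().sum())
print(spark_nulls, pandas_nulls)

If the two counts match, the values themselves are intact and only the display order differs.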
Upvotes: 1