moirK

Reputation: 681

PySpark dataframe shows wrong values

I just switched from a Pandas dataframe to a PySpark dataframe and found that printing the same column in PySpark gives wrong values. Here's an example. Using Pandas:

import pandas as pd

df_pandas = pd.read_csv("crime.csv", low_memory=False)
print(df_pandas["CRIMEID"].head(5))

Output:

1321797
1344185
1181882
1182632
1195867

Whereas using PySpark dataframe:

df_spark = sqlContext.read.format('csv').options(header='true', inferSchema='true').load('crime.csv')
df_spark.select("CRIMEID").show(5)

Output:

+-------+
|CRIMEID|
+-------+
|1321797|
|   null|
|   null|
|1344185|
|   null|
+-------+

I haven't dropped any null rows either. Could somebody explain why that happens? I would really appreciate some help.

Upvotes: 0

Views: 1406

Answers (1)

Abhishek Arora

Reputation: 994

Here's what is happening:

  • When you read a csv in Pandas, the order of the records is preserved. Since pandas is not distributed and holds everything in memory, that order doesn't change when you call the 'head' method on the pandas dataframe. The output you get is therefore in the same order as the rows in the csv.
  • A Spark dataframe also preserves the order when reading from an ordered file (e.g. a csv), but when you call an action method like 'show' on a Spark dataframe, shuffling may take place, and due to the nature of shuffling the records may come back in a different, effectively random order.

In a distributed framework like Spark, where data is partitioned and spread across the cluster, shuffling of data is a routine part of execution.

So to sum it up, Spark is not giving you wrong values; it is just returning the records in a different order than what you get from pandas.
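You can convince yourself of this with an order-insensitive check. The sketch below uses plain Python with hypothetical stand-in values for the CRIMEID column (None playing the role of Spark's null): if the rows were merely reordered, sorting the non-null values on both sides and counting the nulls must give the same result.

```python
# Hypothetical stand-in values for the CRIMEID column; the data is
# illustrative, only the comparison technique matters.
pandas_ids = [1321797, 1344185, 1181882, 1182632, None]
spark_ids = [None, 1344185, 1181882, 1321797, 1182632]  # same values, shuffled

# Order-insensitive comparison: same non-null values, same number of nulls.
same_values = (
    sorted(v for v in pandas_ids if v is not None)
    == sorted(v for v in spark_ids if v is not None)
    and pandas_ids.count(None) == spark_ids.count(None)
)
print(same_values)  # True: a reordering, not wrong values
```

If this check ever came back False on the real data, the difference would not be a reordering, and the schema inference of the csv reader would be the next thing to inspect.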

Upvotes: 1
