moirK

Reputation: 681

PySpark dataframe shows wrong values

I just switched from a Pandas dataframe to a PySpark dataframe and found that printing the same column in PySpark gives wrong values. Here's an example. Using Pandas:

import pandas as pd

df_pandas = pd.read_csv("crime.csv", low_memory=False)
print(df_pandas["CRIMEID"].head(5))

Output:

1321797
1344185
1181882
1182632
1195867

Whereas using PySpark dataframe:

df_spark = sqlContext.read.format('csv').options(header='true', inferSchema='true').load('crime.csv')
df_spark.select("CRIMEID").show(5)

Output:

+-------+
|CRIMEID|
+-------+
|1321797|
|   null|
|   null|
|1344185|
|   null|
+-------+

I haven't dropped any null rows either. Could somebody explain why that happens? I would really appreciate some help.

Upvotes: 0

Views: 1406

Answers (1)

Abhishek Arora

Reputation: 994

Here's what is happening:

  • When you read a csv in Pandas, the order of the records is preserved. Since pandas is not distributed and holds everything in memory, that order doesn't change when you call the 'head' method on the pandas dataframe. The output you get is therefore in the same order as the rows in the csv.
  • A Spark dataframe also preserves the order when reading from an ordered file (e.g. a csv), but when you call an action method like 'show' on a Spark dataframe, shuffling may take place, and due to the nature of shuffling the records may come back in a different, effectively random order.

In a distributed framework like Spark, where data is partitioned and spread across the cluster, shuffling of data is a routine part of execution.

So to sum it up, Spark is not giving you wrong values; it is just returning the records in a different order than what you get from pandas.
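You can convince yourself of this with an order-insensitive check. The sketch below uses plain Python with hypothetical stand-in values for the CRIMEID column (None playing the role of Spark's null): if the rows were merely reordered, sorting the non-null values on both sides and counting the nulls must give the same result.

```python
# Hypothetical stand-in values for the CRIMEID column; the data is
# illustrative, only the comparison technique matters.
pandas_ids = [1321797, 1344185, 1181882, 1182632, None]
spark_ids = [None, 1344185, 1181882, 1321797, 1182632]  # same values, shuffled

# Order-insensitive comparison: same non-null values, same number of nulls.
same_values = (
    sorted(v for v in pandas_ids if v is not None)
    == sorted(v for v in spark_ids if v is not None)
    and pandas_ids.count(None) == spark_ids.count(None)
)
print(same_values)  # True: a reordering, not wrong values
```

If this check ever came back False on the real data, the difference would not be a reordering, and the schema inference of the csv reader would be the next thing to inspect.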

Upvotes: 1
