PySpark: Converting a sample to a pandas DataFrame

Question

I am trying to extract a sample from a Spark DataFrame (df_spark) with 100 million rows and convert it to a pandas DataFrame using the code below:

df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).collect().toPandas()

Unfortunately, I'm getting the following error:

AttributeError: 'list' object has no attribute 'toPandas'

I also tried converting it to an RDD and then to pandas, and got the same error.

Once I have the sample as a list, what is the correct way to convert it to a pandas DataFrame or a Spark DataFrame?

Pengshe · Accepted Answer

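There is no need to call collect() here. collect() returns a plain Python list of Row objects, and a list has no toPandas() method, which is what the AttributeError is telling you. The sample() method already returns a DataFrame, so the code can be as simple as: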

df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).toPandas()
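If it helps, here is a minimal sketch of the same idea wrapped in a helper, plus one way to handle the case where you already have the list from collect(). The function names sample_to_pandas and rows_to_pandas and the max_rows parameter are made up for illustration and are not part of PySpark. Keep in mind that toPandas() pulls the whole sample onto the driver, and 5% of 100 million rows is roughly 5 million rows, so an optional cap may be worth having.

```python
def sample_to_pandas(df_spark, fraction=0.05, seed=11, max_rows=None):
    """Sample a Spark DataFrame and return the sample as a pandas DataFrame.

    max_rows is a hypothetical safety cap (not a PySpark option): when set,
    the sample is truncated with limit() before it is sent to the driver.
    """
    # sample() returns a Spark DataFrame, so no collect() is needed.
    sampled = df_spark.sample(withReplacement=False, fraction=fraction, seed=seed)
    if max_rows is not None:
        sampled = sampled.limit(max_rows)
    # toPandas() materialises the rows on the driver as a pandas DataFrame.
    return sampled.toPandas()


def rows_to_pandas(rows, columns):
    """Build a pandas DataFrame from a list of Row objects (e.g. from collect()).

    Row is a tuple subclass, so pandas can treat each Row as a record;
    the column names are passed in explicitly, e.g. df_spark.columns.
    """
    import pandas as pd  # imported here so the first helper works without it

    return pd.DataFrame(rows, columns=columns)
```

Usage would look like df = sample_to_pandas(df_spark, fraction=0.05, seed=11). If you already called collect(), rows_to_pandas(rows, df_spark.columns) gives a pandas DataFrame, and spark.createDataFrame(rows) turns the list back into a Spark DataFrame (assuming spark is your SparkSession).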
