nayak0765

Reputation: 193

Difference between pandas and Spark in terms of 'in-memory' processing in Python

I have learned about Spark's in-memory processing, which is described as an advantage over pandas. But compare the pandas and Spark programs below, which both create a DataFrame and concatenate two columns. In both cases the processing happens 'in-memory', since the data must be in RAM to be processed. So how does Spark give an advantage over pandas in this scenario, if both process in-memory? Also, when should we go for Spark and when for pandas?

spark :-

from datetime import date
from pyspark.sql.functions import col, concat

df = spark.createDataFrame([
        ("Red", 1, "Apple", date(2021, 1, 1), ''),
        ("Black", 2, "Grape", date(2021, 2, 3), ''),
        ("Yellow", 3, "Banana", date(2022, 2, 4), '')
        ], schema="color string, sr_no long, fruit string, orderDate date, desc string")
df2 = df.withColumn("desc", concat(col("color"), col("fruit")))
df2.show()  # show() prints the table itself and returns None, so don't wrap it in print()

pandas :-

import pandas as pd

data = {'color': ['Red', 'Black', 'Yellow'],
        'sr_no': [1, 2, 3],  # integers, to match the Spark schema's long type
        'fruit': ['Apple', 'Grape', 'Banana'],
        'orderDate': ['2021-01-01', '2021-02-03', '2022-02-04']
        }
df = pd.DataFrame.from_dict(data)
df['desc'] = df['color'] + df['fruit']
print(df)

o/p:-

color,sr_no,fruit,orderDate,desc
Red,1,Apple,2021-01-01,RedApple
Black,2,Grape,2021-02-03,BlackGrape
Yellow,3,Banana,2022-02-04,YellowBanana

Upvotes: 2

Views: 730

Answers (1)

Maurice

Reputation: 13092

(Py)Spark is designed for big datasets, i.e., from multiple gigabytes up to petabytes. Pandas can natively handle only data that fits into your local memory, which at the time of writing is usually a few gigabytes.

The costs in PySpark are complexity and money: you need a cluster of machines that needs to be managed. This is why it's often a good idea to stick to Pandas until you need more parallelization or process more data in a timeframe than can be handled through chunking locally.
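To illustrate the local chunking mentioned above, here is a minimal pandas sketch. It assumes the data arrives as CSV; a small in-memory buffer stands in for what would normally be a large file on disk:

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk
csv_data = io.StringIO(
    "color,fruit\n"
    "Red,Apple\n"
    "Black,Grape\n"
    "Yellow,Banana\n"
)

descs = []
# chunksize=2 means only two rows are held in RAM at a time
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunk["desc"] = chunk["color"] + chunk["fruit"]
    descs.append(chunk["desc"])

result = pd.concat(descs, ignore_index=True)
print(result.tolist())  # ['RedApple', 'BlackGrape', 'YellowBanana']
```

Once a single pass like this no longer finishes in an acceptable time, or the per-chunk state itself outgrows memory, that is the point where a cluster framework like Spark starts to pay off.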

Note that PySpark is not a drop-in replacement for pandas: there are some syntactic differences, but the code will look similar.
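A small sketch of those differences, using the question's own column-concatenation example (the pandas part runs as-is; the PySpark equivalent is shown in comments since it needs a running Spark session):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Black"], "fruit": ["Apple", "Grape"]})

# pandas: column arithmetic executes eagerly and assigns in place
df["desc"] = df["color"] + df["fruit"]

# PySpark equivalent (not run here): withColumn returns a NEW DataFrame
# and only builds a lazy plan; nothing executes until an action runs.
#   from pyspark.sql.functions import concat, col
#   df2 = df.withColumn("desc", concat(col("color"), col("fruit")))
#   df2.show()  # the action that triggers execution

print(df["desc"].tolist())  # ['RedApple', 'BlackGrape']
```

The eager-versus-lazy execution model is the main conceptual difference to keep in mind when porting pandas code to Spark.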

There's also the Dask library for Python that allows you to have distributed computing using a mostly pandas-compatible syntax.

Upvotes: 2
