Arjun Sehgal

Reputation: 59

Usage of groupBy in Spark

I am currently learning Spark in Python. I have a small question: in other languages like SQL we can simply group a table by specified columns and then perform further operations like sum, count, etc. on them. How do we do this in Spark?

I have a schema like:

    [name:"ABC", city:"New York", money:"50"]
    [name:"DEF", city:"London", money:"10"]
    [name:"ABC", city:"New York", money:"30"]
    [name:"XYZ", city:"London", money:"20"]
    [name:"XYZ", city:"London", money:"100"]
    [name:"DEF", city:"London", money:"200"]

Let's say I want to group this by city and then compute the sum of money for each name. Something like:

    New York ABC 80
    London DEF 210
    London XYZ 120

Upvotes: 2

Views: 1932

Answers (2)

Jeff

Reputation: 2228

You can do this in a Pythonic way as well (or the SQL version @LostInOverflow posted):

grouped = df.groupby('city', 'name').sum('money')

It looks like your money column is strings, so you'll need to cast it as an int first (or load it up that way to begin with):

df = df.withColumn('money', df['money'].cast('int'))

And remember that DataFrames are immutable, so both of these require you to assign the result to an object (even if it's just back to df again), and then use show() if you want to see the results.
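
For example, a minimal sketch that puts the cast and the aggregation together (assuming df has been built as shown below; show() just prints the result):

df = df.withColumn('money', df['money'].cast('int'))
grouped = df.groupby('city', 'name').sum('money')
grouped.show()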

Edit: I should add that you need to create a dataframe first. For your simple data it's almost the same as the SQL version posted, but you assign it to a dataframe object instead of registering it as a table:

df = sc.parallelize([
    {"name": "ABC", "city": "New York", "money":"50"},
    {"name": "DEF", "city": "London",   "money":"10"},
    {"name": "ABC", "city": "New York", "money":"30"},
    {"name": "XYZ", "city": "London",   "money":"20"},
    {"name": "XYZ", "city": "London",   "money":"100"},
    {"name": "DEF", "city": "London",   "money":"200"},
    ]).toDF()
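
An equivalent form that also lets you name the output column uses agg with an alias (a sketch against the same df; the column name total is only illustrative):

from pyspark.sql import functions as F

df = df.withColumn('money', df['money'].cast('int'))  # cast the string column first
grouped = df.groupby('city', 'name').agg(F.sum('money').alias('total'))
grouped.show()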

Upvotes: 2

user6022341

Reputation:

You can use SQL:

>>> sc.parallelize([
... {"name": "ABC", "city": "New York", "money":"50"},
... {"name": "DEF", "city": "London",   "money":"10"},
... {"name": "ABC", "city": "New York", "money":"30"},
... {"name": "XYZ", "city": "London",   "money":"20"},
... {"name": "XYZ", "city": "London",   "money":"100"},
... {"name": "DEF", "city": "London",   "money":"200"},
... ]).toDF().registerTempTable("df")

>>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
... FROM df GROUP BY name, city""")
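
Note that sqlContext.sql returns a DataFrame rather than printing anything, so call show on it to see the aggregated rows (a sketch continuing the same session; the variable name totals is only illustrative):

>>> totals = sqlContext.sql(
...     "SELECT name, city, sum(cast(money AS bigint)) AS total FROM df GROUP BY name, city")
>>> totals.show()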

Upvotes: 2
