Reputation: 59
I am currently learning Spark in Python. I have a small question: in other languages like SQL, we can simply group a table by specified columns and then perform further operations like sum, count, etc. on them. How do we do this in Spark?
I have a schema like:
[name:"ABC", city:"New York", money:"50"]
[name:"DEF", city:"London", money:"10"]
[name:"ABC", city:"New York", money:"30"]
[name:"XYZ", city:"London", money:"20"]
[name:"XYZ", city:"London", money:"100"]
[name:"DEF", city:"London", money:"200"]
Let's say I want to group this by city, and then compute the sum of money for each name. Something like:
New York ABC 80
London DEF 210
London XYZ 120
Upvotes: 2
Views: 1932
Reputation: 2228
You can do this in a Pythonic way as well (or the SQL version @LostInOverflow posted):
grouped = df.groupby('city', 'name').sum('money')
It looks like your money column contains strings, so you'll need to cast it to an int first (or load it that way to begin with):
df = df.withColumn('money', df['money'].cast('int'))
And remember that dataframes are immutable, so both of these require you to assign them to an object (even if it's just back to df again), and then use show if you want to see the results.
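To see what the groupby-and-sum above actually computes, here is a plain-Python sketch (no Spark cluster needed) using the sample rows from the question; the int() call mirrors the cast('int') step:

```python
from collections import defaultdict

rows = [
    {"name": "ABC", "city": "New York", "money": "50"},
    {"name": "DEF", "city": "London", "money": "10"},
    {"name": "ABC", "city": "New York", "money": "30"},
    {"name": "XYZ", "city": "London", "money": "20"},
    {"name": "XYZ", "city": "London", "money": "100"},
    {"name": "DEF", "city": "London", "money": "200"},
]

# Group by (city, name) and sum money, casting the string values to int
totals = defaultdict(int)
for row in rows:
    totals[(row["city"], row["name"])] += int(row["money"])

for (city, name), total in sorted(totals.items()):
    print(city, name, total)
```

This prints London DEF 210, London XYZ 120, and New York ABC 80, matching the output the question asks for.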
Edit: I should add that you need to create a dataframe first. For your simple data it's almost the same as the SQL version posted, but you assign it to a dataframe object instead of registering it as a table:
df = sc.parallelize([
{"name": "ABC", "city": "New York", "money":"50"},
{"name": "DEF", "city": "London", "money":"10"},
{"name": "ABC", "city": "New York", "money":"30"},
{"name": "XYZ", "city": "London", "money":"20"},
{"name": "XYZ", "city": "London", "money":"100"},
{"name": "DEF", "city": "London", "money":"200"},
]).toDF()
Upvotes: 2
Reputation:
You can use SQL:
>>> sc.parallelize([
... {"name": "ABC", "city": "New York", "money":"50"},
... {"name": "DEF", "city": "London", "money":"10"},
... {"name": "ABC", "city": "New York", "money":"30"},
... {"name": "XYZ", "city": "London", "money":"20"},
... {"name": "XYZ", "city": "London", "money":"100"},
... {"name": "DEF", "city": "London", "money":"200"},
... ]).toDF().registerTempTable("df")
>>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
... FROM df GROUP BY name, city""")
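Since the GROUP BY / SUM / CAST semantics here are standard SQL, you can check the same query without a Spark cluster using Python's stdlib sqlite3 (note sqlite is used purely for illustration; the CAST target is INTEGER rather than Spark's bigint):

```python
import sqlite3

# In-memory database with the same sample data, money stored as strings
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (name TEXT, city TEXT, money TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", [
    ("ABC", "New York", "50"),
    ("DEF", "London", "10"),
    ("ABC", "New York", "30"),
    ("XYZ", "London", "20"),
    ("XYZ", "London", "100"),
    ("DEF", "London", "200"),
])

# Same shape as the Spark SQL query, with GROUP BY spelled out
result = conn.execute("""
    SELECT name, city, SUM(CAST(money AS INTEGER)) AS total
    FROM df GROUP BY name, city
    ORDER BY name
""").fetchall()
# result: [('ABC', 'New York', 80), ('DEF', 'London', 210), ('XYZ', 'London', 120)]
```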
Upvotes: 2