Reputation: 365
I'm new to Spark with Python, and I'm trying to do some basic things to get an understanding of both.
I have a file like the one below:
empid||deptid||salary
1||10||500
2||10||200
3||20||300
4||20||400
5||20||100
I want to write a small PySpark program that reads the file and prints the count of employees in each department.
I've been working with databases, and this is quite simple in SQL, but I'm trying to do it with Python and Spark. I don't have any code to share as I'm completely new to both, but I wanted to understand how it works through a simple hands-on example.
I've installed pyspark and did some quick reading here: https://spark.apache.org/docs/latest/quick-start.html
From my understanding there are dataframes on which one can perform SQL-like operations such as group by, but I'm not sure how to write the code properly.
Upvotes: 1
Views: 1466
Reputation: 10086
You can read the text file as a dataframe using:
lines = sc.textFile("path/to/my/file").map(lambda l: l.split("||"))  # the file is "||"-delimited
header = lines.first()
df = spark.createDataFrame(
    lines.filter(lambda l: l != header),  # drop the header row
    ["empid", "deptid", "salary"]
)
textFile loads the file as an RDD of strings, one element per line. We then split each line on the "||" delimiter through a map, filter out the header row, and convert the result to a dataframe. Note that the columns come in as strings; cast them if you need numeric types (the counts below are unaffected).
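As an aside, on Spark 3.0+ (where the CSV reader accepts a multi-character separator) you can skip the RDD step and load the file directly; a minimal sketch, assuming the same file path:
# Read the "||"-delimited file in one step; needs Spark 3.0+ for a
# multi-character separator
df = spark.read.csv(
    "path/to/my/file",
    sep="||",          # the file's delimiter
    header=True,       # first line holds the column names
    inferSchema=True,  # parse the columns as integers instead of strings
)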
Alternatively, starting from a Python list of lists:
df = spark.createDataFrame(
    sc.parallelize([[1, 10, 500],
                    [2, 10, 200],
                    [3, 20, 300],
                    [4, 20, 400],
                    [5, 20, 100]]),
    ["empid", "deptid", "salary"]
)
df.show()
+-----+------+------+
|empid|deptid|salary|
+-----+------+------+
| 1| 10| 500|
| 2| 10| 200|
| 3| 20| 300|
| 4| 20| 400|
| 5| 20| 100|
+-----+------+------+
Now, to count the number of employees by department, we use a groupBy on deptid followed by the count aggregation function:
df_agg = df.groupBy("deptid").count()  # keep the result so we can reuse it below
df_agg.show()
+------+-----+
|deptid|count|
+------+-----+
| 10| 2|
| 20| 3|
+------+-----+
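Since you're coming from databases, you can also register the dataframe as a temporary view and write the same aggregation in plain SQL; a small sketch (the view name employees is just an example):
# Register the dataframe so it can be queried with SQL
df.createOrReplaceTempView("employees")
spark.sql("""
    SELECT deptid, COUNT(*) AS count
    FROM employees
    GROUP BY deptid
""").show()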
For the maximum count across departments:
import pyspark.sql.functions as psf
df_agg.agg(psf.max("count")).show()  # largest headcount across departments
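And if you want to know which department that maximum belongs to, a sort on the aggregated dataframe does it; a small sketch reusing df_agg from above:
# Sort departments by headcount, largest first, and keep the top row
df_agg.orderBy(psf.desc("count")).limit(1).show()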
Upvotes: 3