Firstname

Reputation: 365

Reading and grouping data to get counts using Python Spark

I'm new to Spark with Python and I'm trying to do some basic things to get an understanding of how Python and Spark work together.

I have a file like below -

empid||deptid||salary
1||10||500
2||10||200
3||20||300
4||20||400
5||20||100

I want to write a small Python Spark program to read the file and print the count of employees in each department.

I've been working with databases and this is quite simple in SQL, but I'm trying to do it with Python and Spark. I don't have code to share as I'm completely new to both, but I wanted to understand how it works through a simple hands-on example.

I've installed pyspark and done some quick reading here: https://spark.apache.org/docs/latest/quick-start.html

From my understanding there are dataframes on which one can perform SQL-like group bys, but I'm not sure how to write the code properly.

Upvotes: 1

Views: 1466

Answers (1)

MaFF

Reputation: 10086

You can read the text file as a dataframe using:

lines = sc.textFile("path/to/my/file")
header = lines.first()  # the first line holds the column names
df = spark.createDataFrame(
    lines.filter(lambda l: l != header).map(lambda l: l.split("||")),
    ["empid", "deptid", "salary"]
)

textFile loads the data as an RDD of strings, one element per line. We then drop the header line, split each remaining line on the || delimiter through a map, and convert the result to a dataframe. The columns come back as strings, which is fine for counting.
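
Alternatively, newer Spark versions can parse the delimited file directly with the DataFrame reader. This is a minimal sketch, assuming Spark 3.0+ (which accepts multi-character separators such as ||); the path is just a placeholder:

# Assumes Spark 3.0+, where the CSV reader accepts multi-character separators
df = spark.read.csv(
    "path/to/my/file",   # placeholder path, same file as above
    sep="||",            # field delimiter used in the sample data
    header=True,         # first line contains the column names
    inferSchema=True     # cast empid/deptid/salary to numeric types
)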

Starting from a python list of lists:

df = spark.createDataFrame(
    sc.parallelize([[1,10,500],
                    [2,10,200],
                    [3,20,300],
                    [4,20,400],
                    [5,20,100]]),
    ["empid","deptid","salary"]
)

df.show()

    +-----+------+------+
    |empid|deptid|salary|
    +-----+------+------+
    |    1|    10|   500|
    |    2|    10|   200|
    |    3|    20|   300|
    |    4|    20|   400|
    |    5|    20|   100|
    +-----+------+------+

Now, to count the number of employees by department, we use groupBy followed by the count aggregation function:

df_agg = df.groupBy("deptid").count()
df_agg.show()

    +------+-----+
    |deptid|count|
    +------+-----+
    |    10|    2|
    |    20|    3|
    +------+-----+
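
Since you mentioned coming from SQL, the same aggregation can also be written as a SQL query against a temporary view. This is a minimal sketch, where the view name employees is just an example:

# Register the dataframe as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("employees")

spark.sql("""
    SELECT deptid, COUNT(*) AS count
    FROM employees
    GROUP BY deptid
""").show()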

For the maximum count across departments:

import pyspark.sql.functions as psf
df_agg.agg(psf.max("count")).show()
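
If you also want to know which department that maximum belongs to, one way (a sketch, not the only option) is to sort by the count and keep the top row:

# Sort departments by employee count, largest first, and keep the top row
df_agg.orderBy(psf.desc("count")).limit(1).show()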

Upvotes: 3
