user2768498

Reputation: 155

Spark - How to count number of records by key

This is probably an easy problem, but basically I have a dataset where I need to count the number of females for each country. Ultimately I want to group each count by country, but I am unsure what to use for the value, since there is no count column in the dataset that I can use as the value in a groupByKey or reduceByKey. I thought of using reduceByKey(), but that requires a key-value pair and I only have the key; I want to make a counter the value. How do I go about this?

val lines = sc.textFile("/home/cloudera/desktop/file.txt")
val split_lines = lines.map(_.split(","))
val femaleOnly = split_lines.filter(x => x._10 == "Female")

Here is where I am stuck. The country is at index 13 in the dataset. The output should look something like this: (Australia, 201000), (America, 420000), etc. Any help would be great. Thanks

Upvotes: 11

Views: 53112

Answers (3)

dpeacock

Reputation: 2747

You're nearly there! All you need is a countByValue:

val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// Returns a Map: (Australia, 230), (America, 23242), etc.

(In your example, I assume you meant x(10) rather than x._10; split returns an Array[String], which is indexed with parentheses, while ._10 is tuple accessor syntax.)

All together:

sc.textFile("/home/cloudera/desktop/file.txt")
    .map(_.split(","))
    .filter(x => x(10) == "Female")
    .map(_(13))
    .countByValue()
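
Note that countByValue is an action: it returns a plain Scala Map on the driver rather than an RDD. A minimal usage sketch for printing the result (variable name illustrative):

val countsByCountry = femaleOnly.map(_(13)).countByValue()

countsByCountry.foreach { case (country, n) =>
  println(s"($country, $n)")  // e.g. (Australia, 230)
}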

Upvotes: 17

oleksii

Reputation: 35925

You can easily create a key; it doesn't have to be in the file/database. For example:

val countryGender = sc.textFile("/home/cloudera/desktop/file.txt")
                .map(_.split(","))
                .filter(x => x(10) == "Female")
                .map(x => (x(13), x(10)))    // <<<< here you generate a new key
                .groupByKey()
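
Note that groupByKey only collects the gender strings for each country; it does not count them. A minimal sketch of finishing the count with reduceByKey instead (the variable name is illustrative, and the column layout is assumed to match the question):

val femaleCountByCountry = sc.textFile("/home/cloudera/desktop/file.txt")
                .map(_.split(","))
                .filter(x => x(10) == "Female")
                .map(x => (x(13), 1))    // (country, 1) pairs
                .reduceByKey(_ + _)      // sum the counters per country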

Upvotes: 0

Francois G

Reputation: 11985

Have you considered manipulating your RDD using the DataFrame API?

It looks like you're loading a CSV file, which you can do with spark-csv.

Then it's a simple matter (if your CSV has a header with the obvious column names) of:

import com.databricks.spark.csv._
import sqlContext.implicits._ // needed for the $"colName" syntax

val countryGender = sqlContext.csvFile("/home/cloudera/desktop/file.txt") // already splits by field
  .filter($"gender" === "Female")
  .groupBy("country")
  .count()

countryGender.show()

If you want to go deeper in this kind of manipulation, here's the guide: https://spark.apache.org/docs/latest/sql-programming-guide.html
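
For what it's worth, spark-csv was later folded into Spark itself, so on Spark 2.x+ no external package is needed. A minimal sketch of the same query there, assuming the CSV has a header row:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // for the $"colName" syntax

spark.read
  .option("header", "true")  // treat the first line as column names
  .csv("/home/cloudera/desktop/file.txt")
  .filter($"gender" === "Female")
  .groupBy("country")
  .count()
  .show()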

Upvotes: 5
