Ferex

Reputation: 65

Scala/Spark - Counting the number of rows in a dataframe in which a field matches a regex

I have created a dataframe with two columns, titled "project_code" and "page_title". I want to count the number of rows in which "page_title" begins with "The" and has anything after this. To do this, I'm using a regex, which looks like "^The*".

Here's what I have so far:

val df = spark.read.textFile("/pagecounts-20160101-000000").map(l => {
    val a = l.split(" ")
    (a(0), a(1))
}).toDF("project_code", "page_title")

import scala.util.matching.Regex

val the = "^The*".r

From here, I'm unsure what to do. I think I have to perform a groupBy, using the regex as the criteria, and then aggregate. However I'm really unsure how to groupBy or aggregate using regex as criteria.

If you're looking for a sample of what the data looks like, you can find it here

Upvotes: 1

Views: 1764

Answers (2)

Raphael Roth

Reputation: 27373

If you are interested only in the rows that match the criteria, use:

import org.apache.spark.sql.functions._

df
 .where(col("page_title").rlike("^The.*"))
 .count

If you want to see how many rows match and how many don't:

df
 .groupBy(col("page_title").rlike("^The.*"))
 .count
 .show()

Or alternatively:

df
 .groupBy()
 .agg(
   count("*").as("total_count"),
   count(when(col("page_title").rlike("^The.*"), lit(1))).as("match_count")
 )
 .show()
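Note that the pattern in the question, "^The*", is not the same as the "^The.*" used above: the `*` quantifies only the preceding `e`, so it matches "Th" followed by zero or more e's and would also match titles like "Thanks". A quick check in plain Scala (a standalone sketch, separate from the Spark code):

```scala
// "^The*" -- the "*" binds to 'e' only, so "Th" alone satisfies it
val loose  = "^The*".r
// "^The.*" -- requires the literal prefix "The", then anything
val strict = "^The.*".r

println(loose.findFirstIn("Thanks"))        // Some(Th) -- unintended match
println(strict.findFirstIn("Thanks"))       // None
println(strict.findFirstIn("The_Guardian")) // Some(The_Guardian)
```

With `rlike`, `"^The"` alone is also enough, since `rlike` looks for a match anywhere in the string and the anchor already pins it to the start.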

Upvotes: 1

notNull

Reputation: 31460

After converting with .toDF, you can use the .startsWith (or .rlike) function to filter the matching rows from the dataframe.

Example:

import org.apache.spark.sql.functions.col

spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").startsWith("The")).
filter(col("match")).
show()

//+------------+--------------------+-----+
//|project_code|          page_title|match|
//+------------+--------------------+-----+
//|          ab|The_DeFranco_Fami...| true|
//|          ab|The_Great_War(195...| true|
//|          af|The_Gods_Must_Be_...| true|
//|          af|        The_Guardian| true|
//|          af|         The_Killers| true|
//+------------+--------------------+-----+

//to count the number of matching rows using the .startsWith function
spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").startsWith("The")).
filter(col("match")).
count()

//(or) by using .rlike function
spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").rlike("^The.*")).
filter(col("match")).
count()

//Long = 48684

Upvotes: 1
