Ferex

Reputation: 65

Scala/Spark - Counting the number of rows in a dataframe in which a field matches a regex

I have created a dataframe with two columns, titled "project_code" and "page_title". I want to count the number of rows in which "page_title" begins with "The" and has anything after this. To do this, I'm using a regex, which looks like "^The*".

Here's what I have so far:

val df = spark.read.textFile("/pagecounts-20160101-000000").map(l => {
    val a = l.split(" ")
    (a(0), a(1))
}).toDF("project_code", "page_title")

import scala.util.matching.Regex

val the = "^The*".r

From here, I'm unsure what to do. I think I have to perform a groupBy, using the regex as the criteria, and then aggregate. However I'm really unsure how to groupBy or aggregate using regex as criteria.

If you're looking for a sample of what the data looks like, you can find it here

Upvotes: 1

Views: 1764

Answers (2)

Raphael Roth

Reputation: 27373

If you are interested only in the rows that match the criteria, use:

import org.apache.spark.sql.functions._

df
 .where(col("page_title").rlike("^The.*"))
 .count

If you want to see how many rows match and how many don't:

df
 .groupBy(col("page_title").rlike("^The.*"))
 .count
 .show()

Or alternatively:

df
 .groupBy()
 .agg(
   count("*").as("total_count"),
   count(when(col("page_title").rlike("^The.*"), lit(1))).as("match_count")
 )
 .show()
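Note that the pattern in the question, "^The*", is not the same as the "^The.*" used above: the `*` quantifies only the preceding `e`, so it matches "Th" followed by zero or more e's and would also match titles like "Thanks". A quick check in plain Scala (a standalone sketch, separate from the Spark code):

```scala
// "^The*" -- the "*" binds to 'e' only, so "Th" alone satisfies it
val loose  = "^The*".r
// "^The.*" -- requires the literal prefix "The", then anything
val strict = "^The.*".r

println(loose.findFirstIn("Thanks"))        // Some(Th) -- unintended match
println(strict.findFirstIn("Thanks"))       // None
println(strict.findFirstIn("The_Guardian")) // Some(The_Guardian)
```

With `rlike`, `"^The"` alone is also enough, since `rlike` looks for a match anywhere in the string and the anchor already pins it to the start.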

Upvotes: 1

notNull

Reputation: 31460

After converting with .toDF, you can use the .startsWith (or .rlike) function to filter the matching rows from the dataframe.

Example:

import org.apache.spark.sql.functions.col

spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").startsWith("The")).
filter(col("match")).
show()

//+------------+--------------------+-----+
//|project_code|          page_title|match|
//+------------+--------------------+-----+
//|          ab|The_DeFranco_Fami...| true|
//|          ab|The_Great_War(195...| true|
//|          af|The_Gods_Must_Be_...| true|
//|          af|        The_Guardian| true|
//|          af|         The_Killers| true|
//+------------+--------------------+-----+

//to count the number of matching rows using the .startsWith function
spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").startsWith("The")).
filter(col("match")).
count()

//(or) by using .rlike function
spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").rlike("^The.*")).
filter(col("match")).
count()

//Long = 48684

Upvotes: 1
