Reputation: 65
I have created a dataframe with two columns, titled "project_code" and "page_title". I want to count the number of rows in which "page_title" begins with "The" and has anything after this. To do this, I'm using a regex, which looks like "^The*".
Here's what I have so far:
val df = spark.read.textFile(“/pagecounts-20160101-000000”).map(l => {
val a = l.split(“ “)
(a(0), a(1))
}).toDF(“project_code”, “page_title”)
import scala.util.matching.Regex
val the = "^The*".r
From here, I'm unsure what to do. I think I have to perform a groupBy, using the regex as the criteria, and then aggregate. However I'm really unsure how to groupBy or aggregate using regex as criteria.
If you're looking for a sample of what the data looks like, you can find it here
Upvotes: 1
Views: 1764
Reputation: 27373
if you are interested only in the rows which match the criteria, use:
df
.where(col("page_title").rlike("^The.*"))
.count
if you want to see how many rows match and which don't:
df
.groupBy(col("page_title").rlike("^The.*"))
.count
.show()
Or alternatively:
df
.groupBy()
.agg(
count("*").as("total_count"),
count(when(col("page_title").rlike("^The.*"),lit(1))).as("match_count")
)
.show()
Upvotes: 1
Reputation: 31460
After converting to .toDF
you can use .startsWith
(or) .rlike
functions to filter
the matching rows from the dataframe.
Example:
spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").startsWith("The")).
filter(col("match")).
show()
//+------------+--------------------+-----+
//|project_code| page_title|match|
//+------------+--------------------+-----+
//| ab|The_DeFranco_Fami...| true|
//| ab|The_Great_War(195...| true|
//| af|The_Gods_Must_Be_...| true|
//| af| The_Guardian| true|
//| af| The_Killers| true|
//+------------+--------------------+-----+
//to count number of matching rows using .startsWith function
spark.sparkContext.textFile("/pagecounts-20160101-000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").startsWith("The")).
filter(col("match")).
count()
//(or) by using .rlike function
spark.sparkContext.textFile("/pagecounts-20160101-
000000").map(_.split(" ")).
map(c => (c(0),c(1))).
toDF("project_code","page_title").
withColumn("match", col("page_title").rlike("^The.*")).
filter(col("match")).
count()
//Long = 48684
Upvotes: 1