Reputation: 1634
I have a data frame :
[data :String, itemType:String, itemClass:String, itemGroup:String]
where the itemType, itemClass and itemGroup contains comma separated string. I exploded them and created a single row for each value.
df.withColumn("itemType", explode(split($"itemType", "[,]")))
.withColumn("itemGroup", explode(split($"itemGroup", "[,]")))
.withColumn("itemClass", explode(split($"itemClass", "[,]")))
I am trying to group by the values of itemType, itemGroup and itemClass.
df.groupBy($"itemType".contains("item class ")).count()
but this just gives me as true and null but not grouping by the pattern. Is there a way to group by most common pattern but not exact match.
Upvotes: 0
Views: 1320
Reputation: 987
You can group in this way based on regex. You need to write your own regex for your data and can group as below.
Here is one way of grouping with some sample data.
assume your dataframe id df
and data looks like below.
Before grouping: df.show()
+----+----------+------------+-----------+
|data| itemType| itemGroup| itemClass|
+----+----------+------------+-----------+
| 1|type1_dgdf| group1_flkk|class1_gdfg|
| 1|type1_jhgj| group1_fgfd|class1_grtt|
| 1|type1_657g| group1_6gfh|class1_342e|
| 1| type1_qer| group2_wqw|class1_fgfv|
| 2|type2_seds| group2_wqw|class2_fiuy|
| 2| type2_65|group2_wuyuy|class2_232e|
| 2| type2_ffg| group2_wyty|class2_fere|
+----+----------+------------+-----------+
After grouping :
+--------+---------+---------+-----+
|itemType|itemGroup|itemClass|count|
+--------+---------+---------+-----+
| type1| group2| class1| 1|
| type2| group2| class2| 3|
| type1| group1| class1| 3|
+--------+---------+---------+-----+
code:
import org.apache.spark.sql.functions.regexp_extract
df.groupBy(
regexp_extract(df("itemType"), "(type\\d).*$", 1).alias("itemType"), // here pattern matches for starting with `type` and then any number using `\d` (\\d for windows).. same logic for others as well.
regexp_extract(df("itemGroup"), "(group\\d).*$", 1).alias("itemGroup"),
regexp_extract(df("itemClass"), "(class\\d).*$", 1).alias("itemClass"))
.count().show()
Upvotes: 1