Spark: How to Group by based on string pattern in Scala?

Question

I have a data frame :

[data :String, itemType:String, itemClass:String, itemGroup:String]

where the itemType, itemClass and itemGroup contains comma separated string. I exploded them and created a single row for each value.

df.withColumn("itemType", explode(split($"itemType", "[,]")))
   .withColumn("itemGroup", explode(split($"itemGroup", "[,]")))
   .withColumn("itemClass", explode(split($"itemClass", "[,]")))

I am trying to group by the values of itemType, itemGroup and itemClass.

df.groupBy($"itemType".contains("item class ")).count()

but this just gives me as true and null but not grouping by the pattern. Is there a way to group by most common pattern but not exact match.

Praveen L · Accepted Answer

You can group in this way based on regex. You need to write your own regex for your data and can group as below.

Here is one way of grouping with some sample data.

assume your dataframe id df and data looks like below.

Before grouping: df.show()

+----+----------+------------+-----------+
|data|  itemType|   itemGroup|  itemClass|
+----+----------+------------+-----------+
|   1|type1_dgdf| group1_flkk|class1_gdfg|
|   1|type1_jhgj| group1_fgfd|class1_grtt|
|   1|type1_657g| group1_6gfh|class1_342e|
|   1| type1_qer|  group2_wqw|class1_fgfv|
|   2|type2_seds|  group2_wqw|class2_fiuy|
|   2|  type2_65|group2_wuyuy|class2_232e|
|   2| type2_ffg| group2_wyty|class2_fere|
+----+----------+------------+-----------+

After grouping :

+--------+---------+---------+-----+
|itemType|itemGroup|itemClass|count|
+--------+---------+---------+-----+
|   type1|   group2|   class1|    1|
|   type2|   group2|   class2|    3|
|   type1|   group1|   class1|    3|
+--------+---------+---------+-----+

code:

import org.apache.spark.sql.functions.regexp_extract

df.groupBy(
            regexp_extract(df("itemType"), "(type\d).*$", 1).alias("itemType"),  // here pattern matches for starting with `type` and then any number using `\d` (\d for windows).. same logic for others as well.
            regexp_extract(df("itemGroup"), "(group\d).*$", 1).alias("itemGroup"),
            regexp_extract(df("itemClass"), "(class\d).*$", 1).alias("itemClass"))
            .count().show()

Spark: How to Group by based on string pattern in Scala?

Answers (1)

Related Questions