Reputation: 1624
I have a DataFrame with the following columns:
df =
itemType count
it_shampoo 5
it_books 5
it_mm 5
{it_mm} 5
it_books it_books 5
{=it_books} it_books 5
I need to get:
itemType count
it_shampoo 5
it_books 5
it_mm 5
it_mm 5
it_books 5
it_books 5
How do I replace values such as it_books it_books and {=it_books} it_books with it_books? The item type always follows the prefix it_.
Upvotes: 0
Views: 2274
Reputation: 8711
The regex below also works:
scala> val df = Seq(("it_shampoo",5),
| ("it_books",5),
| ("it_mm",5),
| ("{it_mm}",5),
| ("it_books it_books",5),
| ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]
scala> df.select(regexp_replace('itemType, """.*\b(\S+)\b(.*)$""", "$1").as("replaced"), 'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+
scala>
Upvotes: 0
Reputation: 1091
Try applying the regex ^.*?(it_[\w]+).*$ to itemType and replacing the whole match with the first captured group, $1.
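To check how this pattern behaves on the sample values from the question, here is a minimal sketch in plain Scala, without Spark. The object name ItemTypeCleanup and the helper clean are hypothetical names used only for illustration. It relies on String.replaceAll, which uses Java regular expressions, and assumes Spark's regexp_replace treats the pattern and the $1 replacement the same way.

```scala
object ItemTypeCleanup {
  // Pattern from the answer: lazily skip any prefix, capture the first
  // it_<word characters> token, then consume the rest of the string.
  val pattern: String = """^.*?(it_[\w]+).*$"""

  // Hypothetical helper: replace the whole value with the captured token.
  // Values that contain no it_ token do not match and are returned unchanged.
  def clean(raw: String): String = raw.replaceAll(pattern, "$1")

  def main(args: Array[String]): Unit = {
    // Sample values taken from the question.
    val samples = Seq(
      "it_shampoo", "it_books", "it_mm",
      "{it_mm}", "it_books it_books", "{=it_books} it_books"
    )
    samples.foreach(s => println(s"$s -> ${clean(s)}"))
  }
}
```

In a Spark session the equivalent call would presumably be df.withColumn("itemType", regexp_replace(col("itemType"), pattern, "$1")), with regexp_replace and col imported from org.apache.spark.sql.functions.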
Upvotes: 1