Reputation: 57
Using this query:
sql("SELECT _location, count(1) FROM tablaTemporal group by _location order by 2 desc" )
I receive this output:
+--------------------------------+--------+
|_location |count(1)|
+--------------------------------+--------+
|London, United Kingdom |15 |
|United States |12 |
|Bangalore, India |8 |
|Hyderabad, India |7 |
|Paris, France |6 |
|San Francisco, CA, United States|6 |
|Mountain View, CA, United States|4 |
|Pune, India |4 |
|Bengaluru, Karnataka, India |3 |
+--------------------------------+--------+
But the result I need is:
+--------------------------------+--------+
|_location |count(1)|
+--------------------------------+--------+
|United States |22 |
|India |22 |
|United Kingdom |15 |
|France |6 |
+--------------------------------+--------+
Hence, I need to use some statement like:
sql("SELECT SubstringOfLocationFromCharComma(_location), count(1) FROM tablaTemporal group by _location order by 2 desc" )
How can I extract the last element from a comma-delimited string?
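For reference, Spark SQL's built-in substring_index function can express that extraction directly inside the query string. A minimal sketch, assuming the same tablaTemporal temporary table and that spark is the active SparkSession:
// substring_index(col, ',', -1) keeps the text after the last comma
// (or the whole value when there is no comma); trim drops the leading space.
spark.sql("""
  SELECT trim(substring_index(_location, ',', -1)) AS country, count(1)
  FROM tablaTemporal
  GROUP BY trim(substring_index(_location, ',', -1))
  ORDER BY 2 DESC
""").show(false)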
Upvotes: 0
Views: 1244
Reputation: 10092
Since the country name is the element after the last comma, you can also do something like this:
df.show(false)
+--------------------------------+
|a |
+--------------------------------+
|Mountain View, CA, United States|
|Pune, India |
|Bengaluru, Karnataka, India |
+--------------------------------+
df.withColumn("a", split($"a", ",")).withColumn("a", expr("a[size(a) - 1]")).show
+--------------+
|a |
+--------------+
| United States|
| India |
| India |
+--------------+
You can then follow this with a groupBy($"a").agg(sum($"count(1)").as("count")), summing the per-location count(1) values already produced by your query, to achieve the desired result.
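Putting it together against the raw data, a minimal spark-shell style sketch (the locations DataFrame and its contents here are just a stand-in for your tablaTemporal rows):
import org.apache.spark.sql.functions._

// Stand-in for the raw tablaTemporal rows, one _location value per row.
val locations = Seq(
  "London, United Kingdom",
  "United States",
  "Bengaluru, Karnataka, India"
).toDF("_location")

locations
  .withColumn("country", split($"_location", ","))                  // array of comma-separated parts
  .withColumn("country", trim(expr("country[size(country) - 1]")))  // last part, without the leading space
  .groupBy($"country")
  .count()
  .orderBy(desc("count"))
  .show(false)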
Upvotes: 2
Reputation: 35229
You can use regexp_extract:
import org.apache.spark.sql.functions._
val df = Seq(
  "London, United Kingdom", "Bengaluru, Karnataka, India"
).toDF("_location")
df.select(regexp_extract($"_location", ".*,([^,]*)$", 1).alias("country")).show
// +---------------+
// | country|
// +---------------+
// | United Kingdom|
// | India|
// +---------------+
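One caveat: a value with no comma at all, like the plain United States rows in the question, does not match .*,([^,]*)$, so regexp_extract returns an empty string for it. A sketch of a variant pattern that covers both cases (df2 is just an illustrative DataFrame):
val df2 = Seq(
  "United States", "Bengaluru, Karnataka, India"
).toDF("_location")

// ([^,]+)$ captures everything after the last comma, or the whole value
// when there is no comma; trim removes the space left behind by ", ".
df2.select(trim(regexp_extract($"_location", "([^,]+)$", 1)).alias("country")).show
// yields "United States" and "India" respectively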
Upvotes: 0