Reputation: 57
Using this query:
sql("SELECT _location, count(1) FROM tablaTemporal group by _location order by 2 desc" )
I receive this output:
+--------------------------------+--------+
|_location |count(1)|
+--------------------------------+--------+
|London, United Kingdom |15 |
|United States |12 |
|Bangalore, India |8 |
|Hyderabad, India |7 |
|Paris, France |6 |
|San Francisco, CA, United States|6 |
|Mountain View, CA, United States|4 |
|Pune, India |4 |
|Bengaluru, Karnataka, India |3 |
+--------------------------------+--------+
But the result I need is:
+--------------------------------+--------+
|_location |count(1)|
+--------------------------------+--------+
|United States |22 |
|India |22 |
|United Kingdom |15 |
|France |6 |
+--------------------------------+--------+
Hence, I need to use some statement like:
sql("SELECT SubstringOfLocationFromCharComma(_location), count(1) FROM tablaTemporal group by _location order by 2 desc" )
How can I extract the last element from a comma-delimited string?
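For reference, Spark SQL's built-in substring_index function can express that extraction directly inside the query string. A minimal sketch, assuming the same tablaTemporal temporary table and that spark is the active SparkSession:
// substring_index(col, ',', -1) keeps the text after the last comma
// (or the whole value when there is no comma); trim drops the leading space.
spark.sql("""
  SELECT trim(substring_index(_location, ',', -1)) AS country, count(1)
  FROM tablaTemporal
  GROUP BY trim(substring_index(_location, ',', -1))
  ORDER BY 2 DESC
""").show(false)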
Upvotes: 0
Views: 1244
Reputation: 10092
Since the country name is the element after the last comma, you can also do something like this:
df.show(false)
+--------------------------------+
|a |
+--------------------------------+
|Mountain View, CA, United States|
|Pune, India |
|Bengaluru, Karnataka, India |
+--------------------------------+
df.withColumn("a", split($"a", ",")).withColumn("a", expr("a[size(a) - 1]")).show
+--------------+
|a |
+--------------+
| United States|
| India |
| India |
+--------------+
You can then follow this with a groupBy($"a").agg(sum($"count(1)").as("count")), summing the per-location count(1) values already produced by your query, to achieve the desired result.
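Putting it together against the raw data, a minimal spark-shell style sketch (the locations DataFrame and its contents here are just a stand-in for your tablaTemporal rows):
import org.apache.spark.sql.functions._

// Stand-in for the raw tablaTemporal rows, one _location value per row.
val locations = Seq(
  "London, United Kingdom",
  "United States",
  "Bengaluru, Karnataka, India"
).toDF("_location")

locations
  .withColumn("country", split($"_location", ","))                  // array of comma-separated parts
  .withColumn("country", trim(expr("country[size(country) - 1]")))  // last part, without the leading space
  .groupBy($"country")
  .count()
  .orderBy(desc("count"))
  .show(false)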
Upvotes: 2
Reputation: 35229
You can use regexp_extract:
import org.apache.spark.sql.functions._
val df = Seq(
  "London, United Kingdom", "Bengaluru, Karnataka, India"
).toDF("_location")
df.select(regexp_extract($"_location", ".*,([^,]*)$", 1).alias("country")).show
// +---------------+
// | country|
// +---------------+
// | United Kingdom|
// | India|
// +---------------+
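One caveat: a value with no comma at all, like the plain United States rows in the question, does not match .*,([^,]*)$, so regexp_extract returns an empty string for it. A sketch of a variant pattern that covers both cases (df2 is just an illustrative DataFrame):
val df2 = Seq(
  "United States", "Bengaluru, Karnataka, India"
).toDF("_location")

// ([^,]+)$ captures everything after the last comma, or the whole value
// when there is no comma; trim removes the space left behind by ", ".
df2.select(trim(regexp_extract($"_location", "([^,]+)$", 1)).alias("country")).show
// yields "United States" and "India" respectively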
Upvotes: 0