Reputation: 1229
I have data which looks like this:
| date       | id           | name     | version | assetIDs                                                            |
|------------|--------------|----------|---------|---------------------------------------------------------------------|
| 2021-02-11 | com.example1 | example1 | 1.2.3   | ["4er6d99j","43mdmdm","234ds234","w23432s","sdfdsfds","32erwer"]    |
| 2021-02-11 | com.example2 | example2 | 2.3.4   | ["3er6d99j","43mdmdm","MMMM234","werewer","MMM3333","dssdsssM"]     |
| 2021-02-11 | com.example3 | example3 | 2.3.4   | ["3er6d99j","43mdmdm","MMMM234","YYYY2222","mmmm3444","yy2222"]     |
| 2021-02-11 | com.example4 | example4 | 2.3.4   | ["3er6d99j","43mdmdm","MMMM234","222sdsss","fffff3333","ffffffff"]  |
| 2021-02-11 | com.example5 | example5 | 2.3.4   | ["3er6d99j","43mdmdm","MMMM234","Y222222","YYYYJJJJ2222","DJDDJSJ2"]|
| 2021-02-11 | com.example6 | example6 | 2.3.4   | ["3er6d99j","43mdmdm","MMMM234","Ydddddd","ssdfdfdfd","sdfdwsfw"]   |
I need to add two more columns, lowerCaseCount and upperCaseCount, based on the strings in assetIDs. The end result should be two additional columns holding the number of lowercase and uppercase strings for each id.
I've started off with the following, but it's not giving me the count I expect:
with data as (
select date, id, name, version, explode(assetIDs) as assetids
from assets_table
)
select dt.data, dt.id, dt.name dt.version, count(UPPER(assetids)) as upperCount, count(LOWER(assetids)) as lowerCount, as.assetids
from data dt
inner join assets_table as on dt.id = as.id
group by 1,2,3,4,7
This just returns the total number of strings in both the uppercase and lowercase columns.
What is the best way to do this? I can use Scala, Python and/or SQL, as I'm working in Databricks.
Upvotes: 0
Views: 734
Reputation: 32680
You can combine the `size` and `filter` functions to get the counts:
spark.sql("""
SELECT *,
size(filter(assetIDs, x -> x = lower(x))) as lowerCaseCount,
size(filter(assetIDs, x -> x = upper(x))) as upperCaseCount
FROM assets_table
""").show(truncate=False)
#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
#|date |id |name |version|assetIDs |lowerCaseCount|upperCaseCount|
#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
#|2021-02-11|com.example1|example1|1.2.3 |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer] |6 |0 |
#|2021-02-11|com.example2|example2|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM] |3 |2 |
#|2021-02-11|com.example3|example3|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222] |4 |2 |
#|2021-02-11|com.example4|example4|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff] |5 |1 |
#|2021-02-11|com.example5|example5|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2 |4 |
#|2021-02-11|com.example6|example6|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw] |4 |1 |
#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
Or using the DataFrame API:
from pyspark.sql import functions as F
df1 = df.withColumn("lowerCaseCount", F.expr("size(filter(assetIDs, x -> x = lower(x)))"))\
.withColumn("upperCaseCount", F.expr("size(filter(assetIDs, x -> x = upper(x)))"))
df1.show(truncate=False)
Upvotes: 1
Reputation: 42352
You can do a conditional aggregation. I'm not sure how you want to handle strings with mixed upper and lower case though.
df2 = df.selectExpr(
"*",
"aggregate(assetIDs, 0, (acc, x) -> acc + case when lower(x) = x then 1 else 0 end) as lowerCaseCount",
"aggregate(assetIDs, 0, (acc, x) -> acc + case when upper(x) = x then 1 else 0 end) as upperCaseCount"
)
df2.show(truncate=False)
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|date |id |name |version|assetIDs |lowerCaseCount|upperCaseCount|
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|2021-02-11|com.example1|example1|1.2.3 |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer] |6 |0 |
|2021-02-11|com.example2|example2|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM] |3 |2 |
|2021-02-11|com.example3|example3|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222] |4 |2 |
|2021-02-11|com.example4|example4|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff] |5 |1 |
|2021-02-11|com.example5|example5|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2 |4 |
|2021-02-11|com.example6|example6|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw] |4 |1 |
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
Another way of counting:
df2 = df.selectExpr(
"*",
"aggregate(assetIDs, 0, (acc, x) -> acc + case when x rlike '[a-z]' then 1 else 0 end) as lowerCaseCount",
"aggregate(assetIDs, 0, (acc, x) -> acc + case when x rlike '[A-Z]' then 1 else 0 end) as upperCaseCount"
)
df2.show(truncate=False)
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|date |id |name |version|assetIDs |lowerCaseCount|upperCaseCount|
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|2021-02-11|com.example1|example1|1.2.3 |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer] |6 |0 |
|2021-02-11|com.example2|example2|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM] |4 |3 |
|2021-02-11|com.example3|example3|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222] |4 |2 |
|2021-02-11|com.example4|example4|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff] |5 |1 |
|2021-02-11|com.example5|example5|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2 |4 |
|2021-02-11|com.example6|example6|2.3.4 |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw] |5 |2 |
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
Upvotes: 2