How to count how many uppercase & lowercase strings are within an array column of a given id

Question

I have data which looks like this:

|  date      |     id       |    name    |  version  |                              assetIDs                               |
|------------| -------------| -----------|-----------|---------------------------------------------------------------------| 
| 2021-02-11 | com.example1 |   example1 |   1.2.3   |["4er6d99j","43mdmdm","234ds234","w23432s","sdfdsfds","32erwer"]    |
| 2021-02-11 | com.example2 |   example2 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","werewer","MMM3333","dssdsssM"]      |
| 2021-02-11 | com.example3 |   example3 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","YYYY2222","mmmm3444","yy2222"]      |
| 2021-02-11 | com.example4 |   example4 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","222sdsss","fffff3333","ffffffff"]   |
| 2021-02-11 | com.example5 |   example5 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","Y222222","YYYYJJJJ2222","DJDDJSJ2"] |
| 2021-02-11 | com.example6 |   example6 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","Ydddddd","ssdfdfdfd","sdfdwsfw"]    |

I need to be able to have two more columns one lowerCaseCount, one upperCaseCount based on the strings in assetIDs.

The end result will be two additional columns with the lowercase and uppercase string amounts in each id.

I've started off by doing this but it's not giving me the count I expect:

with data as (
      select date, id, name, version, explode(assetIDs) as assetids 
      from assets_table 
)

select dt.data, dt.id, dt.name dt.version, count(UPPER(assetids)) as upperCount,  count(LOWER(assetids)) as lowerCount, as.assetids
        from data dt
        inner join assets_table as on dt.id = as.id
        group by 1,2,3,4,7

This is just returning me the count of strings so total amount in both uppercase and lower case columns

How can I do this the best way? I can do this in Scala, Python or/and using SQL as I'm doing it using databricks.

blackbishop · Accepted Answer

You can combine size and filter functions to get the counts :

spark.sql("""
     SELECT *, 
            size(filter(assetIDs, x -> x = lower(x))) as lowerCaseCount,
            size(filter(assetIDs, x -> x = upper(x))) as upperCaseCount
     FROM   assets_table
""").show(truncate=False)

#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
#|date      |id          |name    |version|assetIDs                                                     |lowerCaseCount|upperCaseCount|
#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
#|2021-02-11|com.example1|example1|1.2.3  |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer]    |6             |0             |
#|2021-02-11|com.example2|example2|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM]     |3             |2             |
#|2021-02-11|com.example3|example3|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222]     |4             |2             |
#|2021-02-11|com.example4|example4|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff]  |5             |1             |
#|2021-02-11|com.example5|example5|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2             |4             |
#|2021-02-11|com.example6|example6|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw]   |4             |1             |
#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

Or using DataFrame API:

from pyspark.sql import functions as F

df1 = df.withColumn("lowerCaseCount", F.expr("size(filter(assetIDs, x -> x = lower(x)))"))\
    .withColumn("upperCaseCount", F.expr("size(filter(assetIDs, x -> x = upper(x)))"))

df1.show(truncate=False)

How to count how many uppercase & lowercase strings are within an array column of a given id

Answers (2)

Related Questions

How to count how many uppercase &amp; lowercase strings are within an array column of a given id

Answers (2)

Related Questions

How to count how many uppercase & lowercase strings are within an array column of a given id