Sam
Sam

Reputation: 1229

How to count how many uppercase & lowercase strings are within an array column of a given id

I have data which looks like this:

|  date      |     id       |    name    |  version  |                              assetIDs                               |
|------------| -------------| -----------|-----------|---------------------------------------------------------------------| 
| 2021-02-11 | com.example1 |   example1 |   1.2.3   |["4er6d99j","43mdmdm","234ds234","w23432s","sdfdsfds","32erwer"]    |
| 2021-02-11 | com.example2 |   example2 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","werewer","MMM3333","dssdsssM"]      |
| 2021-02-11 | com.example3 |   example3 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","YYYY2222","mmmm3444","yy2222"]      |
| 2021-02-11 | com.example4 |   example4 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","222sdsss","fffff3333","ffffffff"]   |
| 2021-02-11 | com.example5 |   example5 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","Y222222","YYYYJJJJ2222","DJDDJSJ2"] |
| 2021-02-11 | com.example6 |   example6 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","Ydddddd","ssdfdfdfd","sdfdwsfw"]    |

I need to be able to have two more columns one lowerCaseCount, one upperCaseCount based on the strings in assetIDs.

The end result will be two additional columns with the lowercase and uppercase string amounts in each id.

I've started off by doing this but it's not giving me the count I expect:

with data as (
      select date, id, name, version, explode(assetIDs) as assetids 
      from assets_table 
)

select dt.data, dt.id, dt.name dt.version, count(UPPER(assetids)) as upperCount,  count(LOWER(assetids)) as lowerCount, as.assetids
        from data dt
        inner join assets_table as on dt.id = as.id
        group by 1,2,3,4,7

This is just returning me the count of strings so total amount in both uppercase and lower case columns

How can I do this the best way? I can do this in Scala, Python or/and using SQL as I'm doing it using databricks.

Upvotes: 0

Views: 734

Answers (2)

blackbishop
blackbishop

Reputation: 32680

You can combine size and filter functions to get the counts :

spark.sql("""
     SELECT *, 
            size(filter(assetIDs, x -> x = lower(x))) as lowerCaseCount,
            size(filter(assetIDs, x -> x = upper(x))) as upperCaseCount
     FROM   assets_table
""").show(truncate=False)

#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
#|date      |id          |name    |version|assetIDs                                                     |lowerCaseCount|upperCaseCount|
#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
#|2021-02-11|com.example1|example1|1.2.3  |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer]    |6             |0             |
#|2021-02-11|com.example2|example2|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM]     |3             |2             |
#|2021-02-11|com.example3|example3|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222]     |4             |2             |
#|2021-02-11|com.example4|example4|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff]  |5             |1             |
#|2021-02-11|com.example5|example5|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2             |4             |
#|2021-02-11|com.example6|example6|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw]   |4             |1             |
#+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

Or using DataFrame API:

from pyspark.sql import functions as F

df1 = df.withColumn("lowerCaseCount", F.expr("size(filter(assetIDs, x -> x = lower(x)))"))\
    .withColumn("upperCaseCount", F.expr("size(filter(assetIDs, x -> x = upper(x)))"))

df1.show(truncate=False)

Upvotes: 1

mck
mck

Reputation: 42352

You can do a conditional aggregation. I'm not sure how you want to handle strings with mixed upper and lower case though.

df2 = df.selectExpr(
    "*",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when lower(x) = x then 1 else 0 end) as lowerCaseCount",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when upper(x) = x then 1 else 0 end) as upperCaseCount"
)

df2.show(truncate=False)
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|date      |id          |name    |version|assetIDs                                                     |lowerCaseCount|upperCaseCount|
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|2021-02-11|com.example1|example1|1.2.3  |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer]    |6             |0             |
|2021-02-11|com.example2|example2|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM]     |3             |2             |
|2021-02-11|com.example3|example3|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222]     |4             |2             |
|2021-02-11|com.example4|example4|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff]  |5             |1             |
|2021-02-11|com.example5|example5|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2             |4             |
|2021-02-11|com.example6|example6|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw]   |4             |1             |
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

Another way of counting:

df2 = df.selectExpr(
    "*",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when x rlike '[a-z]' then 1 else 0 end) as lowerCaseCount",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when x rlike '[A-Z]' then 1 else 0 end) as upperCaseCount"
)

df2.show(truncate=False)
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|date      |id          |name    |version|assetIDs                                                     |lowerCaseCount|upperCaseCount|
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|2021-02-11|com.example1|example1|1.2.3  |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer]    |6             |0             |
|2021-02-11|com.example2|example2|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM]     |4             |3             |
|2021-02-11|com.example3|example3|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222]     |4             |2             |
|2021-02-11|com.example4|example4|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff]  |5             |1             |
|2021-02-11|com.example5|example5|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2             |4             |
|2021-02-11|com.example6|example6|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw]   |5             |2             |
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

Upvotes: 2

Related Questions