Dominik

Reputation: 307

pyspark: count number of occurrences of distinct elements in lists

I have the following data:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

data = {'date': ['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04', '2014-01-05', '2014-01-06'],
        'flat': ['A;A;B', 'D;P;E;P;P', 'H;X', 'P;Q;G', 'S;T;U', 'G;C;G']}

data = pd.DataFrame(data)
data['date'] = pd.to_datetime(data['date'])

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "500g") \
    .appName('my-pandasToSparkDF-app') \
    .getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.sparkContext.setLogLevel("OFF")

df = spark.createDataFrame(data)
new_frame = df.withColumn("list", F.split("flat", ";"))

I would like to add a new column which holds the number of occurrences of each distinct element (sorted in ascending order) and another column which holds the maximum of those counts:

+-------------------+-----------+---------------------+-----------+----+
|               date| flat      | list                |occurrences|max |
+-------------------+-----------+---------------------+-----------+----+
|2014-01-01 00:00:00|A;A;B      |['A','A','B']        |[1,2]      |2   |
|2014-01-02 00:00:00|D;P;E;P;P  |['D','P','E','P','P']|[1,1,3]    |3   |
|2014-01-03 00:00:00|H;X        |['H','X']            |[1,1]      |1   |
|2014-01-04 00:00:00|P;Q;G      |['P','Q','G']        |[1,1,1]    |1   |
|2014-01-05 00:00:00|S;T;U      |['S','T','U']        |[1,1,1]    |1   |
|2014-01-06 00:00:00|G;C;G      |['G','C','G']        |[1,2]      |2   |  
+-------------------+-----------+---------------------+-----------+----+

Thank you very much!

Upvotes: 3

Views: 1630

Answers (2)

murtihash

Reputation: 8410

For Spark 2.4+ this can be achieved without multiple groupBy and aggregation steps (which are expensive shuffle operations on big data), using a single expression built from the higher-order functions transform and aggregate. This should be the canonical solution for Spark 2.4.

from pyspark.sql import functions as F

df = spark.createDataFrame(data)

# for every distinct element x of list, count its occurrences with aggregate,
# then sort the resulting counts with array_sort
df.withColumn("list", F.split("flat", ";"))\
  .withColumn("occurrences", F.expr("""array_sort(transform(array_distinct(list), x -> aggregate(list, 0, (acc, t) -> acc + IF(t = x, 1, 0))))"""))\
  .withColumn("max", F.array_max("occurrences"))\
  .show()
+-------------------+---------+---------------+-----------+---+
|               date|     flat|           list|occurrences|max|
+-------------------+---------+---------------+-----------+---+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|     [1, 2]|  2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  [1, 1, 3]|  3|
|2014-01-03 00:00:00|      H;X|         [H, X]|     [1, 1]|  1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  [1, 1, 1]|  1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  [1, 1, 1]|  1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|     [1, 2]|  2|
+-------------------+---------+---------------+-----------+---+
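
On Spark 3.1+ the same higher-order-function logic is also exposed through the Python API (F.transform, F.aggregate, F.array_distinct, F.array_sort), so the F.expr string can be avoided. A minimal sketch, assuming Spark 3.1+ and the same df as above:

from pyspark.sql import functions as F

# build the sorted per-element counts as a Column expression
occurrences = F.array_sort(
    F.transform(
        F.array_distinct("list"),
        # for each distinct element x, fold over the full list and count matches
        lambda x: F.aggregate(
            "list",
            F.lit(0),
            lambda acc, t: acc + F.when(t == x, 1).otherwise(0),
        ),
    )
)

df.withColumn("list", F.split("flat", ";"))\
  .withColumn("occurrences", occurrences)\
  .withColumn("max", F.array_max("occurrences"))\
  .show()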

Upvotes: 2

Sreeram TP

Reputation: 11937

You can do this with a couple of groupBy statements.

To start with, you have a dataframe like this:

+-------------------+---------+---------------+
|               date|     flat|           list|
+-------------------+---------+---------------+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|
|2014-01-03 00:00:00|      H;X|         [H, X]|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|
+-------------------+---------+---------------+

Explode the list column using F.explode like this:

new_frame_exp = new_frame.withColumn("exp", F.explode('list'))

Then, your dataframe will look like this:

+-------------------+---------+---------------+---+
|               date|     flat|           list|exp|
+-------------------+---------+---------------+---+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  A|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  A|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  B|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  D|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  E|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|
|2014-01-03 00:00:00|      H;X|         [H, X]|  H|
|2014-01-03 00:00:00|      H;X|         [H, X]|  X|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  P|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  Q|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  G|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  S|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  T|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  U|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  G|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  C|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  G|
+-------------------+---------+---------------+---+

On this dataframe, do a groupBy like this:

new_frame_exp_agg = new_frame_exp.groupBy('date', 'flat', 'list', 'exp').count()

Then you will have a dataframe like this:

+-------------------+---------+---------------+---+-----+
|               date|     flat|           list|exp|count|
+-------------------+---------+---------------+---+-----+
|2014-01-03 00:00:00|      H;X|         [H, X]|  H|    1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  G|    1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  U|    1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  T|    1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  P|    1|
|2014-01-03 00:00:00|      H;X|         [H, X]|  X|    1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  G|    2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  E|    1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  C|    1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  S|    1|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  B|    1|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  D|    1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  Q|    1|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  A|    2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|    3|
+-------------------+---------+---------------+---+-----+

On this dataframe, apply one more level of aggregation to collect the counts into a list and find the max, like this:

res = new_frame_exp_agg.groupBy('date', 'flat', 'list').agg(
    F.collect_list('count').alias('occurrences'),
    F.max('count').alias('max'))

res.orderBy('date').show()


+-------------------+---------+---------------+-----------+---+
|               date|     flat|           list|occurrences|max|
+-------------------+---------+---------------+-----------+---+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|     [2, 1]|  2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  [1, 1, 3]|  3|
|2014-01-03 00:00:00|      H;X|         [H, X]|     [1, 1]|  1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]| [1, 1, 1]|  1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]| [1, 1, 1]|  1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|     [1, 2]|  2|
+-------------------+---------+---------------+-----------+---+

If you want the occurrences column sorted, you can apply F.array_sort to it if you are on Spark 2.4+; otherwise you will have to write a UDF for that, as sketched below.
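
A minimal sketch of both options, assuming the res dataframe from the previous step (the UDF name sort_counts is just illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType

# Spark 2.4+: sort the array column natively
res_sorted = res.withColumn('occurrences', F.array_sort('occurrences'))

# Older versions: a small UDF that sorts the collected counts
sort_counts = F.udf(lambda counts: sorted(counts), ArrayType(LongType()))
res_sorted = res.withColumn('occurrences', sort_counts('occurrences'))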

Upvotes: 0
