Apache Spark : Make a multiple columns groupBy on a Dataset works withtin partition

Question

I determine the number of employees and enterprises of activities sectors, city by city:

|codeCommune|nomCommune          |regroupement|section|libelleAPE                                                                                                                       |nombreEntreprises|nombreSalaries|nombreActifs|
+-----------+--------------------+------------+-------+---------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------+------------+
|14654      |Saint-Pierre-en-Auge|84.11Z      |O      |Administration publique générale                                                                                                 |3                |153.5         |169.5       |
|14654      |Saint-Pierre-en-Auge|16.24Z      |C      |Fabrication d'emballages en bois                                                                                                 |1                |149.5         |150.5       |
|14654      |Saint-Pierre-en-Auge|10.11Z      |C      |Transformation et conservation de la viande de boucherie                                                                         |1                |149.5         |150.5       |

with a grouping level (regroupement below) that is set by the user:

+-----------+--------------------+------------+-------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------+------------+
|codeCommune|nomCommune          |regroupement|section|libelleAPE                                                                                                                                |nombreEntreprises|nombreSalaries|nombreActifs|
+-----------+--------------------+------------+-------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------+------------+
|14654      |Saint-Pierre-en-Auge|10          |C      |Industries alimentaires                                                                                                                   |16               |208.0         |225.0       |
|14654      |Saint-Pierre-en-Auge|86          |Q      |Activités pour la santé humaine                                                                                                           |46               |169.5         |218.5       |
|14654      |Saint-Pierre-en-Auge|84          |O      |Administration publique et défense ; sécurité sociale obligatoire                                                                         |5                |153.5         |171.5       |

The job is done that way:

From a Dataset of enterprises and establishments, partitioned by a department code (roughly the two first characters of a city code), these columns are selected:
- city_code,
- city_name,
- grouping (the part of the activity code we are retaining: 84.11Z or 84),
- section (a code summarizing the sector of an activity: industrial, commercial, etc.),
- activity_description,
- siren (the enterprise identifier: an enterprise might have many establishments),
- number_of_workers,
- number_of_actives_people
a groupBy is done :

RelationalGroupedDataset group = enterprisesAndEstablishments
   .groupBy("city_code", "city_name", "grouping", "section", "activity_description");

I perform my calculations by an aggregation then:

group.agg(countDistinct("siren").as("nombreEntreprises"), 
   sum("number_of_workers").as("nombreSalaries"),
   sum("number_of_actives_people").as("nombreActifs"));

My problem is that the groupBy method doesn't care about the dataset partitions and is gathering its data from any partition of the dataset enterprisesAndEstablishments and sorts globally a massive amount of data,
when targeting only a part would be more efficient: all activities in this sample are in the partition [codeDepartement=14].

I would like it to respect these partitions and do this groupBy at their levels, to avoid shuffle.

I'm looking for the sortWithPartitions companion of groupBy. Something that would be called groupWithinPartitions but I don't find it.

What is the best way to achieve what I'm looking for,
or if there's no tool, what alternative should I choose ?

Apache Spark : Make a multiple columns groupBy on a Dataset works withtin partition

Answers (1)

Related Questions