Reputation: 3365
I'm learning PySpark. I used it to read a CSV file into a DataFrame (with column names):
>>> example_df.show(n=5)
+---------------------+------+-------------+---------+---------+-------------+------------+
|                   id| price|       street| locality|town_city|     district|     country|
+---------------------+------+-------------+---------+---------+-------------+------------+
|                bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|                aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|                bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|                aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|                cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+---------------------+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows
For rows with the same location (i.e. the same values in street, locality, town_city, district, and country), I want to create a new location_id and group the corresponding rows together, something like this:
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
|location_id|                   id| price|       street| locality|town_city|     district|     country|
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
|          0|                aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|          0|                aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|          1|                bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|          1|                bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|          2|                cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows
...
I know how to create the new column:
from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn("location_id", monotonically_increasing_id())
This just creates a new column with 0, 1, 2, etc., but how can I group the data for the same location and also give each location a unique location_id? Thanks.
Upvotes: 0
Views: 259
Reputation: 14905
If you just need an id for each location, you can use hash:
from pyspark.sql import functions as F

example_df.withColumn('location_id',
                      F.hash('street', 'locality', 'town_city', 'district', 'country')) \
          .show()
Output:
+-----+------+-------------+---------+---------+-------------+------------+-----------+
|   id| price|       street| locality|town_city|     district|     country|location_id|
+-----+------+-------------+---------+---------+-------------+------------+-----------+
|bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|  406601501|
|aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|  -97363104|
|bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|  406601501|
|aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|  -97363104|
|cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| 1396052469|
+-----+------+-------------+---------+---------+-------------+------------+-----------+
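If you also want rows with the same location displayed together, as in the expected output in the question, you can sort by the new column, for example:

example_df.withColumn('location_id',
                      F.hash('street', 'locality', 'town_city', 'district', 'country')) \
          .orderBy('location_id') \
          .show()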
Edit: please remember that the hash function might cause hash collisions, i.e. two different locations could occasionally receive the same id.
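If that risk matters for your data, a collision-free sketch is to rank the distinct locations with dense_rank over a window of the location columns (note this assigns ids in sort order of those columns, so the exact numbers may differ from the question's example):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order all rows by the location columns; a window without partitionBy
# moves all data to a single partition, so this suits smaller datasets.
w = Window.orderBy('street', 'locality', 'town_city', 'district', 'country')

# dense_rank numbers distinct locations consecutively starting at 1;
# subtract 1 to start at 0 as in the desired output.
example_df.withColumn('location_id', F.dense_rank().over(w) - 1).show()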
Upvotes: 3