Reputation: 3906
I have a dataframe as shown below. I want to rename columns based on regex patterns.
patterns = ["price-usd-([0-9]+)", "list_price_([0-9]+)", "price_per_([0-9]+)_units", "pricefor([0-9]+)", "([0-9]+)_plus_price", "break_price_([0-9]+)", "price_break_pricing_([a-z]+)"]
Based on the above patterns i want to rename columns in dataframe as below.
------------------------------------------------------------------------------------------------------------------------------------------
| item_name | price-usd-1 | break_price_7 | pricefor5 | price_per_9_units | price_break_pricing_a | 2_plus_price | list_price_8 |
------------------------------------------------------------------------------------------------------------------------------------------
| Samsung Z | 10000 | 5 | 9000 | 10 | 7000 | 4 | 21 |
| Moto G4 | 12000 | 10 | 10000 | 20 | 6000 | 3 | 43 |
| Mi 4i | 15000 | 8 | 12000 | 20 | 10000 | 5 | 25 |
| Moto G3 | 20000 | 5 | 18000 | 12 | 15000 | 10 | 15 |
------------------------------------------------------------------------------------------------------------------------------------------
Output:
----------------------------------------------------------------------------------------------------------------------
| item_name | price_1 | price_7 | price_5 | price_9 | price_a | price_2 | price_8 |
----------------------------------------------------------------------------------------------------------------------
| Samsung Z | 10000 | 5 | 9000 | 10 | 7000 | 4 | 21 |
| Moto G4 | 12000 | 10 | 10000 | 20 | 6000 | 3 | 43 |
| Mi 4i | 15000 | 8 | 12000 | 20 | 10000 | 5 | 25 |
| Moto G3 | 20000 | 5 | 18000 | 12 | 15000 | 10 | 15 |
----------------------------------------------------------------------------------------------------------------------
Upvotes: 0
Views: 221
Reputation: 26676
I would go your way. I would use regex to extract values and then rename.
Data
df=spark.createDataFrame ([('Samsung Z ', 10000 , 5 , 9000 , 10 , 7000 , 20 , 'amazon.com') ,
('Moto G4' , 12000 , 10 , 10000 , 20 , 6000 , 50 , 'ebay.com' ) ,
('Mi 4i ' , 15000 , 8 , 12000 , 20 , 10000 , 25 ,' deals.com') ,
( 'Moto G3' , 20000 , 5 , 18000 , 12 , 15000 , 30 , 'ebay.com' ) ] ,
('item_name' , ' price-usd-1' , 'break_price_7 ' , 'pricefor5 ' , 'price_per_9_units' , 'price_3' , 'price_break_pricing_a6' , '2_plus_price' ))
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
| item_name| price-usd-1|break_price_7 |pricefor5 |price_per_9_units|price_3|price_break_pricing_a6|2_plus_price|
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
|Samsung Z | 10000| 5| 9000| 10| 7000| 20| amazon.com|
| Moto G4| 12000| 10| 10000| 20| 6000| 50| ebay.com|
| Mi 4i | 15000| 8| 12000| 20| 10000| 25| deals.com|
| Moto G3| 20000| 5| 18000| 12| 15000| 30| ebay.com|
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
Solution
import re
x = ['_'.join(sorted(re.findall(r'price|\d', x),reverse=True)) for x in df.columns if x!='item_name']#extract price and digits into a list, and concat
df.toDF('item_name',*x).show()#Pass new names into df
+----------+-------+-------+-------+-------+-------+-------+----------+
| item_name|price_1|price_7|price_5|price_9|price_3|price_6| price_2|
+----------+-------+-------+-------+-------+-------+-------+----------+
|Samsung Z | 10000| 5| 9000| 10| 7000| 20|amazon.com|
| Moto G4| 12000| 10| 10000| 20| 6000| 50| ebay.com|
| Mi 4i | 15000| 8| 12000| 20| 10000| 25| deals.com|
| Moto G3| 20000| 5| 18000| 12| 15000| 30| ebay.com|
+----------+-------+-------+-------+-------+-------+-------+----------+
Upvotes: 1