Reputation: 6149
Below is my sample dataframe for household things.
Here W represents Wooden G represents Glass and P represents Plastic, and different items are classified in that category. So I want to identify which items falls in W,G,P categories. As an initial step ,I tried classifying it for Chair
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
('W-Chair',''),
('W-Shelf;G-Cup;P-Chair',''),
('G-Cup;P-ShowerCap;W-Board','')],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried to do it for one condition where I can mark it as W, But I am not getting expected results,may be my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in pySpark
Expected output
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks @mck - for the solution.
Update In addition to that I was trying to analyse more on regexp_extract option.So altered the sample set
M = sqlContext.createDataFrame([('Wooden|Chair',''),
('Wooden|Cup;Glass|Chair',''),
('Wooden|Cup;Glass|Showercap;Plastic|Chair','') ],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
from M
""")
display(df)
Result:
+-----------------------------+-----------------+
|Household_chores_arrangements| Chair|
+-----------------------------+-----------------+
| Wooden|Chair |Wooden|
| Wooden|Cup;Glass|Chair |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+-----------------------------+----------------+
Changed delimiter to | instead of - and made changes in the query aswell. Was expecting results as below, But derived a wrong result
+-----------------------------+-----------------+
|Household_chores_arrangements| Chair|
+-----------------------------+-----------------+
| Wooden|Chair |Wooden|
| Wooden|Cup;Glass|Chair |Glass |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+-----------------------------+----------------+
If delimiter alone is changed,should we need to change any other values?
update - 2
I have got the solution for the above mentioned update.
For pipe delimiter we have to escape them using 4 \
Upvotes: 0
Views: 1002
Reputation: 42422
You can use regexp_extract
to extract the categories, and if no match is found, replace empty string with null using nullif
.
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+
Upvotes: 1