Reputation: 402
I have a dataframe df and a column LongColumn that that has rows like this:
FIRST_WORD.ABC.approved.Address.Street.Phone.Email.Category1.Beta.MyResults.New
FIRST_WORD.DEF.Address.Street.Phone.Email.Category2.Alpha.MyResults.New
FIRST_WORD.ABC.Address.Street.Phone.Email.Category6.Gamma.MyResults.New
FIRST_WORD.GHI.approved.Address.Street.Phone.Email.Category1.Alpha.MyResults.New
I did the following to split and create new columns:
out_df=df.withColumn('LongColumn', F.split('LongColumn', '\.')).withColumn('Category', F.expr('LongColumn[size(LongColumn) - 4]')).withColumn('Letter', F.expr('LongColumn[size(LongColumn) - 3]')).withColumn('dataset', F.expr('LongColumn[size(LongColumn) - 2]'))
But I need two extra columns: Status
which contains approved
if the word approved
appears there, and unapproved
if it does not , and Code
which contains the three letters on the second position.
The three latin letters are always on the same position (if I take it top-down), but no matter the direction, the approved
is a problem. Is it possible to solve this similarly or do I need to modify the dataset?
Expected output (assuming I drop the LongColumn
whose values are now arrays):
Code Status Category Letter dataset
ABC approved Category1 Beta MyResults
DEF unapproved Category2 Alpha MyResults
ABC unapproved Category6 Gamma MyResults
GHI approved Category1 Alpha MyResults
Upvotes: 0
Views: 56
Reputation: 146
IMP Note
Spark version >=3.1
Solution
from pyspark.sql import functions as F
data=[("FIRST_WORD.ABC.approved.Address.Street.Phone.Email.Category1.Beta.MyResults.New",),
("FIRST_WORD.DEF.Address.Street.Phone.Email.Category2.Alpha.MyResults.New",),
("FIRST_WORD.ABC.Address.Street.Phone.Email.Category6.Gamma.MyResults.New",),
("FIRST_WORD.GHI.approved.Address.Street.Phone.Email.Category1.Alpha.MyResults.New",)]
schema=["LongColumn"]
df = spark.createDataFrame(data=data,schema=schema)
out_df=df.withColumn('LongColumn', F.split('LongColumn', '\.'))\
.withColumn('Category', F.expr('LongColumn[size(LongColumn) - 4]'))\
.withColumn('Letter', F.expr('LongColumn[size(LongColumn) - 3]'))\
.withColumn('dataset', F.expr('LongColumn[size(LongColumn) - 2]'))\
.withColumn('status_val',F.array(F.lit("approved"),F.lit("unapproved")))\
.withColumn("status",F.filter("LongColumn",lambda x:F.array_contains("status_val",x)))\
.withColumn("status",F.when(F.size(F.col("status"))==1,F.col("status")[0]).otherwise(F.lit("unapproved")))\
.drop("status_val")
out_df.show(10,0)
output
+--------------------------------------------------------------------------------------------+---------+------+---------+----------+
|LongColumn |Category |Letter|dataset |status |
+--------------------------------------------------------------------------------------------+---------+------+---------+----------+
|[FIRST_WORD, ABC, approved, Address, Street, Phone, Email, Category1, Beta, MyResults, New] |Category1|Beta |MyResults|approved |
|[FIRST_WORD, DEF, Address, Street, Phone, Email, Category2, Alpha, MyResults, New] |Category2|Alpha |MyResults|unapproved|
|[FIRST_WORD, ABC, Address, Street, Phone, Email, Category6, Gamma, MyResults, New] |Category6|Gamma |MyResults|unapproved|
|[FIRST_WORD, GHI, approved, Address, Street, Phone, Email, Category1, Alpha, MyResults, New]|Category1|Alpha |MyResults|approved |
+--------------------------------------------------------------------------------------------+---------+------+---------+----------+
Do let me know if you have any questions around the same.
kindly upvote if you like my Solution
Upvotes: 1