johnnydoe

Reputation: 402

Splitting a column to create new columns when the number of elements varies a lot

I have a DataFrame df and a column LongColumn that has rows like this:

FIRST_WORD.ABC.approved.Address.Street.Phone.Email.Category1.Beta.MyResults.New
FIRST_WORD.DEF.Address.Street.Phone.Email.Category2.Alpha.MyResults.New
FIRST_WORD.ABC.Address.Street.Phone.Email.Category6.Gamma.MyResults.New
FIRST_WORD.GHI.approved.Address.Street.Phone.Email.Category1.Alpha.MyResults.New

I did the following to split and create new columns:

out_df = (
    df.withColumn('LongColumn', F.split('LongColumn', r'\.'))
    .withColumn('Category', F.expr('LongColumn[size(LongColumn) - 4]'))
    .withColumn('Letter', F.expr('LongColumn[size(LongColumn) - 3]'))
    .withColumn('dataset', F.expr('LongColumn[size(LongColumn) - 2]'))
)

But I need two extra columns: Status, which contains approved if the word approved appears in the row and unapproved if it does not, and Code, which contains the three letters in the second position.

The three Latin letters are always in the same position counting from the front, but the optional approved is a problem no matter which direction I count from. Is it possible to solve this in a similar way, or do I need to modify the dataset?

Expected output (assuming I drop the LongColumn whose values are now arrays):

Code     Status        Category     Letter     dataset
ABC      approved      Category1    Beta       MyResults
DEF      unapproved    Category2    Alpha      MyResults
ABC      unapproved    Category6    Gamma      MyResults
GHI      approved      Category1    Alpha      MyResults

Upvotes: 0

Views: 56

Answers (1)

Deku07

Reputation: 146

Note
This solution requires Spark >= 3.1: F.filter with a Python lambda was only added to the Python API in 3.1.


Solution
from pyspark.sql import functions as F

data = [("FIRST_WORD.ABC.approved.Address.Street.Phone.Email.Category1.Beta.MyResults.New",),
        ("FIRST_WORD.DEF.Address.Street.Phone.Email.Category2.Alpha.MyResults.New",),
        ("FIRST_WORD.ABC.Address.Street.Phone.Email.Category6.Gamma.MyResults.New",),
        ("FIRST_WORD.GHI.approved.Address.Street.Phone.Email.Category1.Alpha.MyResults.New",)]
schema = ["LongColumn"]
df = spark.createDataFrame(data=data, schema=schema)

out_df = (
    df.withColumn('LongColumn', F.split('LongColumn', r'\.'))
    .withColumn('Category', F.expr('LongColumn[size(LongColumn) - 4]'))
    .withColumn('Letter', F.expr('LongColumn[size(LongColumn) - 3]'))
    .withColumn('dataset', F.expr('LongColumn[size(LongColumn) - 2]'))
    # candidate status values to look for in the split array
    .withColumn('status_val', F.array(F.lit("approved"), F.lit("unapproved")))
    # keep only the array elements that match one of the status values
    .withColumn("status", F.filter("LongColumn", lambda x: F.array_contains("status_val", x)))
    # exactly one match -> use it; otherwise default to "unapproved"
    .withColumn("status", F.when(F.size(F.col("status")) == 1, F.col("status")[0])
                           .otherwise(F.lit("unapproved")))
    .drop("status_val")
)
out_df.show(10, truncate=False)

Output
+--------------------------------------------------------------------------------------------+---------+------+---------+----------+
|LongColumn                                                                                  |Category |Letter|dataset  |status    |
+--------------------------------------------------------------------------------------------+---------+------+---------+----------+
|[FIRST_WORD, ABC, approved, Address, Street, Phone, Email, Category1, Beta, MyResults, New] |Category1|Beta  |MyResults|approved  |
|[FIRST_WORD, DEF, Address, Street, Phone, Email, Category2, Alpha, MyResults, New]          |Category2|Alpha |MyResults|unapproved|
|[FIRST_WORD, ABC, Address, Street, Phone, Email, Category6, Gamma, MyResults, New]          |Category6|Gamma |MyResults|unapproved|
|[FIRST_WORD, GHI, approved, Address, Street, Phone, Email, Category1, Alpha, MyResults, New]|Category1|Alpha |MyResults|approved  |
+--------------------------------------------------------------------------------------------+---------+------+---------+----------+
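
The question also asks for a Code column. Since the three-letter code is always the second element of the split array, it can be read positionally; here is a minimal sketch building on out_df above (the final_df name is just for illustration):

# Code is always the second element (index 1) of the split array;
# dropping the array column afterwards matches the expected output.
final_df = (
    out_df
    .withColumn("Code", F.col("LongColumn")[1])
    .drop("LongColumn")
)
final_df.show(10, truncate=False)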

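If you are on a Spark version below 3.1, where F.filter with a Python lambda is not available, a plain F.array_contains check does the same job here, since only the presence of approved matters. A sketch under that assumption (older_df is a hypothetical name):

older_df = (
    df.withColumn('LongColumn', F.split('LongColumn', r'\.'))
    # only the presence of "approved" matters, so array_contains is enough
    .withColumn("status",
                F.when(F.array_contains("LongColumn", "approved"), F.lit("approved"))
                 .otherwise(F.lit("unapproved")))
)
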
Do let me know if you have any questions.

Kindly upvote if you like my solution.

Upvotes: 1
