n179911a
n179911a

Reputation: 155

How can I group my data frame based on conditions on a column

I have a data frame like this:

Date Version Value Name
Jan 1 123.1 3 A
Jan 2 123.23 5 A
Jan 1 223.1 6 B
Jan 2 623.23 7 B

I want to group the table for 'Version' with the same prefix (everything from the first letter to there is the .. And for the Value, it selects the values using the row with the longest string length of version. And for the `Name' column, it uses any of the rows with the same prefix.

Version Prefix Value Name
123 5 A
223 6 B
623 7 B

Meaning version 123.1 and 123.23 has the same prefix '123', so both rows become 1 row in the result. And 'Value' equals to 5 since row with Version 123.23 (the row with the longest Version has 5 as Value.

Upvotes: 1

Views: 29

Answers (1)

wwnde
wwnde

Reputation: 26676

  (df.withColumn('Version Prefix', split('Version','\.')[0])#Create new column
 .withColumn('size', size(split(split('Version','\.')[1],'(?!$)')))#Calculate the size of the suffixes
 .withColumn('max', max('size').over(Window.partitionBy('Version Prefix','Name')))#Find the suffix with the maximum size
 .where(col('size')==col('max'))#Filter out max suffixes
 .drop('Date','size','max','Version')#Drop unwanted columns
).show()

+-----+----+--------------+
|Value|Name|Version Prefix|
+-----+----+--------------+
|    5|   A|           123|
|    6|   B|           223|
|    7|   B|           623|
+-----+----+--------------+

Upvotes: 1

Related Questions