How can I group my data frame based on conditions on a column

Question

I have a data frame like this:

Date	Version	Value	Name
Jan 1	123.1	3	A
Jan 2	123.23	5	A
Jan 1	223.1	6	B
Jan 2	623.23	7	B

I want to group the table for 'Version' with the same prefix (everything from the first letter to there is the .. And for the Value, it selects the values using the row with the longest string length of version. And for the `Name' column, it uses any of the rows with the same prefix.

Version Prefix	Value	Name
123	5	A
223	6	B
623	7	B

Meaning version 123.1 and 123.23 has the same prefix '123', so both rows become 1 row in the result. And 'Value' equals to 5 since row with Version 123.23 (the row with the longest Version has 5 as Value.

wwnde · Accepted Answer

  (df.withColumn('Version Prefix', split('Version','\.')[0])#Create new column
 .withColumn('size', size(split(split('Version','\.')[1],'(?!$)')))#Calculate the size of the suffixes
 .withColumn('max', max('size').over(Window.partitionBy('Version Prefix','Name')))#Find the suffix with the maximum size
 .where(col('size')==col('max'))#Filter out max suffixes
 .drop('Date','size','max','Version')#Drop unwanted columns
).show()

+-----+----+--------------+
|Value|Name|Version Prefix|
+-----+----+--------------+
|    5|   A|           123|
|    6|   B|           223|
|    7|   B|           623|
+-----+----+--------------+

How can I group my data frame based on conditions on a column

Answers (1)

Related Questions