Reputation: 155
I have a data frame like this:
Date | Version | Value | Name |
---|---|---|---|
Jan 1 | 123.1 | 3 | A |
Jan 2 | 123.23 | 5 | A |
Jan 1 | 223.1 | 6 | B |
Jan 2 | 623.23 | 7 | B |
I want to group the table for 'Version' with the same prefix (everything from the first letter to there is the .
. And for the Value
, it selects the values using the row with the longest string length of version. And for the `Name' column, it uses any of the rows with the same prefix.
Version Prefix | Value | Name |
---|---|---|
123 | 5 | A |
223 | 6 | B |
623 | 7 | B |
Meaning version 123.1 and 123.23 has the same prefix '123', so both rows become 1 row in the result. And 'Value' equals to 5 since row with Version 123.23 (the row with the longest Version has 5 as Value.
Upvotes: 1
Views: 29
Reputation: 26676
(df.withColumn('Version Prefix', split('Version','\.')[0])#Create new column
.withColumn('size', size(split(split('Version','\.')[1],'(?!$)')))#Calculate the size of the suffixes
.withColumn('max', max('size').over(Window.partitionBy('Version Prefix','Name')))#Find the suffix with the maximum size
.where(col('size')==col('max'))#Filter out max suffixes
.drop('Date','size','max','Version')#Drop unwanted columns
).show()
+-----+----+--------------+
|Value|Name|Version Prefix|
+-----+----+--------------+
| 5| A| 123|
| 6| B| 223|
| 7| B| 623|
+-----+----+--------------+
Upvotes: 1