Reputation: 523
I want to do a groupby with respect to a row:
col1 col2 col3 col4 col5
'A' 'B' 'B' 'A' 'B'
2 4 3 2 1
0 1 2 4 0
1 1 1 1 1
I want to average together the columns of this dataframe whenever they have the same value in the first row. For example, the first and the fourth columns both have the value 'A' in the first row, and their row-by-row averages are:
first row: (2+2)/2=2
second row: (0+4)/2=2
third row: (1+1)/2=1
The same goes for the columns with value 'B', so the output is:
col1 col2
'A' 'B'
2 8/3
2 3/3
1 3/3
Upvotes: 0
Views: 103
Reputation: 9308
As I mentioned in a comment, having strings and integers mixed in a single column is not practical. It is better to make the letter row part of the column names, or to pivot this dataframe so that the letter row becomes a column.
I will show how to make the letter row part of the column names and compute the averages from there.
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ['A', 'B', 'B', 'A', 'B'],
    ['2', '4', '3', '2', '1'],
    ['0', '1', '2', '4', '0'],
    ['1', '1', '1', '1', '1']
], ['col1', 'col2', 'col3', 'col4', 'col5'])

# Take the non-numeric (letter) row and prefix each column name with its letter.
letter_row = df.filter(df.col1.rlike(r'[^\d]')).take(1)[0]
new_cols = [f'{letter_row[x]}_{x}' for x in letter_row.asDict()]

# Keep only the numeric rows, renamed to the letter-prefixed columns.
df = df.filter(df.col1.rlike(r'\d+')).toDF(*new_cols)
# df.show()
# +------+------+------+------+------+
# |A_col1|B_col2|B_col3|A_col4|B_col5|
# +------+------+------+------+------+
# | 2| 4| 3| 2| 1|
# | 0| 1| 2| 4| 0|
# | 1| 1| 1| 1| 1|
# +------+------+------+------+------+
# Average the A-prefixed and the B-prefixed columns separately.
acols = [x for x in df.columns if x.startswith('A_')]
bcols = [x for x in df.columns if x.startswith('B_')]
df = (df.withColumn('A_avg', sum(F.col(x) for x in acols) / len(acols))
      .withColumn('B_avg', sum(F.col(x) for x in bcols) / len(bcols)))
Result:
+------+------+------+------+------+-----+-----+
|A_col1|B_col2|B_col3|A_col4|B_col5|A_avg|B_avg|
+------+------+------+------+------+-----+-----+
| 2| 4| 3| 2| 1| 2.0| 2.66|
| 0| 1| 2| 4| 0| 2.0| 1.0|
| 1| 1| 1| 1| 1| 1.0| 1.0|
+------+------+------+------+------+-----+-----+
Upvotes: 1