user15649753
user15649753

Reputation: 523

how to use groupby with respect of a row?

I want to do groupby with respect of a row:

col1  col2    col3   col4    col5
'A'    'B'      'B'   'A'     'B'
2       4        3     2       1
0       1        2     4       0
1       1        1     1       1

I want to filter this dataframe if two columns have the same values in the first row and put the average of them.

For example the first and the forth columns have the same value 'A` in the first row and the average of those row is :

first row: (2+2)/2=2
second row: (0+4)/2=2
third row: (1+1)/2=1

the same for columns with value B. so the output is

col1   col2
'A'      'B'
2        7/3
2        3/3
1        3/3

Upvotes: 0

Views: 103

Answers (1)

Emma
Emma

Reputation: 9308

As I mentioned in comment, having string and integer mixed in in a single column is not as practical. It is better to have the letter row as part of the column name or pivot this dataframe to have the letter row as a column.

I will show how you can make the letter row as part of column name and get average from there.

df = spark.createDataFrame([
    ['A', 'B', 'B', 'A', 'B'],
    ['2', '4', '3', '2', '1'],
    ['0', '1', '2', '4', '0'],
    ['1', '1', '1', '1', '1']
], ['col1', 'col2', 'col3', 'col4', 'col5'])

letter_row = df.filter(df.col1.rlike('[^\d]')).take(1)[0]
new_cols = [f'{letter_row[x]}_{x}' for x in letter_row.asDict()]

df = df.filter(df.col1.rlike('\d+')).toDF(*new_cols)

# df.show()
# +------+------+------+------+------+
# |A_col1|B_col2|B_col3|A_col4|B_col5|
# +------+------+------+------+------+
# |     2|     4|     3|     2|     1|
# |     0|     1|     2|     4|     0|
# |     1|     1|     1|     1|     1|
# +------+------+------+------+------+

acols = [x for x in df.columns if x.startswith('A_')]
bcols = [x for x in df.columns if x.startswith('B_')]

df = (df.withColumn('A_avg', sum(F.col(x) for x in acols) / len(acols))
      .withColumn('B_avg', sum(F.col(x) for x in bcols) / len(bcols)))

Result

+------+------+------+------+------+-----+-----+
|A_col1|B_col2|B_col3|A_col4|B_col5|A_avg|B_avg|
+------+------+------+------+------+-----+-----+
|     2|     4|     3|     2|     1|  2.0| 2.66|
|     0|     1|     2|     4|     0|  2.0|  1.0|
|     1|     1|     1|     1|     1|  1.0|  1.0|
+------+------+------+------+------+-----+-----+

Upvotes: 1

Related Questions