Reputation: 363
I have a table like this:
a | a_vector |
1 | 710.83;-776.98;-10.86;2013.02;-896.28; |
2 | 3 ; 2 ; 1 |
Using PySpark/pandas, how do I dynamically create columns so that first values in vector go to "col1" and second values go to "col2" etc. + calculate the sum?
a | a_vector | col1 | col2 | col3
1 | 300;-200;2022; | 300 | -200 | 2022
2 | 3 ; 2 ; 1 | 3 | 2 | 1
The final requirement is to have the sums of new columns sorted in one column.
Upvotes: 1
Views: 726
Reputation: 24356
In PySpark, you can do it by first splitting your string on ;
(creating an array) and then selecting columns using list comprehension. The sum can be calculated using aggregate
higher-order function.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('1', '300;-200;2022'),
('2', '3 ; 2 ; 1')],
['a', 'a_vector']
)
Script:
col_arr = F.split('a_vector', '\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.select(
'*',
*[col_arr[i].alias(f'col{i}') for i in range(max_size)],
F.aggregate(col_arr, F.lit(0.0), lambda acc, x: acc + x).alias('sum')
)
df.show()
# +---+-------------+----+----+----+------+
# | a| a_vector|col0|col1|col2| sum|
# +---+-------------+----+----+----+------+
# | 1|300;-200;2022| 300|-200|2022|2122.0|
# | 2|3 ; 2 ; 1| 3| 2| 1| 6.0|
# +---+-------------+----+----+----+------+
Based on the comments, the following is probably what you need. Calculating max_size of array elements, calculating column-wise sums and sorting the results.
col_arr = F.split('a_vector', '\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.agg(F.array_sort(F.array([F.sum(col_arr[i]) for i in range(max_size)])).alias('sum'))
df.show(truncate=0)
# +-----------------------+
# |sum |
# +-----------------------+
# |[-198.0, 303.0, 2023.0]|
# +-----------------------+
Upvotes: 2
Reputation: 4062
with names
dfs = df['a_vector'].str.split(';', expand=True).rename(columns = lambda x: "col"+str(x+1))
df =pd.concat([df, dfs], axis=1)
print(df)
inputs
a a_vector
0 1 300;-200;2022
1 2 3;2;1
Output
a a_vector col1 col2 col3
0 1 300;-200;2022 300 -200 2022
1 2 3;2;1 3 2 1
Note: Not tested..Let me know if deosn't work
Upvotes: 1