Joel AMEDON

Reputation: 1

pyspark: for-loop calculations over columns

Does anyone know how I can do these calculations in PySpark?

import pandas as pd

data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'Age': [20, 21, 19, 18],
    'CSP': [2, 6, 8, 7],
    'coef': [2, 2, 3, 3]
}

# Create DataFrame
df = pd.DataFrame(data)
colsToRecalculate = ['Age', 'CSP']

# Divide each listed column by the 'coef' column
for c in colsToRecalculate:
    df[c] = df[c] / df["coef"]

Upvotes: 0

Views: 246

Answers (2)

samkart

Reputation: 6644

A slight variation on bzu's answer, which selects the non-listed columns manually within the select. We can iterate over dataframe.columns and check each column against the colsToRecalculate list: if the column is in the list, apply the calculation; otherwise, keep the column as is.

from pyspark.sql import functions as func

# Recalculate listed columns; pass all other columns through unchanged
data_sdf. \
    select(*[(func.col(k) / func.col('coef')).alias(k) if k in colsToRecalculate else k for k in data_sdf.columns])
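Neither snippet shows how data_sdf was created. As a minimal sketch, assuming a local SparkSession and the data dict from the question, one way to build it is:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert the column-oriented dict to rows: [('Tom', 20, 2, 2), ...]
data_sdf = spark.createDataFrame(list(zip(*data.values())), schema=list(data.keys()))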

Upvotes: 0

bzu

Reputation: 1594

You can use select() on a Spark dataframe and pass multiple columns (with different calculations) as parameters. In your case:

import pandas as pd
from pyspark.sql import functions as F

df2 = spark.createDataFrame(pd.DataFrame(data))
df2.select(*[(F.col(c) / F.col('coef')).alias(c) for c in colsToRecalculate], 'coef').show()
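Note that this select keeps only the recalculated columns plus coef, so Name is dropped from the result. If every original column should survive, a minimal sketch using withColumn (assuming the same df2, F, and colsToRecalculate as above):

# Replace each listed column in place; columns not listed are left untouched
result = df2
for c in colsToRecalculate:
    result = result.withColumn(c, F.col(c) / F.col('coef'))
result.show()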

Upvotes: 1
