PySpark agg tells me there are wrong characters in the column name, but the names seem correct

Question

I use Spark 2.3.2 and I want to aggregate two columns, but the .agg() function tells me there is a problem with the column names and I don't see it.

Some pseudocode with the actual column names:

df = spark.read.parquet('./my_files')

[... doing some stuff with the data everything works fine ...]

df2 = df.groupBy('AD_ID').agg({'pagerank': 'sum', 'pagerankRAW': 'sum'})

When I do that, Spark throws this exception: AnalysisException: 'Attribute name "sum(pagerankRAW)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;' But I don't see the invalid characters: there are only letters in my column name. When I delete 'pagerankRAW':'sum' from the dict, I get the same error, but this time for sum(pagerank).

So what am I doing wrong?

Josselin G. · Accepted Answer

It looks like a weird one; PySpark should be able to handle parentheses.

I use a different syntax when I use agg() though.

I'd use .agg(sum("pagerank"), sum("pagerankRAW")), with sum imported from pyspark.sql.functions, and I don't get this error.

I don't think you can use alias() with your dict syntax, because I don't see where to place it.

With alias: .agg(sum("pagerank").alias("pagerank"), sum("pagerankRAW").alias("pagerankRAW"))
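
If you would rather keep the dict style from the question, here is a minimal sketch of one way to do it, not a definitive fix. It turns the dict into aliased column expressions, so the result never carries a name like "sum(pagerankRAW)". The helper names alias_plan and aggregate_with_aliases are made up for this example (they are not part of PySpark), and I am assuming the grouping column is the string 'AD_ID'. The set of invalid characters is copied from the error message.

```python
import re

# Characters that the error message lists as invalid in an attribute name.
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def alias_plan(agg_dict):
    """Turn {'column': 'function'} into (column, function, safe_alias) tuples.

    The alias starts from the default-style name, e.g. 'sum(pagerank)', with
    each invalid character replaced by '_' and outer underscores stripped.
    (Hypothetical helper, not a PySpark function.)
    """
    plan = []
    for column, func in agg_dict.items():
        default_name = "{}({})".format(func, column)
        alias = INVALID_CHARS.sub("_", default_name).strip("_")
        plan.append((column, func, alias))
    return plan


def aggregate_with_aliases(df, key, agg_dict):
    """Group df by key and apply agg_dict, giving every result a safe name.

    (Hypothetical helper; assumes each function name in agg_dict exists in
    pyspark.sql.functions, such as 'sum', 'avg', 'min' or 'max'.)
    """
    # Imported here so alias_plan can be used without a Spark installation.
    from pyspark.sql import functions as F

    exprs = [
        getattr(F, func)(column).alias(alias)
        for column, func, alias in alias_plan(agg_dict)
    ]
    return df.groupBy(key).agg(*exprs)
```

With the DataFrame from the question, the call would be df2 = aggregate_with_aliases(df, 'AD_ID', {'pagerank': 'sum', 'pagerankRAW': 'sum'}), which should give the columns AD_ID, sum_pagerank and sum_pagerankRAW. Importing the functions module as F also avoids shadowing Python's built-in sum.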
