Reputation: 344
I have a PySpark dataframe df with IP addresses as column names, like below:
summary  `0.0.0.0`  8.8.8.8  1.0.0.0  1.1.1.1
count    14         14       14       14
min      123        231      423      54
max      2344       241      555      100
When I run df.columns, it gives me the column list below, but the backquote (the special character around the first column) is missing from the list:
['0.0.0.0', '8.8.8.8', '1.0.0.0', '1.1.1.1']
And when I perform any operation using this list, I get an error that column 0.0.0.0 is not present in the dataframe.
Also, I tried to change the column names with the code below, but nothing changes, because the backquote is not actually in the list:
import re
from pyspark.sql import functions as F

df = df.select([F.col(col).alias(re.sub("[`]+", "", col)) for col in df.columns])
How can I resolve this issue?
The schema of df, after calling df.printSchema(), is:
root
|-- summary: string (nullable = true)
|-- 0.0.0.0: string (nullable = true)
|-- 8.8.8.8: string (nullable = true)
|-- 1.0.0.0: string (nullable = true)
|-- 1.1.1.1: string (nullable = true)
Upvotes: 0
Views: 2222
Reputation: 6082
With numbers as the first character of a column name, you can always force backticks around the name when querying it. The backticks are not part of the name; they are just how Spark SQL quotes identifiers that contain dots or start with digits:
df.select('summary', '`0.0.0.0`').show()
# +-------+-------+
# |summary|0.0.0.0|
# +-------+-------+
# | count| 14|
# | min| 123|
# | max| 2344|
# +-------+-------+
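Without the backticks, Spark parses the dots as struct-field accessors rather than as part of the column name, which is what produces the "column 0.0.0.0 not present" error from the question. A minimal sketch of the failure (the exact exception text varies by Spark version):
# Assumption: this reproduces the asker's error; the message differs across
# versions, e.g. AnalysisException: cannot resolve '0.0.0.0' given input
# columns [summary, 0.0.0.0, 8.8.8.8, 1.0.0.0, 1.1.1.1]
df.select('0.0.0.0').show()
The same backtick trick works for all columns at once: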
df.select(['summary'] + [f'`{col}`' for col in df.columns if col != 'summary']).show()
# +-------+-------+-------+-------+-------+
# |summary|0.0.0.0|8.8.8.8|1.0.0.0|1.1.1.1|
# +-------+-------+-------+-------+-------+
# | count| 14| 14| 14| 14|
# | min| 123| 231| 423| 54|
# | max| 2344| 241| 555| 100|
# +-------+-------+-------+-------+-------+
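The dots are what collide with Spark's nested-field syntax, so another option is to rename the columns once, replacing dots with underscores; DataFrame API lookups then work without backticks. A minimal sketch, assuming the underscore convention is acceptable:
# Rename every column in one pass; DataFrame.toDF accepts the new names positionally.
df_clean = df.toDF(*[c.replace('.', '_') for c in df.columns])
df_clean.printSchema()
# root
#  |-- summary: string (nullable = true)
#  |-- 0_0_0_0: string (nullable = true)
#  |-- 8_8_8_8: string (nullable = true)
#  |-- 1_0_0_0: string (nullable = true)
#  |-- 1_1_1_1: string (nullable = true)
Note that in raw SQL strings (spark.sql(...)) an identifier that starts with a digit still needs backticks; the rename only removes the dot problem.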
Upvotes: 1