vishwajeet Mane

Reputation: 344

Column not present in PySpark DataFrame?

I have a PySpark DataFrame df whose column names are IP addresses, like below:

summary  `0.0.0.0`  8.8.8.8  1.0.0.0  1.1.1.1
count           14       14       14       14
min            123      231      423       54
max           2344      241      555      100
     

When I call df.columns it returns the column list below, but the backticks around the first column name are missing from the list.

    [0.0.0.0, 8.8.8.8, 1.0.0.0, 1.1.1.1]

And when I perform any operation using this list, I get an error saying column 0.0.0.0 is not present in the DataFrame.

I also tried to rename the columns with the code below, but nothing changes, because the backtick is not actually part of the column names in the list.

  import re
  from pyspark.sql import functions as F

  # strip backticks from the column names (no effect: df.columns contains no backticks)
  df = df.select([F.col(col).alias(re.sub("[`]+", "", col)) for col in df.columns])

How can I resolve this issue?

The schema of df, as shown by df.printSchema(), is:

 root
  |-- summary: string (nullable = true)
  |-- 0.0.0.0: string (nullable = true)
  |-- 8.8.8.8: string (nullable = true)
  |-- 1.0.0.0: string (nullable = true)
  |-- 1.1.1.1: string (nullable = true)

Upvotes: 0

Views: 2222

Answers (1)

pltc

Reputation: 6082

With a number as the first character of the column name, you can always force backticks around it when querying:

df.select('summary', '`0.0.0.0`').show()
# +-------+-------+
# |summary|0.0.0.0|
# +-------+-------+
# |  count|     14|
# |    min|    123|
# |    max|   2344|
# +-------+-------+

df.select(['summary'] + [f'`{col}`' for col in df.columns if col != 'summary']).show()
# +-------+-------+-------+-------+-------+
# |summary|0.0.0.0|8.8.8.8|1.0.0.0|1.1.1.1|
# +-------+-------+-------+-------+-------+
# |  count|     14|     14|     14|     14|
# |    min|    123|    231|    423|     54|
# |    max|   2344|    241|    555|    100|
# +-------+-------+-------+-------+-------+
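
If you would rather not quote the names on every reference, an alternative (just a sketch, not part of the answer above) is to rename the columns once, for example by replacing the dots with underscores via DataFrame.toDF, and then use the plain names from that point on:

# Sketch: rename once so backticks are no longer needed
renamed = df.toDF(*[c.replace('.', '_') for c in df.columns])
renamed.printSchema()
# root
#  |-- summary: string (nullable = true)
#  |-- 0_0_0_0: string (nullable = true)
#  |-- 8_8_8_8: string (nullable = true)
#  |-- 1_0_0_0: string (nullable = true)
#  |-- 1_1_1_1: string (nullable = true)

After the rename, a plain renamed.select('0_0_0_0') should resolve without backticks, since the special handling only applies to dots (and to names used inside SQL expressions).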

Upvotes: 1
