Kaushik Acharya

Reputation: 1590

How to get name of dataframe column in PySpark?

In pandas, this can be done by column.name.

But how to do the same when it's a column of Spark dataframe?

E.g. the calling program has a Spark dataframe: spark_df

>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']

This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.

If it were a pandas dataframe, we could use this:

>>> pandas_df['rank'].name
'rank'

Upvotes: 86

Views: 235208

Answers (8)

ZygD

Reputation: 24386

This should cover even the strangest cases:

  • column with or without an alias
  • several aliases
  • aliases containing multiple words
  • column names surrounded with backticks
  • intentional backticks in aliases

def get_col_name(col):
    # Plain string: it is already the column name
    if isinstance(col, str):
        return col
    # The repr looks like Column<'expr AS alias'>; if the character before the
    # closing quote is not a backtick, the alias is not backtick-quoted,
    # so take whatever follows the last " AS "
    if str(col)[-3] != '`':
        return str(col).split("'")[-2].split(" AS ")[-1]
    # Backtick-quoted alias: unescape doubled backticks, then take the text
    # between the last " AS `" and the closing "`'"
    return str(col).replace('``', '`').split(" AS `")[-1].split("`'")[-2]
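
For illustration, a quick check of the function above (the exact Column repr varies between Spark versions, so treat the expected values as indicative rather than guaranteed):

from pyspark.sql import functions as F

print(get_col_name('rank'))                           # 'rank'
print(get_col_name(F.col('rank')))                    # expected: 'rank'
print(get_col_name(F.col('rank').alias('new rank')))  # expected: 'new rank'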

Upvotes: 4

ciurlaro

Reputation: 1014

Depending on the specific case:

  1. Unaliased columns: column._jc.toString() already works fine (thanks @numeral).

  2. Aliased columns (i.e. column.alias("whatever")): the alias can be extracted (see the sketch after this list):

    • Python

      str(column).split(" AS ")[1].split("`")[1] 
      
    • Scala

      column.toString.split(" AS ")(1).split("`")(1)
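
For illustration, both cases in a rough Python sketch (whether the repr wraps the alias in backticks depends on the Spark version, so the last line may need adjusting):

from pyspark.sql import functions as F

print(F.col('rank')._jc.toString())  # unaliased: 'rank'

aliased = F.col('rank').alias('new rank')
# assumes the repr wraps the alias in backticks, as in the snippet above
print(str(aliased).split(" AS ")[1].split("`")[1])  # expected: 'new rank'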
      

Upvotes: 11

Le Poissons

Reputation: 59

Since none of the answers has been marked as the accepted answer, I may be over-simplifying the OP's ask, but:

my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)

Upvotes: 2

prashant deo

Reputation: 89

# Example when you have multiple tables
loc = '/mnt/'  # or 'whatever_location/' in case of an external table or any folder

table_names = ['customer', 'department']

for name in table_names:
    print(name)  # the existing table name

    # create a dataframe from the table name
    df = spark.read.format('parquet').load(f"{loc}{name.lower()}/")
    for col_name, col_type in df.dtypes:
        print(col_name)  # the column name
        print(col_type)  # the datatype of that column

Upvotes: 0

Yucci Mel

Reputation: 189

I found the answer is very very simple...

// This is Java, but it should be the same in PySpark
Column col = ds.col("colName");          // the column object
String theNameOftheCol = col.toString(); // the column name

The variable theNameOftheCol is "colName".

Upvotes: 4

Pat

Reputation: 737

If you want the column names of your dataframe, you can use the pyspark.sql DataFrame API. I'm not sure whether it supports indexing df.columns by column name; I received this traceback:

>>> df.columns['High']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

However, accessing the columns attribute on your dataframe, which you have done, will return a list of column names:

df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

If you want the column datatypes, you can use the dtypes attribute:

df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]

If you want a particular column, you'll need to access it by index:

df.columns[2] will return 'High'
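
If you only know the name, a small sketch (using the column names above) to recover its positional index first:

idx = df.columns.index('High')  # 2 in this example
print(df.columns[idx])          # 'High'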

Upvotes: 6

numeral

Reputation: 544

The only way is to go down to the underlying JVM level.

df.col._jc.toString().encode('utf8')

This is also how it is converted to a str in the pyspark code itself.

From pyspark/sql/column.py:

def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
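
For illustration, with the dataframe from the question (note that on Python 3 the .encode('utf8') call returns bytes, so drop it if you just want the name as a str):

name = spark_df['rank']._jc.toString()
print(name)  # expected: 'rank'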

Upvotes: 28

David

Reputation: 11583

You can get the names from the schema by doing

spark_df.schema.names

Printing the schema can be useful to visualize it as well

spark_df.printSchema()
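
With the dataframe from the question, for example:

>>> spark_df.schema.names
['admit', 'gre', 'gpa', 'rank']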

Upvotes: 123
