Reputation: 1590
In pandas, this can be done by column.name.
But how do I do the same when it's a column of a Spark dataframe?
E.g. the calling program has a Spark dataframe: spark_df
>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']
This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.
If it were a pandas dataframe, we could use this:
>>> pandas_df['rank'].name
'rank'
Upvotes: 86
Views: 235208
Reputation: 24386
This should cover even the strangest cases:
def get_col_name(col):
    # Plain strings are already column names
    if isinstance(col, str):
        return col
    # No backtick-quoted alias in the Column's repr: take the (possibly aliased) name
    # from between the quotes, keeping whatever follows " AS " if present
    if str(col)[-3] != '`':
        return str(col).split("'")[-2].split(" AS ")[-1]
    # Backtick-quoted alias: unescape doubled backticks and pull out the alias
    return str(col).replace('``', '`').split(" AS `")[-1].split("`'")[-2]
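A quick usage sketch with the question's spark_df (the alias 'rnk' is just an illustrative name, and the Column repr this parses can differ slightly between Spark versions):
from pyspark.sql import functions as F
get_col_name('rank')                        # 'rank' (plain strings are returned as-is)
get_col_name(spark_df['rank'])              # 'rank'
get_col_name(F.col('rank').alias('rnk'))    # 'rnk'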
Upvotes: 4
Reputation: 1014
Depending on the specific case:
Unaliased columns: column._jc.toString() already works fine (thanks @numeral).
Aliased columns (i.e. column.alias("whatever")): the alias can be extracted:
Python
str(column).split(" AS ")[1].split("`")[1]
Scala
str(column).split(" AS ")(1).split("`")(1)
Upvotes: 11
Reputation: 59
Since none of the answers has been marked as the answer, I may be over-simplifying the OP's ask, but:
my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)
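With the DataFrame from the question this prints one name per line:
admit
gre
gpa
rank
Each StructField also exposes the column's type via field.dataType, should you need it.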
Upvotes: 2
Reputation: 89
# table names as an example if you have multiple
loc = '/mnt/tablename'  # or 'whatever_location/table_name' in case of an external table or any folder
table_name = ['customer', 'department']
for i in table_name:
    print(i)  # print the existing table name
    df = spark.read.format('parquet').load(f"{loc}{i.lower()}/")  # create a DataFrame from the table
    for col in df.dtypes:
        print(col[0])  # column name, as available
        print(col[1])  # datatype of the respective column
Upvotes: 0
Reputation: 189
I found the answer is very simple...
// It is in Java, but it should be the same in PySpark
Column col = ds.col("colName");           // the Column object
String theNameOftheCol = col.toString();  // the column's name
The variable theNameOftheCol is "colName".
Upvotes: 4
Reputation: 737
If you want the column names of your dataframe, you can use the pyspark.sql DataFrame class. I'm not sure the API supports explicitly indexing a DataFrame by column name; I received this traceback:
>>> df.columns['High']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str
However, accessing the columns attribute on your dataframe, as you have done, returns a list of column names:
df.columns
will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
If you want the column datatypes, you can use the dtypes attribute:
df.dtypes
will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]
If you want a particular column, you'll need to access it by index:
df.columns[2]
will return 'High'
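Since df.columns is a plain Python list, going from a name back to its position is just list.index (using the columns listed above):
df.columns.index('High')   # 2
df.columns[2]              # 'High'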
Upvotes: 6
Reputation: 544
The only way is to drop down a level to the underlying JVM object.
df.col._jc.toString().encode('utf8')
This is also how it is converted to a str in the pyspark code itself.
From pyspark/sql/column.py:
def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
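As an aside, the .encode('utf8') reflects Python 2 era code; under Python 3, _jc.toString() already returns a str, so with the question's spark_df a plain call is enough:
spark_df['rank']._jc.toString()   # 'rank'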
Upvotes: 28
Reputation: 11583
You can get the names from the schema by doing
spark_df.schema.names
Printing the schema can be useful to visualize it as well
spark_df.printSchema()
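With the DataFrame from the question, that looks roughly like this (expected output shown as comments):
spark_df.schema.names   # ['admit', 'gre', 'gpa', 'rank']
spark_df.printSchema()  # prints each column with its type and nullability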
Upvotes: 123