Reputation: 17794
In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings? For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
What are the advantages of using column objects instead of strings in PySpark?
Upvotes: 2
Views: 1745
Reputation: 15258
It depends on how the functions are implemented in Scala.
In Scala, the signature of a function is part of the function itself. For example, func(foo: String)
and func(bar: Int)
are two different functions, and Scala can tell the difference between them based on the type of the argument you pass.
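To make the overload idea concrete in Python terms, here is a rough analogue using functools.singledispatch. This is an illustration only; PySpark itself does not use singledispatch, and the Column class and lower function here are toy stand-ins, not PySpark's internals:

```python
from functools import singledispatch

class Column:
    """Toy stand-in for a PySpark Column object."""
    def __init__(self, name):
        self.name = name

@singledispatch
def lower(col):
    # No overload registered for this argument type
    raise TypeError(f"no overload of lower() for {type(col).__name__}")

@lower.register
def _(col: Column):
    return f"lower({col.name})"

@lower.register
def _(col: str):
    # The string overload simply wraps the name in a Column first
    return lower(Column(col))

print(lower(Column("col_name")))  # lower(col_name)
print(lower("col_name"))          # lower(col_name)
```

Both calls resolve to the same behavior, but only because an overload for str was explicitly provided, which mirrors the situation described below: a string argument works only if the corresponding signature exists.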
F.col('col_name')
, df['col_name']
and df.col_name
are all the same type of object: a Column. It makes almost no difference which syntax you use. One small difference is that you could write, for example:
df_2.select(F.lower(df.col_name))  # where the column is from another dataframe
# Spoiler alert: it may raise an error!
When you call df.select(F.lower('col_name'))
, if a signature like lower(smth: str)
is not defined in Scala, then you will get an error. Some functions are defined with str as input, while others accept only Column objects. Try it to see whether it works, and use it if so; otherwise, you can open a pull request on the Spark project to add the new signature.
Upvotes: 2
Reputation: 2003
Read this PySpark style guide from Palantir here, which explains when to use F.col()
(and when not to), along with other best practices.
Git link here
In many situations the first style can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations, that lead us to prefer the second style:
- If the dataframe variable name is large, expressions involving it quickly become unwieldy;
- If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA']
is just as difficult to write as F.col('colA')
;
- Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
- Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages use of short and non-descriptive variable names for the dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA')
will always reference a column designated colA in the dataframe being operated on (named df, in this case). It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance," which is often challenging to debug.
Upvotes: 3