I'm having trouble dropping a Spark DataFrame column whose name contains a period. I know that such column names need to be escaped with backticks (`). This works when I select columns, and I've written a small static function to escape all such column names:
@staticmethod
def escape(columns):
    return ["`" + col + "`" if "." in col else col for col in columns]
This can then be used to build my desired list of columns to select:
desired_columns = MySparkClass.escape(
    list(filter(lambda col: re.search('targetRegexStuffHere', col), target_df.columns))
)
filtered_df = target_df.select(desired_columns)
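For instance, run against the column names of the sample DataFrame defined below, the helper leaves dot-free names alone and wraps the rest in backticks:
MySparkClass.escape(['index', 'label.X.L.'])  # ['index', '`label.X.L.`']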
Using a trivial, reproducible example:
same = sqlContext.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label.X.L.', 'date')
)
print(same.select('`label.X.L.`').collect())
Output here is:
[Row(label.X.L.='A'), Row(label.X.L.='B'), Row(label.X.L.='A'), Row(label.X.L.='B')]
However, removing the backticks results in an AnalysisException.
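For reference, the failing call is simply the unescaped variant of the select above:
same.select('label.X.L.')  # no backticks; analysis fails immediately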
pyspark.sql.utils.AnalysisException: 'syntax error in attribute name: label.X.L.;'
When I attempt to drop the label.X.L. column, however, the backticks appear to make no difference:
print(same.drop('`label.X.L.`').collect())
Output is:
[Row(index=1, X=1, label.X.L.='A', date='2017-01-01'),
Row(index=2, X=3, label.X.L.='B', date='2017-01-02'),
Row(index=3, X=5, label.X.L.='A', date='2017-01-03'),
Row(index=4, X=7, label.X.L.='B', date='2017-01-04')]
What is the proper way to drop a column that contains a period within its name?
Upvotes: 1
Views: 2367
The syntax for specifying which columns to use is slightly different for select() and for drop(). When your column name contains a period, select() requires backticks:
same.select('`label.X.L.`') # note the backticks
However, when you attempt to drop the column, leave the backticks out:
same.drop('label.X.L.') # note the absence of the backticks
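As far as I can tell, the difference is that select() parses its argument as a column expression (so dots must be escaped to avoid being read as nested-field access), while drop() matches the string you pass against the literal column names of the DataFrame. A minimal sketch against the same DataFrame from the question:

# select() parses its argument, so the dots need backtick escaping
same.select('`label.X.L.`').show()

# drop() matches the literal column name, so the raw name works
dropped = same.drop('label.X.L.')
print(dropped.columns)  # ['index', 'X', 'date']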
Upvotes: 2