Reputation: 2231
I have a PySpark dataframe with a lot of columns, and I want to select the ones whose names contain a certain string, plus some others. For example:
df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']
I want to select the ones which contain 'hello' and also the column named 'index', so the result will be:
['hello_world','hello_country','hello_everyone','index']
I want something like df.select('hello*','index')
Thanks in advance :)
EDIT:
I found a quick way to solve it, so I answered myself, Q&A style. If someone sees my solution and can provide a better one, I would appreciate it.
Upvotes: 9
Views: 27289
Reputation: 347
I used Manrique's answer and improvised.
from pyspark.sql import functions as F

sel_cols = [i for i in df.columns if i.startswith("colName")]
# keep every existing column, and add a renamed copy of each matching one
df = df.select('*', *(F.col(x).alias('rename_text' + x) for x in sel_cols))
Upvotes: 0
Reputation: 3110
You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column names as a regular expression.
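For reference, a minimal sketch of how that could look with the columns from the question (the regex passed to colRegex is wrapped in backticks; the single row of dummy values is just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# dummy one-row dataframe with the question's column names
df = spark.createDataFrame(
    [(1, 2, 3, 4, 5, 6)],
    ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index'],
)

# select every column starting with 'hello', plus 'index'
df.select(df.colRegex("`(hello.*|index)`")).show()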
Upvotes: 6
Reputation: 2231
I've found a quick and elegant way:
selected = [s for s in df.columns if 'hello' in s]+['index']
df.select(selected)
With this solution I can add any other columns I want without editing the for loop that Ali AzG suggested.
Upvotes: 18
Reputation: 1983
This sample code does what you want:
hello_cols = []
for col in df.columns:
    if ('index' in col) or ('hello' in col):
        hello_cols.append(col)
df.select(*hello_cols)
Upvotes: 1