tomasn4a

Reputation: 615

Chaining string operations on Pandas Series

I recently found out about the str accessor for Pandas Series and it's great! However, if I want to chain operations (say, a couple of replaces and a strip), I need to keep calling str after every operation, which doesn't make for the most elegant code.

For example, let's say my column names contain spaces and periods and I want to replace them with underscores. I might also want to strip any leftover underscores. If I wanted to do this using str methods, is there any way to avoid having to run:

df.columns.str.replace(' ', '_').str.replace('.', '_').str.strip('_')

Thanks!

Upvotes: 8

Views: 3929

Answers (3)

Maciej Skorski

Reputation: 3354

Let me add my two cents to improve the answers:

I was just curious if we could chain str operations together

We would like to have something like tweet.str.replace('@', '').strip().lower(). In fact, we could hope that a chain of operations could be optimized even further (compiled) into something like tweet.str.replace_strip_lower_combined.

While this is perfectly reasonable, the current API only processes one operation at a time and doesn't support such combining.
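
As a rough sketch of what "combining" can look like today: make one pass over the values with Series.map and chain plain Python str methods inside the mapped function. The sample data and the clean helper are my own assumptions for illustration, not part of pandas.

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
tweet = pd.Series(["  @Alice Hello ", "@bob HI ", None])


def clean(text):
    # Plain Python str methods chain naturally on a single string.
    return text.replace("@", "").strip().lower()


# na_action="ignore" skips missing values instead of passing them to clean().
combined = tweet.map(clean, na_action="ignore")

# The same steps with the .str accessor, one operation per call.
# regex=False treats "@" as a literal string rather than a pattern.
chained = tweet.str.replace("@", "", regex=False).str.strip().str.lower()

print(combined.tolist()[:2])
```

The trade-off is that the mapped function has to deal with missing values itself (here via na_action), whereas the .str methods propagate them automatically.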

Why not use list comprehensions

Because of performance: pd.Series.str offers vectorized string functions.
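
If you want to check this on your own data, a minimal timing sketch could look like the following. The synthetic labels, their count, and the repeat count are arbitrary assumptions, and the timings depend heavily on the pandas version and data size, so no numbers are quoted here.

```python
import re
import timeit

import pandas as pd

# Synthetic column-like labels; the size is arbitrary.
labels = pd.Index([f" col.{i} name_" for i in range(10_000)])

# Precompile the pattern once for the list-comprehension variant.
pattern = re.compile(r"[\s.]")


def with_str_accessor():
    return labels.str.replace(r"[\s.]", "_", regex=True).str.strip("_")


def with_list_comprehension():
    return pd.Index([pattern.sub("_", x).strip("_") for x in labels])


# Both approaches should produce identical labels before we compare speed.
assert list(with_str_accessor()) == list(with_list_comprehension())

print("str accessor:      ", timeit.timeit(with_str_accessor, number=20))
print("list comprehension:", timeit.timeit(with_list_comprehension, number=20))
```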

Upvotes: 0

cs95

Reputation: 402824

Why not use a list comprehension?

import re
df.columns = [re.sub(r'[\s.]', '_', x).strip('_') for x in df.columns]

In a list comp, you're working with the string object directly, without the need to call .str each time.
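
For a version you can paste and run as-is, here is a small sketch; the sample column names are made up for illustration.

```python
import re

import pandas as pd

# Made-up column names containing spaces, periods, and stray underscores.
df = pd.DataFrame(columns=["first name", "last.name_", " total.amount "])

# Replace each whitespace character or period with "_",
# then trim underscores from both ends of the result.
df.columns = [re.sub(r"[\s.]", "_", col).strip("_") for col in df.columns]

print(list(df.columns))
# ['first_name', 'last_name', 'total_amount']
```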

Upvotes: 2

jezrael

Reputation: 863176

I think you need to repeat .str for each string method; that is by design.

But here it is possible to use only one replace:

import pandas as pd

df = pd.DataFrame(columns=['aa dd', 'dd.d_', 'd._'])

print (df)
Empty DataFrame
Columns: [aa dd, dd.d_, d._]
Index: []

print (df.columns.str.replace(r'[\s.]', '_', regex=True).str.strip('_'))
Index(['aa_dd', 'dd_d', 'd'], dtype='object')

Upvotes: 7
