JC23

Reputation: 1328

fastest way to iterate pandas series/column

I'm more used to for loops, but they can become slow in pandas once the data gets large. I keep finding iterrows, iter..., etc. examples but want to know if there's a faster way. What I currently have is:

newnames = []
names = df['name'].tolist()
for i in names:
  i = i.replace(' ','_')
  newnames.append(i)

I could then add the newnames list to the df as a pandas column. Or should I rewrite the existing df['name'] values in place? I'm not too familiar with pandas best practices, so I welcome all feedback. Thanks.

Upvotes: 0

Views: 245

Answers (2)

ifly6

Reputation: 5331

Just use the vectorised string operation:

newnames = df['name'].str.replace(' ', '_', regex=False).tolist()

Generally, with pandas, you want to avoid loops where possible. There is usually a built-in way around a loop if you look through the library, so working with pandas involves some amount of looking up syntax (unless what you're looking for is pretty nonstandard).

Basically, if something you want to do would prima facie require a for loop and doing that is probably something that people would want to do regularly, it's probably in the library.
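
As a minimal sketch of that point, assuming a small made-up DataFrame (the sample names are hypothetical, only the name column comes from the question), the loop from the question and the vectorised call should give the same list:

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df
df = pd.DataFrame({"name": ["John Smith", "Mary Ann Lee", "Bob"]})

# Loop version, as in the question
loop_names = []
for name in df["name"].tolist():
    loop_names.append(name.replace(" ", "_"))

# Vectorised version: .str applies the string method to every element
vec_names = df["name"].str.replace(" ", "_", regex=False).tolist()

print(loop_names == vec_names)
```

Here regex=False asks for a plain substring replacement, so the pattern is never interpreted as a regular expression.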

Upvotes: 3

SeaBean

Reputation: 23227

If you ultimately want to add newnames to df as a column, you can do it directly:

df['newnames'] = df['name'].str.replace(' ', '_')

If you just want to replace all spaces with _ in the name column, you can also do it directly on the original column (overwriting it), as follows:

df['name'] = df['name'].str.replace(' ', '_')

Either way, we use pandas' vectorized string operation, which works on the whole column at once and has been optimized for faster execution, instead of an explicit Python loop, which has not been optimized and is slow.
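
For illustration, here is a rough sketch of both options side by side on a made-up two-row DataFrame (the sample values are hypothetical). It passes regex=False explicitly, since the default for that parameter differs between pandas versions (it was True before pandas 2.0):

```python
import pandas as pd

# Hypothetical sample data; only the "name" column comes from the question
df = pd.DataFrame({"name": ["John Smith", "Mary Ann Lee"]})

# Option 1: keep the original column and add the cleaned values alongside it
df["newnames"] = df["name"].str.replace(" ", "_", regex=False)

# Option 2: overwrite the original column with the cleaned values
df["name"] = df["name"].str.replace(" ", "_", regex=False)

print(df)
```

In practice you would pick one of the two options; running both leaves two identical columns.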

Upvotes: 1
