Faster processing of Dataframe in Pandas

Question

I am trying to process very large files (10,000+ observsstions) where zip codes are not easily formatted. I need to convert them all to just the first 5 digits, and here is my current code:

def makezip(frame, zipcol):
    i = 0
    while i < len(frame):
        frame[zipcol][i] = frame[zipcol][i][:5]
        i += 1
    return frame

frame is the dataframe, and zipcol is the name of the column containing the zip codes. Although this works, it takes a very long time to process. Is there a quicker way?

joris · Accepted Answer

You can use the .str accessor on string columns to access some specific string methods. And on this, you can also slice:

frame[zipcol] = frame[zipcol].str[:5]

Based on a small example, this is around 50 times faster as looping over the rows:

In [29]: s = pd.Series(['testtest']*10000)

In [30]: %timeit s.str[:5]
100 loops, best of 3: 3.06 ms per loop

In [31]: %timeit str_loop(s)
10 loops, best of 3: 164 ms per loop

whith

In [27]: def str_loop(s):
   .....:     for i in range(len(s)):
   .....:         s[i] = s[i][:5]
   .....:

Faster processing of Dataframe in Pandas

Answers (1)

Related Questions