Reputation: 209
I am trying to process very large files (10,000+ observations) where the zip codes are not consistently formatted. I need to convert them all to just the first 5 digits, and here is my current code:
def makezip(frame, zipcol):
    i = 0
    while i < len(frame):
        frame[zipcol][i] = frame[zipcol][i][:5]
        i += 1
    return frame
frame is the dataframe, and zipcol is the name of the column containing the zip codes. Although this works, it takes a very long time to process. Is there a quicker way?
Upvotes: 5
Views: 470
Reputation: 139162
You can use the .str accessor on string columns to access string-specific methods. It also supports slicing:
frame[zipcol] = frame[zipcol].str[:5]
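As a minimal sketch of how this behaves on mixed input, here is a small self-contained example. The column name "zip" and the sample values are made up for illustration and are not from the question's data:

```python
import pandas as pd

# Hypothetical sample data: ZIP+4 with a hyphen, a plain 5-digit ZIP,
# a 9-digit ZIP without a hyphen, and a missing value.
frame = pd.DataFrame({"zip": ["12345-6789", "98765", "123456789", None]})

# Vectorized slice: keeps the first five characters of every string.
# Missing values stay missing instead of raising an error.
frame["zip"] = frame["zip"].str[:5]

print(frame["zip"].tolist())
```

Note that .str only works on string data. If the column was read in as integers, it raises an AttributeError; converting with astype(str) first is one option, although any leading zeros would already have been lost at that point.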
Based on a small example, this is around 50 times faster than looping over the rows:
In [29]: s = pd.Series(['testtest']*10000)
In [30]: %timeit s.str[:5]
100 loops, best of 3: 3.06 ms per loop
In [31]: %timeit str_loop(s)
10 loops, best of 3: 164 ms per loop
with
In [27]: def str_loop(s):
   .....:     for i in range(len(s)):
   .....:         s[i] = s[i][:5]
   .....:
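If you want to reproduce the comparison outside IPython, something along these lines should work with the standard timeit module. This is only a sketch: the loop here works on a copy and uses .iloc for positional access, which differs slightly from the str_loop above, and the repeat count of 3 is an arbitrary choice. Absolute timings will depend on your machine and pandas version.

```python
import timeit

import pandas as pd


def str_loop(s):
    # Row-by-row version; works on a copy so the input Series is not modified.
    s = s.copy()
    for i in range(len(s)):
        s.iloc[i] = s.iloc[i][:5]
    return s


s = pd.Series(["testtest"] * 10000)

# Time each approach over a few runs (total seconds for all runs).
vectorized = timeit.timeit(lambda: s.str[:5], number=3)
looped = timeit.timeit(lambda: str_loop(s), number=3)

print(f"vectorized: {vectorized:.4f} s, loop: {looped:.4f} s")
```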
Upvotes: 7