ajbentley
ajbentley

Reputation: 303

Trying to strip +4 from zip code in mixed type pd series

I know similar questions have been asked, but none of the solutions I've found have worked (I've listed some of them at the bottom).

I have a list of zip codes where some of them include the +4, which I don't need. Others are either too short (typos) or foreign. Currently all cells are objects.

Example could be:

member  state   country     zip         joined
16081   NY  UNITED STATES   11215       9/4/09
21186   NY  UNITED STATES   5325        8/9/11
34999   NY  UNITED STATES   11218       11/4/16 
34999   NY  FOOBAR STATES   NaN         11/4/16 
5033    NY  UNITED STATES   11238-1630  11/7/16 
35079   NY  FOOBAR STATES   SW4 9JX     11/13/16    
35084   NY  UNITED STATES   11217-2181  11/14/16    

and I'd like to end up with

member  state  country      zip         joined
16081   NY  UNITED STATES   11215       9/4/09
21186   NY  UNITED STATES   5325        8/9/11
34999   NY  UNITED STATES   11218       11/4/16 
34999   NY  FOOBAR STATES   NA          11/4/16 
5033    NY  UNITED STATES   11238       11/7/16 
35079   NY  FOOBAR STATES   SW4 9JX     11/13/16    
35084   NY  UNITED STATES   11217       11/14/16    

Here are a few things I've tried in terms of coding:

for x in df.zip:
    if len(x) > 5:
        print x.split("-")[0]
        x[:x.index("-")]

returns TypeError: object of type 'float' has no len()


df['zips'] = df['zip'].map(lambda x: x.rstrip('-'/n))

returns NameError: global name 'n' is not defined


def zipclip(x): 
    if x.isnumeric:
          if len(x) > 5:
              return z[:5]
          elif len(x) < 5:
              return "NA"

returns AttributeError: 'str' object has no attribute 'isnumeric'


df.zip = [line[:5] if line[:5].isnumeric() and line[6:].isnumeric() else\
line for line in zip if line]

returns TypeError: 'builtin_function_or_method' object is not iterable


Here are some of the places I've looked:

(sorry if I've gone overboard with documentation--I've been criticized in the past and wanted to make sure folks knew I've been working on it)

Upvotes: 2

Views: 998

Answers (3)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210952

Try this:

In [88]: df.zip = df.zip.str.split('-').str[0]

In [89]: df
Out[89]:
   member state        country      zip    joined
0   16081    NY  UNITED STATES    11215    9/4/09
1   21186    NY  UNITED STATES     5325    8/9/11
2   34999    NY  UNITED STATES    11218   11/4/16
3   34999    NY  FOOBAR STATES      NaN   11/4/16
4    5033    NY  UNITED STATES    11238   11/7/16
5   35079    NY  FOOBAR STATES  SW4 9JX  11/13/16
6   35084    NY  UNITED STATES    11217  11/14/16

Upvotes: 1

ajbentley
ajbentley

Reputation: 303

I'm guessing you guys have done it in a more pythonically correct way (which I'll be checking out)! But this worked for me ( just needed to change the NaN to a string, "NA"):

df['zip'].fillna('NA', inplace=True)                   
def ziplist(x):
    if len(x) > 5:
        return x[0:4]
    else:
        return x

df['zip'] = df['zip'].apply(ziplist);

Upvotes: 0

boot-scootin
boot-scootin

Reputation: 12515

You can take the zip column and turn it into a list

df = pd.DataFrame({"zip": ['11215', '5325', '11218', np.nan, '11238-1630', 'SW4 9JX', '11217-2181']})

df['new_zip'] = [str(x).split('-')[0] if str(x) != 'nan' else 'NA' for x in df['zip']]

The split method here will only "split" if there's a '-' in the string, otherwise the whole string itself will be returned.

The results in :

>>> df
          zip  new_zip
0       11215    11215
1        5325     5325
2       11218    11218
3         NaN       NA
4  11238-1630    11238
5     SW4 9JX  SW4 9JX
6  11217-2181    11217

Upvotes: 0

Related Questions