Thamali Wijewardhana
Thamali Wijewardhana

Reputation: 512

Extract integers from string value in a pandas data frame cell

I have a data frame train_df and it has a column named AgeuponOutcome which contains values as follows AgeuponOutcome

1 year
1 year
2 years
3 weeks
2 years
1 month
3 weeks
3 weeks
5 months
1 year
2 years
2 years
4 years

Given below is my full data set.

OutcomeType AnimalType  SexuponOutcome  AgeuponOutcome  Breed   Color
Return_to_owner Dog Neutered Male   1 year  Shetland Sheepdog Mix   Brown/White
Euthanasia  Cat Spayed Female   1 year  Domestic Shorthair Mix  Cream Tabby
Adoption    Dog Neutered Male   2 years Pit Bull Mix    Blue/White
Transfer    Cat Intact Male 3 weeks Domestic Shorthair Mix  Blue Cream
Transfer    Dog Neutered Male   2 years Lhasa Apso/Miniature Poodle Tan
Transfer    Dog Intact Female   1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
Transfer    Cat Intact Male 3 weeks Domestic Shorthair Mix  Blue Tabby
Transfer    Cat Unknown 3 weeks Domestic Shorthair Mix  Brown Tabby
Adoption    Dog Spayed Female   5 months    American Pit Bull Terrier Mix   Red/White
Adoption    Dog Spayed Female   1 year  Cairn Terrier   White
Transfer    Cat Unknown 2 years Domestic Shorthair Mix  Black
Adoption    Dog Spayed Female   2 years Miniature Schnauzer Mix Silver
Adoption    Dog Neutered Male   4 years Pit Bull Mix    Brown

Given below is the code I have used to extract the integers from the string values in AgeuponOutcome column.

word='month'  
l = len(train_df)
    for i in range(l):
        if word in train_df.loc[i, 'AgeuponOutcome']: 
            print re.findall("\d+", train_df.loc[i, 'AgeuponOutcome'])

But it gives the following error

TypeError                                 Traceback (most recent call last)
<ipython-input-60-8a6df57d3cb9> in <module>()
      1 l = len(train_df)
      2 for i in range(l):
----> 3     if word in train_df.loc[i, 'AgeuponOutcome']:
      4         print re.findall("\d+", train_df.loc[i, 'AgeuponOutcome'])

TypeError: argument of type 'int' is not iterable

Can you tell me how to fix the error and extract the value. For example, I need to extract 1 from '1 year' and print 1

Upvotes: 1

Views: 1623

Answers (1)

jezrael
jezrael

Reputation: 862511

You can use str.extract with contains and loc with boolean indexing:

df1 = (df.AgeuponOutcome.str.extract('(\d+) (\w+)', expand=True))
df1.columns = ['a','b']
print (df1)
    a       b
0   1    year
1   1    year
2   2   years
3   3   weeks
4   2   years
5   1   month
6   3   weeks
7   3   weeks
8   5  months
9   1    year
10  2   years
11  2   years
12  4   years

print (df1.loc[df1.b.str.contains('month'), 'a'])
5    1
8    5
Name: a, dtype: object

print (df1.loc[df1.b.str.contains('year'), 'a'])
0     1
1     1
2     2
4     2
9     1
10    2
11    2
12    4
Name: a, dtype: object

If you need output as new columns:

df1['month'] = (df1.loc[df1.b.str.contains('month'), 'a'])
df1['year'] = (df1.loc[df1.b.str.contains('year'), 'a'])
df1['week'] = (df1.loc[df1.b.str.contains('week'), 'a'])
print (df1)
    a       b month year week
0   1    year   NaN    1  NaN
1   1    year   NaN    1  NaN
2   2   years   NaN    2  NaN
3   3   weeks   NaN  NaN    3
4   2   years   NaN    2  NaN
5   1   month     1  NaN  NaN
6   3   weeks   NaN  NaN    3
7   3   weeks   NaN  NaN    3
8   5  months     5  NaN  NaN
9   1    year   NaN    1  NaN
10  2   years   NaN    2  NaN
11  2   years   NaN    2  NaN
12  4   years   NaN    4  NaN

EDIT by comment:

You can use:

#convert to int
df1['a'] = df1.a.astype(int)

#divide by constant to column a
df1.loc[df1.b.str.contains('month'), 'a'] = df1.loc[df1.b.str.contains('month'), 'a'] / 12
df1.loc[df1.b.str.contains('week'), 'a'] = df1.loc[df1.b.str.contains('week'), 'a']  /52.1429
print (df1)
           a       b
0   1.000000    year
1   1.000000    year
2   2.000000   years
3   0.057534   weeks
4   2.000000   years
5   0.083333   month
6   0.057534   weeks
7   0.057534   weeks
8   0.416667  months
9   1.000000    year
10  2.000000   years
11  2.000000   years
12  4.000000   years

Upvotes: 2

Related Questions