shorttriptomars
shorttriptomars

Reputation: 325

Filtering CSV with pandas

I'm having difficulty filtering this CSV file.

Here are a few entries from the csv table:

Name      Info                                 Bio
Alice     Woman: 21y (USA)                     Actress
Breonna   Woman: (France)                      Singer
Carla     Woman: 30y (Trinidad and Tobago)     Actress
Diana     Woman: (USA)                         Singer

I'm trying to filter the Info row to get all a list of countries and their frequency. I'm also trying to do the same thing with age. As you can see, not all women post their age.

I've tried

women= pd.read_csv('women.csv')
women_count = pd.Series(' '.join(women.Info).split()).value_counts()

However, that splits everything and outputs:

Woman:     4
(USA)      2
21y        1
(Trinidad  1
and        1
Tobago)    1
30y        1

I should add I've tried women_filtered = women[women['Info'] == '(USA)'] and that doesn't work

My questions are:

  1. How can I split the string just to filter by countries especially since all the countries are in parenthesis?
  2. How can I filter for entries that do not have age?

Thank You

Upvotes: 1

Views: 84

Answers (2)

wwnde
wwnde

Reputation: 26676

print(df)

      Name                       Info      Bio
0    Alice           Woman: 21y (USA)  Actress
1    Carla  30y (Trinidad and Tobago)   Singer
2  Breonna            Woman: (France)  Actress
3    Diana               Woman: (USA)   Singer

#Solution



#Extract Name of countries
 df=df.assign(Age=df.Info.str.extract('(\d+(?=\D))'), Countries=df.Info.str.extract('\((.*?)\)'))

Name                       Info             Bio     Age                   Countries
    0    Alice           Woman: 21y (USA)  Actress   21                  USA
    1    Carla  30y (Trinidad and Tobago)   Singer   30  Trinidad and Tobago
    2  Breonna            Woman: (France)  Actress  NaN               France
    3    Diana               Woman: (USA)   Singer  NaN                  USA
    



#Filter without Age
df[df.Age.isna()]

     Name             Info      Bio  Age  Countries
2  Breonna  Woman: (France)  Actress  NaN    France
3    Diana     Woman: (USA)   Singer  NaN       USA

Upvotes: 1

jtorca
jtorca

Reputation: 1541

import pandas as pd

df = pd.DataFrame(
{'Name':['Alice', 'Breonna', 'Carla', 'Diana'],
 'Info':['Woman: 21y (USA)', 'Woman: (France)', 'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
 'Bio':['Actress', 'Singer', 'Actress', 'Singer']}
)

# defining columns using regex
df['country'] = df['Info'].str.extract('\(([^\)]+)\)')
df['age'] = df['Info'].str.extract('[\s]+([\d]{2})y[\s]+').astype(float)
df['noage'] = df['age'].isnull().astype(int)

# frequency of countries
sizes = df.groupby('country').size()
sizes

This will output the frequencies.

country
France                 1
Trinidad and Tobago    1
USA                    2
dtype: int64

I would lookup how to write regex expressions so you can learn how to extract information from strings yourself. Pythex.org is a nice website to try out regex expressions in Python and has some useful tips.

Upvotes: 1

Related Questions