Reputation: 325
I'm having difficulty filtering this CSV file.
Here are a few entries from the csv table:
Name Info Bio
Alice Woman: 21y (USA) Actress
Breonna Woman: (France) Singer
Carla Woman: 30y (Trinidad and Tobago) Actress
Diana Woman: (USA) Singer
I'm trying to filter the Info row to get all a list of countries and their frequency. I'm also trying to do the same thing with age. As you can see, not all women post their age.
I've tried
women= pd.read_csv('women.csv')
women_count = pd.Series(' '.join(women.Info).split()).value_counts()
However, that splits everything and outputs:
Woman: 4
(USA) 2
21y 1
(Trinidad 1
and 1
Tobago) 1
30y 1
I should add I've tried women_filtered = women[women['Info'] == '(USA)']
and that doesn't work
My questions are:
Thank You
Upvotes: 1
Views: 84
Reputation: 26676
print(df)
Name Info Bio
0 Alice Woman: 21y (USA) Actress
1 Carla 30y (Trinidad and Tobago) Singer
2 Breonna Woman: (France) Actress
3 Diana Woman: (USA) Singer
#Solution
#Extract Name of countries
df=df.assign(Age=df.Info.str.extract('(\d+(?=\D))'), Countries=df.Info.str.extract('\((.*?)\)'))
Name Info Bio Age Countries
0 Alice Woman: 21y (USA) Actress 21 USA
1 Carla 30y (Trinidad and Tobago) Singer 30 Trinidad and Tobago
2 Breonna Woman: (France) Actress NaN France
3 Diana Woman: (USA) Singer NaN USA
#Filter without Age
df[df.Age.isna()]
Name Info Bio Age Countries
2 Breonna Woman: (France) Actress NaN France
3 Diana Woman: (USA) Singer NaN USA
Upvotes: 1
Reputation: 1541
import pandas as pd
df = pd.DataFrame(
{'Name':['Alice', 'Breonna', 'Carla', 'Diana'],
'Info':['Woman: 21y (USA)', 'Woman: (France)', 'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
'Bio':['Actress', 'Singer', 'Actress', 'Singer']}
)
# defining columns using regex
df['country'] = df['Info'].str.extract('\(([^\)]+)\)')
df['age'] = df['Info'].str.extract('[\s]+([\d]{2})y[\s]+').astype(float)
df['noage'] = df['age'].isnull().astype(int)
# frequency of countries
sizes = df.groupby('country').size()
sizes
This will output the frequencies.
country
France 1
Trinidad and Tobago 1
USA 2
dtype: int64
I would lookup how to write regex expressions so you can learn how to extract information from strings yourself. Pythex.org is a nice website to try out regex expressions in Python and has some useful tips.
Upvotes: 1