Isak Baizley

Reputation: 1843

Pandas Split Column String and Plot unique values

I have a dataframe Df that looks like this:

                        Country  Year  
0                Australia, USA  2015   
1            USA, Hong Kong, UK  1982   
2                           USA  2012   
3                           USA  1994   
4                   USA, France  2013   
5                         Japan  1988   
6                         Japan  1997   
7                           USA  2013   
8                        Mexico  2000   
9                       USA, UK  2005   
10                          USA  2012   
11                      USA, UK  2014   
12                          USA  1980   
13                          USA  1992   
14                          USA  1997   
15                          USA  2003   
16                          USA  2004   
17                          USA  2007    
18                 USA, Germany  2009   
19                        Japan  2006   
20                        Japan  1995  

I want to make a bar chart for the Country column. If I try this:

Df.Country.value_counts().plot(kind='bar')

the resulting plot is incorrect because it doesn't separate the countries: each combined string, such as "USA, UK", gets its own bar. My goal is a bar chart that shows the count of each individual country in the column. To get there, I first have to split the string in each row (where needed) and then plot the data. I know I can use Df.Country.str.split(', ') to split the strings, but the result is a Series of lists, which I can't plot directly.

Does anyone have an idea how to solve this problem?

Upvotes: 2

Views: 2210

Answers (3)

unutbu

Reputation: 879869

You could use the vectorized Series.str.split method to split the countries:

In [163]: df['Country'].str.split(r',\s+', expand=True)
Out[163]: 
            0          1     2
0   Australia        USA  None
1         USA  Hong Kong    UK
2         USA       None  None
3         USA       None  None
4         USA     France  None
...

If you stack this DataFrame to move all the values into a single column, then you can apply value_counts and plot as before:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
{'Country': ['Australia, USA', 'USA, Hong Kong, UK', 'USA', 'USA', 'USA, France', 'Japan', 'Japan', 'USA', 'Mexico', 'USA, UK', 'USA', 'USA, UK', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA, Germany', 'Japan', 'Japan'],
 'Year': [2015, 1982, 2012, 1994, 2013, 1988, 1997, 2013, 2000, 2005, 2012, 2014, 1980, 1992, 1997, 2003, 2004, 2007, 2009, 2006, 1995]})
counts = df['Country'].str.split(r',\s+', expand=True).stack().value_counts()
counts.plot(kind='bar')
plt.show()
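
As a possible alternative to expand=True followed by stack(), newer pandas versions (0.25 and later) provide Series.explode, which turns each element of a list into its own row. The following is only a sketch under that version assumption; it reuses the Country data from the question and stops at the counts, which can then be plotted exactly as above.

```python
import pandas as pd

# Same Country data as in the question
df = pd.DataFrame(
    {'Country': ['Australia, USA', 'USA, Hong Kong, UK', 'USA', 'USA',
                 'USA, France', 'Japan', 'Japan', 'USA', 'Mexico', 'USA, UK',
                 'USA', 'USA, UK', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA',
                 'USA, Germany', 'Japan', 'Japan']})

# Split each string into a list of countries, then give every
# list element its own row before counting
counts = df['Country'].str.split(', ').explode().value_counts()
print(counts)
```

Calling counts.plot(kind='bar') on the result should give one bar per country.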

Upvotes: 6

user1846747

Reputation:

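new_df = pd.concat([pd.Series(row['Year'], index=row['Country'].split(', ')) for _, row in DF.iterrows()]).reset_index()

(DF is your original DataFrame.) This gives you one data point for each country name. Note that the split uses ', ' (comma plus space) so that the country names carry no leading whitespace.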
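
To illustrate what this long-format result looks like and how it could be counted, here is a small sketch on three rows taken from the question. The column names assigned after reset_index() and the variable name long_df are my own choices for illustration, not part of the original answer.

```python
import pandas as pd

# Three sample rows from the question's data
DF = pd.DataFrame({'Country': ['Australia, USA', 'USA, Hong Kong, UK', 'Japan'],
                   'Year': [2015, 1982, 1988]})

# One Series per row: the year repeated once for each country in that row
long_df = pd.concat(
    [pd.Series(row['Year'], index=row['Country'].split(', '))
     for _, row in DF.iterrows()]
).reset_index()

# reset_index() yields the default column names 'index' and 0; rename them
long_df.columns = ['Country', 'Year']

# Count how often each country appears
counts = long_df['Country'].value_counts()
print(long_df)
print(counts)
```

From here, counts.plot(kind='bar') would draw the bar chart.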

Hope this helps.

Cheers!

Upvotes: 1

Alexander

Reputation: 109546

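from collections import Counter

import pandas as pd

# Split on ', ' (comma plus space) so that 'USA' and ' USA' are not counted separately
c = pd.Series(Counter(df.Country.str.split(', ').sum()))
c.plot(kind='bar', title='Country Count')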
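
Summing a Series of lists concatenates them one by one, which can become slow on large columns. A variant of the same Counter idea, sketched here with itertools.chain on a few sample values (the variable names countries and split_lists are illustrative only), avoids building the intermediate lists:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# A few sample values in the same format as the Country column
countries = pd.Series(['Australia, USA', 'USA, Hong Kong, UK', 'USA', 'Japan'])

# Each element becomes a list of country names
split_lists = countries.str.split(', ')

# chain.from_iterable walks through all the lists lazily,
# so Counter sees one flat stream of country names
c = pd.Series(Counter(chain.from_iterable(split_lists)))
print(c)
```

The resulting Series can be plotted with c.plot(kind='bar') as before.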


Upvotes: 2
