noob

Reputation: 3811

Scatter plot of two variables with different numbers of values: difficulty implementing

Data Source: https://www.kaggle.com/worldbank/world-development-indicators Folder: 'world-development-indicators' File: Indicators.csv

I am trying to draw a scatter plot of two variables. However, the two variables do not have the same number of values.

The dataset looks like this. It is stored in a variable named data:

CountryCode IndicatorName                   Year    Value
USA         Population, total               1993    72498
USA         Population, total               1994    76700
USA         Population, female (% of total) 1993    50.52691109
USA         Population, female (% of total) 1994    50.57235984
USA         GDP per capita (const 2005 US$) 1994    23086.93795
USA         Population, female (% of total) 1988    50.91933134
USA         Population, total               1988    61077

I want to draw a scatter plot of two things: absolute female population and GDP per capita (const 2005 US$). Absolute female population = Population, total * Population, female (% of total) / 100

Challenges are as below:

a) The Population, total; Population, female; and GDP values exist for a different number of years within one country. For example, for the USA, Population, total might exist for 20 years, the female population percentage for 18 years, and GDP for only 10 years.

There are no NaN/null values.

I need only the rows where all three of these indicators are present for a given country and year.

I am new to Python, so I am unable to express what I want in code. Can anyone please help?

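 import numpy as np

 # Rows holding the female share of the population (in percent)
 femalepop_filter = data['IndicatorName'].str.contains('Population, female')
 FemalePop = data[femalepop_filter]

 # Rows holding the total population
 Pop_total = data['IndicatorName'].str.contains('Population, total')
 Pop_Tot = data[Pop_total]

 # Rows holding GDP per capita; raw string because "(" is escaped for the regex
 hist_indicator = r'GDP per capita \(const 2005'
 GDP_Filter = data['IndicatorName'].str.contains(hist_indicator)
 GDPValues = data[GDP_Filter]

 # Country codes that appear in all three subsets
 c1 = FemalePop['CountryCode']
 c2 = GDPValues['CountryCode']
 c3 = Pop_Tot['CountryCode']
 c4 = np.intersect1d(c1, c2)
 c5 = np.intersect1d(c3, c4)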

I collected the country codes for each indicator and stored their intersection in c5. Can someone show me how to get the rows of data whose country code is in c5?

Upvotes: 2

Views: 148

Answers (3)

noob

Reputation: 3811

I found the answer.

    import pandas as pd

    # Keep only the rows whose country code is in the intersection c5.
    data2 = data[data['CountryCode'].isin(c5)].copy()

    # New column concatenating country code and year, so that rows for the
    # same country and year can be matched.
    data2['concatyearandCC'] = data2['CountryCode'] + data2['Year'].map(str)

    # Assumed step (not shown originally): the filters from the question,
    # re-applied to data2.
    FemalePop2 = data2[data2['IndicatorName'].str.contains('Population, female')]
    Pop_Tot2 = data2[data2['IndicatorName'].str.contains('Population, total')]
    GDPValues2 = data2[data2['IndicatorName'].str.contains(r'GDP per capita \(const 2005')]

    # Merge female population %, total population and GDP on the new key, so
    # that only country/year pairs present in all three remain.
    c9 = pd.merge(FemalePop2, Pop_Tot2, on='concatyearandCC')
    c10 = pd.merge(c9, GDPValues2, on='concatyearandCC')

    # Rename the value columns for ease.
    c10.rename(columns={'Value_x': 'Population_female%',
                        'Value_y': 'Population Total',
                        'Value': 'GDP Per capita'}, inplace=True)

    # Absolute female population (the female share is a percentage).
    c10['Abs_Female_Pop'] = c10['Population_female%'] * c10['Population Total'] / 100
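
As a variation, the merge can be keyed on CountryCode and Year directly, without a concatenated helper column. This is only a minimal sketch: the small DataFrame stands in for Indicators.csv and reuses the sample rows from the question, and the helper `select` and the column names `female_pct`, `pop_total`, `gdp_per_capita` and `abs_female_pop` are made up for illustration.

```python
import pandas as pd

# Stand-in for Indicators.csv, built from the sample rows in the question.
data = pd.DataFrame({
    "CountryCode": ["USA"] * 7,
    "IndicatorName": [
        "Population, total",
        "Population, total",
        "Population, female (% of total)",
        "Population, female (% of total)",
        "GDP per capita (const 2005 US$)",
        "Population, female (% of total)",
        "Population, total",
    ],
    "Year": [1993, 1994, 1993, 1994, 1994, 1988, 1988],
    "Value": [72498, 76700, 50.52691109, 50.57235984,
              23086.93795, 50.91933134, 61077],
})


def select(indicator, new_name):
    """Hypothetical helper: rows for one indicator, value column renamed."""
    rows = data[data["IndicatorName"] == indicator]
    return rows[["CountryCode", "Year", "Value"]].rename(columns={"Value": new_name})


keys = ["CountryCode", "Year"]

# The default inner join keeps only country/year pairs present in all three.
merged = (
    select("Population, female (% of total)", "female_pct")
    .merge(select("Population, total", "pop_total"), on=keys)
    .merge(select("GDP per capita (const 2005 US$)", "gdp_per_capita"), on=keys)
)

# The female share is a percentage, hence the division by 100.
merged["abs_female_pop"] = merged["pop_total"] * merged["female_pct"] / 100
print(merged)
```

With matplotlib installed, `merged.plot.scatter(x="gdp_per_capita", y="abs_female_pop")` should then draw the scatter plot, since both columns have the same length by construction.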

Upvotes: 1

Kumar Sourav

Reputation: 419

Try something like data[data['CountryCode'].isin(c5)]
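
A small sketch of how that filter behaves, assuming c5 is an array of country codes as in the question. The sample rows and codes here are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample data; only the CountryCode column matters for the filter.
data = pd.DataFrame({
    "CountryCode": ["USA", "USA", "IND", "BRA"],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# Stand-in for the intersection computed with np.intersect1d in the question.
c5 = np.array(["BRA", "USA"])

# isin() returns a boolean Series; indexing with it keeps the matching rows.
data2 = data[data["CountryCode"].isin(c5)]
print(data2)
```

Note that this filters by country only. Rows for years where one of the indicators is missing are still present afterwards.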

Upvotes: 1

The error is telling you that Python doesn't know how to concatenate ("&") a string and a boolean variable.

Transform the bool to a string and your concatenation should work.

In general, debug your code step by step. First look at what the variables contain. You can use Python's "pretty print" (pprint) module for that. It lets you print all kinds of variables in a readable layout so you can see what they contain.
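
A minimal sketch of both suggestions, converting a bool to a string before concatenating and inspecting a variable with pprint. The names `is_valid`, `label` and `record` are made up for the example.

```python
from pprint import pformat, pprint

# Concatenating a string and a bool directly raises TypeError,
# so convert the bool with str() first.
is_valid = True
label = "is_valid: " + str(is_valid)
print(label)

# pprint lays out nested structures across lines so they are easier to read.
record = {
    "CountryCode": "USA",
    "years": [1988, 1993, 1994],
    "indicators": {"Population, total": 76700, "flag": is_valid},
}
pprint(record)

# pformat returns the same layout as a string, e.g. for logging.
text = pformat(record)
```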

Upvotes: 0
