Reputation: 101
I have a dataset showing yearly growth indicators for more than 100 countries, from 1970 to 2013. Not all countries have data for every year; the country with the fewest years has 30 years of data. I want to level things out so that every country has exactly 30 years of data, removing years from countries that have more than 30. I'm providing an example below.
I thought about using loops to remove data from the dataframe until all countries appear 30 times and then build a whole new dataframe, but I like to believe there's a better solution.
import pandas as pd
data = {'Country':['Israel','Congo','Denmark',
'Israel','Denmark',
'Israel','Congo',
'Israel','Congo','Denmark'],
'Year':[2000,2000,2000,
2001,2001,
2002,2002,
2003,2003,2003],
'Value':[2.5,1.2,3.1,2.8,1.1,2.9,3.1,1.9,3.0,3.1]}
df = pd.DataFrame(data=data)
df
Country Year Value
0 Israel 2000 2.5
1 Congo 2000 1.2
2 Denmark 2000 3.1
3 Israel 2001 2.8
4 Denmark 2001 1.1
5 Israel 2002 2.9
6 Congo 2002 3.1
7 Israel 2003 1.9
8 Congo 2003 3.0
9 Denmark 2003 3.1
The code above creates a dataframe with an example using just 3 countries and 4 years. From the dataframe, you can see that Israel has 4 years' worth of data while Denmark and Congo have only 3. I want to remove a year from Israel so that all countries have 3 years. In the real dataframe I want to remove years from countries that have more than 30 years so that all countries have the same number of years, preferably removing the years with the lowest Value.
Here's my solution using for loops, which uses a lot of lines of code:
gp = df.groupby('Country').groups  # Map each country name to its row indices.
d = {}  # Build dictionary: country name => list of indices to keep.
for i in gp:
    # Keep at most the first 3 indices for each country. A country appears
    # once per year in the dataset, so 3 indices means 3 years.
    d[i] = []
    for j in gp[i]:
        if len(d[i]) < 3:
            d[i].append(j)
indices = []  # Gather the indices to keep in the dataframe.
for i in d:
    for j in d[i]:
        if len(d[i]) == 3:  # Make sure the list has exactly 3 items.
            indices.append(j)
final_df = df.loc[indices, ['Country', 'Year', 'Value']]
final_df
#Now I have one less value for Israel, so all countries have 3 values.
Country Year Value
1 Congo 2000 1.2
6 Congo 2002 3.1
8 Congo 2003 3.0
2 Denmark 2000 3.1
4 Denmark 2001 1.1
9 Denmark 2003 3.1
0 Israel 2000 2.5
3 Israel 2001 2.8
5 Israel 2002 2.9
Upvotes: 3
Views: 1192
Reputation: 101
This is my solution using Pandas. It did what it had to do even though it uses many lines of code. Thanks to @Vaishali for the help:
threshold = 3  # Countries that occur fewer times than this are removed;
# for countries that occur more often, the extra occurrences with the
# lowest values are removed.
newIndex = df.set_index('Country')  # Index by country to make selection by index possible.
values = newIndex.index.value_counts()  # Count occurrences of each country.
# Keep countries that occur >= threshold times.
to_keep = values[values >= threshold].index.values
# Select the rows and columns to keep.
rank_df = newIndex.loc[to_keep, ['Value', 'Year']]
# Sort values in descending order before applying the threshold.
rank_df = rank_df.sort_values('Value', ascending=False)
# Group again; since values are sorted, head() keeps the highest values.
rank_df = rank_df.groupby(rank_df.index).head(threshold)
rank_df = rank_df.groupby([rank_df.index, 'Year']).mean() \
    .sort_values('Value', ascending=False)
# Finally, reset the index to convert the Year level into a column, and sort by year.
rank_df.reset_index(level=1).sort_values('Year')
Output:
Year Value
Country
Denmark 2000 3.1
Israel 2000 2.5
Congo 2000 1.2
Israel 2001 2.8
Denmark 2001 1.1
Congo 2002 3.1
Israel 2002 2.9
Denmark 2003 3.1
Congo 2003 3.0
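The same idea (drop countries below the threshold, then keep only the highest values for the rest) can probably be written more compactly. The following is a minimal sketch rather than a drop-in replacement; it assumes the example dataframe from the question, and the names counts, eligible and top are illustrative only:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

threshold = 3

# Number of rows per country, aligned with the original rows.
counts = df.groupby('Country')['Country'].transform('size')

# Drop countries that have fewer than `threshold` rows.
eligible = df[counts >= threshold]

# Sort by Value (highest first), keep the first `threshold` rows per
# country, then restore a Year/Country ordering.
top = (eligible.sort_values('Value', ascending=False)
               .groupby('Country').head(threshold)
               .sort_values(['Year', 'Country'])
               .reset_index(drop=True))

print(top)
```

With the example data this drops only the Israel 2003 row, which is Israel's lowest Value.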
Upvotes: 1
Reputation: 38415
You can build a list of the most recent years from the unique values in the Year column and use Boolean indexing to filter the dataframe with that list.
recent_years = df.Year.unique()[-3:]
df[df.Year.isin(recent_years)]
Country Year Value
3 Israel 2001 2.8
4 Denmark 2001 1.1
5 Israel 2002 2.9
6 Congo 2002 3.1
7 Israel 2003 1.9
8 Congo 2003 3.0
9 Denmark 2003 3.1
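Filtering on a shared list of years keeps only the rows that exist for those years, so a country with a gap inside that window ends up with fewer rows than the others. A quick way to check this, assuming the example dataframe from the question (subset and counts are illustrative names):

```python
import numpy as np
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

# The three most recent years present anywhere in the data.
recent_years = np.sort(df['Year'].unique())[-3:]
subset = df[df['Year'].isin(recent_years)]

# Rows remaining per country after the filter.
counts = subset['Country'].value_counts()
print(counts)
```

For the example data, Israel keeps 3 rows while Congo and Denmark keep 2 each.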
If your Year values are not necessarily in order, use NumPy's unique, which returns a sorted array, unlike pandas' unique():
import numpy as np
recent_years = np.unique(df.Year)[-3:]
df[df.Year.isin(recent_years)]
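To illustrate the ordering difference, here is a small sketch with a made-up, unsorted series of years (the values are hypothetical and not taken from the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted years, with one duplicate.
years = pd.Series([2003, 2000, 2002, 2001, 2000])

# pandas keeps the order of first appearance.
pandas_unique = years.unique()

# NumPy returns the unique values sorted in ascending order.
numpy_unique = np.unique(years)

print(pandas_unique)
print(numpy_unique)
```

Slicing with [-3:] therefore picks the three latest years only when the array is sorted.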
Here is another solution, which returns the 3 latest years for each country. If the data is not sorted by year, it needs to be sorted first.
idx = df.groupby('Country').apply(lambda x: x['Year'].tail(3)).index
df.set_index(['Country', df.index]).reindex(idx).reset_index().drop(columns='level_1')
Country Year Value
0 Congo 2000 1.2
1 Congo 2002 3.1
2 Congo 2003 3.0
3 Denmark 2000 3.1
4 Denmark 2001 1.1
5 Denmark 2003 3.1
6 Israel 2001 2.8
7 Israel 2002 2.9
8 Israel 2003 1.9
If the data is not sorted, first sort it using
df = df.sort_values(by = 'Year')
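A shorter route to the same per-country result may be to sort by year and take the tail of each group directly. This is a sketch under the assumption that each country has at most one row per year, as in the question's example; latest is an illustrative name:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

# Sort by year so that tail(3) picks the 3 latest years of each country,
# then order the result by country and year for readability.
latest = (df.sort_values('Year')
            .groupby('Country').tail(3)
            .sort_values(['Country', 'Year'])
            .reset_index(drop=True))

print(latest)
```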
Upvotes: 3