Reputation: 777
I have a dataframe including a month and year column. Both contain strings, e.g. 'September' and '2013'. How do I select all rows between September 2013 and May 2008 in one line?
df1 = stats_month_census_2[(stats_month_census_2['year'] <= '2013')
& (stats_month_census_2['year'] >= '2008')]
df2 = df1[...]
After the code above, I was going to do the same thing again, but I am having a hard time coming up with clever code to simply get rid of rows that are later than September 2013 (October to December) or earlier than May 2008. I could hard-code this easily, but there must be a more pythonic way of doing this...
Upvotes: 4
Views: 104
Reputation: 8826
Alternatively, if you are looking for rows that fall between 2008 and 2013, as asked in the post ("select all rows between September 2013 and May 2008"), you can use pandas.Series.between:
Dataset borrowed from @jezrael.
DataFrame for demonstration purposes:
>>> stats_month_census_2
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
5 2014 November 6
6 2014 December 7
Using pandas.Series.between() (both endpoints are inclusive by default):
>>> stats_month_census_2[stats_month_census_2['year'].between(2008, 2013)]
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
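A self-contained sketch of the year filter above, for anyone who wants to run it. The frame name demo is made up for illustration, and the astype(int) call is an assumption added to cover the question's case where the year column holds strings such as '2013'; it is harmless if the column is already numeric.

```python
import pandas as pd

# Demo frame copied from the table above; 'year' is stored as strings,
# as described in the question.
demo = pd.DataFrame({
    'year': ['2008', '2008', '2008', '2013', '2013', '2014', '2014'],
    'month': ['April', 'May', 'June', 'September', 'October',
              'November', 'December'],
    'data': [1, 3, 4, 6, 5, 6, 7],
})

# Convert to integers first so the comparison is numeric, not lexical.
# between() includes both endpoints by default.
by_year = demo[demo['year'].astype(int).between(2008, 2013)]
print(by_year)
```

This only narrows the data to whole years; the month boundaries still need one of the datetime-based approaches.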
If the year column is already in datetime format, you can simply try the following:
>>> stats_month_census_2[stats_month_census_2['year'].between('2008-05', '2013-09')]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using DataFrame.query:
>>> stats_month_census_2.query('"2008-05" <= year <= "2013-09"')
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using isin method: Select the rows between two dates
>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05-01', '2013-09-01'))]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Or you can even pass partial date strings, as below:
>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05', '2013-09'))]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
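One detail worth knowing about the isin approach: pd.date_range defaults to daily frequency, so the calls above build every day in the range, and the match succeeds only because each row falls on the first of the month. A minimal sketch, assuming the dates are always month starts, that generates only those with freq='MS' (the names dates and month_starts are illustrative):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2008-04-01', '2008-05-01', '2008-06-01', '2013-09-01', '2013-10-01']
))

# 'MS' means month start: one timestamp per month instead of one per day.
month_starts = pd.date_range('2008-05-01', '2013-09-01', freq='MS')

selected = dates[dates.isin(month_starts)]
print(len(month_starts))
print(selected)
```

If the dates could fall on other days of the month, a range comparison such as between is likely the safer choice.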
Using the loc method, slicing on the index positions of the start and end dates:
Start = stats_month_census_2[stats_month_census_2['year'] =='2008-05'].index[0]
End = stats_month_census_2[stats_month_census_2['year']=='2013-09'].index[0]
>>> stats_month_census_2.loc[Start:End]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
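The loc slice above assumes the rows are already in chronological order and that both endpoint dates exist in the column. A hedged sketch of an order-independent alternative using a boolean mask; the frame below is deliberately shuffled, and the names frame, mask and in_range are illustrative only.

```python
import pandas as pd

# Rows intentionally out of chronological order.
frame = pd.DataFrame({
    'year': pd.to_datetime(['2013-10-01', '2008-06-01', '2008-04-01',
                            '2013-09-01', '2008-05-01']),
    'data': [5, 4, 1, 6, 3],
})

start = pd.Timestamp('2008-05-01')
end = pd.Timestamp('2013-09-01')

# The mask is evaluated per row, so ordering does not matter.
mask = (frame['year'] >= start) & (frame['year'] <= end)
in_range = frame[mask].sort_values('year')
print(in_range)
```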
Note: Since @jezrael asked in a comment, here is how to convert the year column into datetime format.
The example DataFrame below has two separate columns, year and month: the year column holds only years and the month column holds month names as strings. So we first convert the month names into integers, then combine year and month, assigning day 1 to every row, using the pd.to_datetime method.
df
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
5 2014 November 6
6 2014 December 7
Above is the raw DataFrame before the datetime conversion. I'm taking the approach below, which I learned over time via SO itself.
1- First, convert the month names into integers and assign them to a new column called Month, so we can use it for the conversion later.
df['Month'] = pd.to_datetime(df.month, format='%B').dt.month
2- Second, convert the year column into proper datetime format by assigning the result back to the year column itself, which works like an in-place update.
df['year'] = pd.to_datetime(df[['year', 'Month']].assign(Day=1))
Now we have the desired DataFrame, with the year column in datetime form:
print(df)
year month data Month
0 2008-04-01 April 1 4
1 2008-05-01 May 3 5
2 2008-06-01 June 4 6
3 2013-09-01 September 6 9
4 2013-10-01 October 5 10
5 2014-11-01 November 6 11
6 2014-12-01 December 7 12
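The two steps can be sketched end to end as follows. This version assumes English month names and string years (as in the question), and it writes to a separate date column rather than overwriting year; the names parts and date are illustrative choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2008', '2008', '2013', '2013'],
    'month': ['April', 'May', 'September', 'October'],
    'data': [1, 3, 6, 5],
})

# Step 1: month names -> month numbers (%B parses full month names).
month_num = pd.to_datetime(df['month'], format='%B').dt.month

# Step 2: pd.to_datetime can assemble dates from a frame that has
# year, month and day columns.
parts = pd.DataFrame({
    'year': df['year'].astype(int),
    'month': month_num,
    'day': 1,
})
df['date'] = pd.to_datetime(parts)
print(df)
```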
Upvotes: 3
Reputation: 75100
Another solution:
Let's assume the df looks like below:
series name Month Year
0 fertility rate May 2008
1 CO2 emissions June 2009
2 fertility rate September 2013
3 fertility rate October 2013
4 CO2 emissions December 2014
Create a calendar dictionary mapping month abbreviations to numbers and save the result in a new column:
import calendar
d = dict((v,k) for k,v in enumerate(calendar.month_abbr))
stats_month_census_2['month_int'] = stats_month_census_2.Month.apply(lambda x: x[:3]).map(d)
>>stats_month_census_2
series name Month Year month_int
0 fertility rate May 2008 5
1 CO2 emissions June 2009 6
2 fertility rate September 2013 9
3 fertility rate October 2013 10
4 CO2 emissions December 2014 12
Filter using Series.between() (both endpoints are inclusive by default):
stats_month_census_2[stats_month_census_2.month_int.between(5, 9) & stats_month_census_2.Year.between(2008, 2013)]
Output:
series name Month Year month_int
0 fertility rate May 2008 5
1 CO2 emissions June 2009 6
2 fertility rate September 2013 9
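One caveat with filtering month and year separately: the month condition applies to every year, so a row such as January 2010 is dropped even though it lies between May 2008 and September 2013. A minimal sketch of one way around this, combining both columns into a single sortable integer; the frame census and the variable year_month are hypothetical names for illustration.

```python
import pandas as pd

census = pd.DataFrame({
    'Month': ['May', 'January', 'September', 'October', 'April'],
    'Year': [2008, 2010, 2013, 2013, 2008],
    'month_int': [5, 1, 9, 10, 4],
})

# Separate filters: January 2010 is lost because 1 is not within 5..9.
separate = census[census.month_int.between(5, 9)
                  & census.Year.between(2008, 2013)]

# Combined key: May 2008 -> 200805, September 2013 -> 201309.
year_month = census['Year'] * 100 + census['month_int']
combined = census[year_month.between(200805, 201309)]

print(separate)
print(combined)
```

With the sample data in this answer both filters return the same rows, so the difference only shows up once the data contains months outside May to September in the intermediate years.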
Upvotes: 2
Reputation: 863166
You can create a DatetimeIndex and then select by partial string indexing:
stats_month_census_2 = pd.DataFrame({
'year': [2008, 2008, 2008, 2013,2013],
'month': ['April','May','June','September','October'],
'data':[1,3,4,6,5]
})
print (stats_month_census_2)
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
s = stats_month_census_2.pop('year').astype(str) + stats_month_census_2.pop('month')
#if need year and month columns
#s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2.index = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
data
2008-04-01 1
2008-05-01 3
2008-06-01 4
2013-09-01 6
2013-10-01 5
print (stats_month_census_2['2008':'2013'])
data
2008-04-01 1
2008-05-01 3
2008-06-01 4
2013-09-01 6
2013-10-01 5
print (stats_month_census_2['2008-05':'2013-09'])
data
2008-05-01 3
2008-06-01 4
2013-09-01 6
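Partial string slicing like this relies on the DatetimeIndex being sorted. A short sketch, assuming the rows may arrive out of order, that sorts first and then slices with .loc; the frame ts and the name window are illustrative only.

```python
import pandas as pd

ts = pd.DataFrame(
    {'data': [5, 1, 6, 3, 4]},
    index=pd.to_datetime(['2013-10-01', '2008-04-01', '2013-09-01',
                          '2008-05-01', '2008-06-01']),
)

# Sort the index so label-based slicing is well defined.
ts = ts.sort_index()

# The end label '2013-09' covers the whole of September 2013.
window = ts.loc['2008-05':'2013-09']
print(window)
```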
Or create a column and use between with boolean indexing:
s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2['date'] = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
year month data date
0 2008 April 1 2008-04-01
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
3 2013 September 6 2013-09-01
4 2013 October 5 2013-10-01
df = stats_month_census_2[stats_month_census_2['date'].between('2008-05', '2013-09')]
print (df)
year month data date
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
3 2013 September 6 2013-09-01
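Unfortunately, selecting between whole years does not work this way with a datetime column, because '2013' is parsed as 2013-01-01 and every later date in 2013 is excluded. In that case you need pygo's solution with the year column: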
#wrong output
df = stats_month_census_2[stats_month_census_2['date'].between('2008', '2013')]
print (df)
year month data date
0 2008 April 1 2008-04-01
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
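If only the datetime column is available, two possible workarounds for a whole-year range are sketched below: compare the extracted year via the .dt accessor, or spell out the last day of the end year as the upper bound. The variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2008-04-01', '2008-05-01', '2013-09-01', '2013-10-01', '2014-11-01']
))

# Option 1: compare the year component only.
by_year = dates[dates.dt.year.between(2008, 2013)]

# Option 2: explicit bounds that cover the full end year
# (sufficient for date-only values with no time component).
by_bounds = dates[dates.between(pd.Timestamp('2008-01-01'),
                                pd.Timestamp('2013-12-31'))]

print(by_year)
print(by_bounds)
```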
Upvotes: 2
Reputation: 1475
You can easily convert the columns into a single datetime column using pd.to_datetime:
>>df
month year
0 January 2000
1 April 2001
2 July 2002
3 February 2010
4 February 2018
5 March 2014
6 June 2012
7 June 2011
8 May 2009
9 November 2016
>>df['date'] = pd.to_datetime(df['month'].astype(str) + '-' + df['year'].astype(str), format='%B-%Y')
>>df
month year date
0 January 2000 2000-01-01
1 April 2001 2001-04-01
2 July 2002 2002-07-01
3 February 2010 2010-02-01
4 February 2018 2018-02-01
5 March 2014 2014-03-01
6 June 2012 2012-06-01
7 June 2011 2011-06-01
8 May 2009 2009-05-01
9 November 2016 2016-11-01
>>df[(df.date <= "2013-09") & (df.date >= "2008-05") ]
month year date
3 February 2010 2010-02-01
6 June 2012 2012-06-01
7 June 2011 2011-06-01
8 May 2009 2009-05-01
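To get closer to the one-liner asked for in the question, the conversion and the filter can be wrapped in a small function. This is only a sketch: select_month_range is a hypothetical helper name, and it assumes full English month names plus years that can be cast to strings.

```python
import pandas as pd

def select_month_range(frame, start, end, month_col='month', year_col='year'):
    """Hypothetical helper: keep rows whose month/year lies in [start, end]."""
    dates = pd.to_datetime(
        frame[month_col].astype(str) + '-' + frame[year_col].astype(str),
        format='%B-%Y',
    )
    # Both bounds are inclusive; the original frame is not modified.
    return frame[dates.between(pd.Timestamp(start), pd.Timestamp(end))]

sample = pd.DataFrame({
    'month': ['January', 'February', 'June', 'May', 'November'],
    'year': ['2000', '2010', '2012', '2009', '2016'],
})

result = select_month_range(sample, '2008-05-01', '2013-09-01')
print(result)
```

Because the dates are computed on the fly, no extra column is added to the caller's DataFrame.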
Upvotes: 3