Reputation: 777

How do I perform ordered selection on multiple Columns by Value

I have a dataframe that includes a month column and a year column. Both contain strings, e.g. 'September' and '2013'. How do I select all rows between September 2013 and May 2008 in one line?

df1 = stats_month_census_2[(stats_month_census_2['year'] <= '2013')
                 & (stats_month_census_2['year'] >= '2008')]

df2 = df1[...]

After the code above, I was going to filter again, but I am having a hard time coming up with clean code that simply drops the rows later than September 2013 (October to December) and earlier than May 2008. I could hard-code this easily, but there must be a more Pythonic way of doing it.

Upvotes: 4

Answers (4)

Karn Kumar

Reputation: 8826

Alternatively, if you are looking for the rows that fall between 2008 and 2013, as you asked in the post ("select all rows between September 2013 and May 2008"), you can use pandas.Series.between:

Dataset borrowed from @jezrael..

DataFrame for Demonstration purpose:

>>> stats_month_census_2
   year      month  data
0  2008      April     1
1  2008        May     3
2  2008       June     4
3  2013  September     6
4  2013    October     5
5  2014   November     6
6  2014   December     7

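Using pandas.Series.between() (both bounds are included by default; newer pandas releases expect a string such as inclusive='both' rather than True, so the argument is omitted here):

>>> stats_month_census_2[stats_month_census_2['year'].between(2008, 2013)]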
   year      month  data
0  2008      April     1
1  2008        May     3
2  2008       June     4
3  2013  September     6
4  2013    October     5

If the year column is already in datetime format, you can simply try the following:

>>> stats_month_census_2[stats_month_census_2['year'].between('2008-05', '2013-09')]
        year      month  data
1 2008-05-01        May     3
2 2008-06-01       June     4
3 2013-09-01  September     6

Using DataFrame.query:

>>> stats_month_census_2.query('"2008-05" <= year <= "2013-09"')
        year      month  data
1 2008-05-01        May     3
2 2008-06-01       June     4
3 2013-09-01  September     6

Using isin method: Select the rows between two dates

>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05-01', '2013-09-01'))]
        year      month  data
1 2008-05-01        May     3
2 2008-06-01       June     4
3 2013-09-01  September     6

Or you can even pass partial dates, like below:

>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05', '2013-09'))]
        year      month  data
1 2008-05-01        May     3
2 2008-06-01       June     4
3 2013-09-01  September     6

Using the loc method, slicing by the index labels of the rows that hold the start and end dates:

Start = stats_month_census_2[stats_month_census_2['year'] =='2008-05'].index[0]
End = stats_month_census_2[stats_month_census_2['year']=='2013-09'].index[0]

>>> stats_month_census_2.loc[Start:End]
        year      month  data
1 2008-05-01        May     3
2 2008-06-01       June     4
3 2013-09-01  September     6

Note: since @jezrael asked in a comment, I'm adding how to convert the year column into datetime format.

The example DataFrame below has two separate columns, year and month: the year column holds only years and the month column holds month names as strings. So we first convert the month names to integers, then combine year and month, assigning day 1 to every row, using the pd.to_datetime method.

df
   year      month  data
0  2008      April     1
1  2008        May     3
2  2008       June     4
3  2013  September     6
4  2013    October     5
5  2014   November     6
6  2014   December     7

Above is the raw DataFrame before the datetime conversion. I'm taking the approach below, which I learned over time via SO itself.

1- First, convert the month names into integers and assign them to a new column called Month, so that we can use it for the conversion later.

df['Month'] = pd.to_datetime(df.month, format='%B').dt.month

2- Then convert the year column into a proper datetime format by assigning the result back to the year column itself, which works like an in-place conversion.

df['year'] = pd.to_datetime(df[['year', 'Month']].assign(Day=1))

Now the year column of the resulting DataFrame is in datetime format:

print(df)
        year      month  data  Month
0 2008-04-01      April     1      4
1 2008-05-01        May     3      5
2 2008-06-01       June     4      6
3 2013-09-01  September     6      9
4 2013-10-01    October     5     10
5 2014-11-01   November     6     11
6 2014-12-01   December     7     12
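The two steps above can be put together with the final filter in one runnable sketch. It is only an illustration under a few assumptions: the sample rows are retyped from the demonstration frame, the combined value is stored in a new column named date (a hypothetical name, chosen so the original year column is left untouched), and the bounds are passed as explicit pd.Timestamp objects.

```python
import pandas as pd

# Sample rows retyped from the demonstration DataFrame above.
df = pd.DataFrame({
    'year': [2008, 2008, 2008, 2013, 2013, 2014, 2014],
    'month': ['April', 'May', 'June', 'September',
              'October', 'November', 'December'],
    'data': [1, 3, 4, 6, 5, 6, 7],
})

# Step 1: month names -> month numbers (April -> 4, May -> 5, ...).
df['Month'] = pd.to_datetime(df['month'], format='%B').dt.month

# Step 2: assemble year, month and a fixed day of 1 into one datetime column.
# 'date' is a hypothetical column name used only in this sketch.
parts = pd.DataFrame({'year': df['year'], 'month': df['Month'], 'day': 1})
df['date'] = pd.to_datetime(parts)

# Keep rows from May 2008 through September 2013; between() includes both ends.
start = pd.Timestamp('2008-05-01')
end = pd.Timestamp('2013-09-01')
selected = df[df['date'].between(start, end)]

print(selected[['year', 'month', 'data']])
```

Because every row is pinned to the first day of its month, comparing against the first day of the boundary months is enough to include May 2008 and September 2013 themselves.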

Upvotes: 3

anky

Reputation: 75100

Another solution:

Let's assume the df looks like this:

    series      name       Month     Year
0   fertility   rate       May       2008
1   CO2         emissions  June      2009
2   fertility   rate       September 2013
3   fertility   rate       October   2013
4   CO2         emissions  December  2014

Create a calendar dictionary mapping and save the result in a new column:

import calendar
d = dict((v,k) for k,v in enumerate(calendar.month_abbr))
stats_month_census_2['month_int'] = stats_month_census_2.Month.apply(lambda x: x[:3]).map(d)

>>stats_month_census_2

    series      name      Month     Year    month_int
0   fertility   rate      May       2008    5
1   CO2         emissions June      2009    6
2   fertility   rate      September 2013    9
3   fertility   rate      October   2013    10
4   CO2         emissions December  2014    12

Filter using series.between() (both bounds are included by default, so no inclusive argument is needed):

stats_month_census_2[stats_month_census_2.month_int.between(5,9) & stats_month_census_2.Year.between(2008,2013)]

Output:

    series      name      Month      Year   month_int
0   fertility   rate      May        2008   5
1   CO2         emissions June       2009   6
2   fertility   rate      September  2013   9

Upvotes: 2

jezrael

Reputation: 863166

You can create DatetimeIndex and then select by partial string indexing:

stats_month_census_2 = pd.DataFrame({
    'year': [2008, 2008, 2008, 2013,2013],
    'month': ['April','May','June','September','October'],
    'data':[1,3,4,6,5]
})
print (stats_month_census_2)
   year      month  data
0  2008      April     1
1  2008        May     3
2  2008       June     4
3  2013  September     6
4  2013    October     5

s = stats_month_census_2.pop('year').astype(str) + stats_month_census_2.pop('month')
#if need year and month columns
#s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2.index = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
            data
2008-04-01     1
2008-05-01     3
2008-06-01     4
2013-09-01     6
2013-10-01     5

print (stats_month_census_2['2008':'2013'])
            data
2008-04-01     1
2008-05-01     3
2008-06-01     4
2013-09-01     6
2013-10-01     5    

print (stats_month_census_2['2008-05':'2013-09'])
            data
2008-05-01     3
2008-06-01     4
2013-09-01     6
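Slicing by partial date strings relies on the index being sorted by date. The sketch below assumes deliberately shuffled sample rows and uses .loc together with sort_index(); the variable names stamp and indexed are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows, intentionally out of chronological order.
df = pd.DataFrame({
    'year': [2013, 2008, 2013, 2008, 2008],
    'month': ['October', 'May', 'September', 'April', 'June'],
    'data': [5, 3, 6, 1, 4],
})

# Build one timestamp per row from the year number and the month name.
stamp = pd.to_datetime(df['year'].astype(str) + '-' + df['month'],
                       format='%Y-%B')

# Use the timestamps as the index and sort it so label slicing is well defined.
indexed = df.set_index(stamp).sort_index()

# Partial-string slice: both end months are included in full.
window = indexed.loc['2008-05':'2013-09']
print(window)
```

Using .loc makes it explicit that the slice selects rows by index label rather than by column name.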

Or create column and use between with boolean indexing:

s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2['date'] = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
   year      month  data       date
0  2008      April     1 2008-04-01
1  2008        May     3 2008-05-01
2  2008       June     4 2008-06-01
3  2013  September     6 2013-09-01
4  2013    October     5 2013-10-01

df = stats_month_census_2[stats_month_census_2['date'].between('2008-05', '2013-09')]
print (df)
   year      month  data       date
1  2008        May     3 2008-05-01
2  2008       June     4 2008-06-01
3  2013  September     6 2013-09-01

Unfortunately, with a datetime column this approach cannot select between whole years: '2013' is parsed as 2013-01-01, so every later row in 2013 is excluded. In that case you need pygo's solution with the year column:

#wrong output
df = stats_month_census_2[stats_month_census_2['date'].between('2008', '2013')]
print (df)

   year  month  data       date
0  2008  April     1 2008-04-01
1  2008    May     3 2008-05-01
2  2008   June     4 2008-06-01
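If the original year column is no longer available, one possible workaround is to compare the year component of the datetime column instead of the full timestamps. This is a sketch on assumed sample data, not something taken from the answers here; by_year is a hypothetical name.

```python
import pandas as pd

# Hypothetical sample data with an already converted datetime column.
df = pd.DataFrame({
    'date': pd.to_datetime(['2008-04-01', '2008-05-01', '2008-06-01',
                            '2013-09-01', '2013-10-01', '2014-11-01']),
    'data': [1, 3, 4, 6, 5, 6],
})

# .dt.year extracts the calendar year as an integer, so between() compares
# whole years and keeps every row of 2013, not just 1 January 2013.
by_year = df[df['date'].dt.year.between(2008, 2013)]
print(by_year)
```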

Upvotes: 2

Ananay Mital

Reputation: 1475

You can easily convert the columns into a DateTime column using pd.to_datetime

>>df    
    month       year
0   January     2000
1   April       2001
2   July        2002
3   February    2010
4   February    2018
5   March       2014
6   June        2012
7   June        2011
8   May         2009
9   November    2016


>>df['date'] = pd.to_datetime(df['month'].astype(str) + '-' + df['year'].astype(str), format='%B-%Y')
>>df

        month   year    date
0   January     2000    2000-01-01
1   April       2001    2001-04-01
2   July        2002    2002-07-01
3   February    2010    2010-02-01
4   February    2018    2018-02-01
5   March       2014    2014-03-01
6   June        2012    2012-06-01
7   June        2011    2011-06-01
8   May         2009    2009-05-01
9   November    2016    2016-11-01

>>df[(df.date <= "2013-09") & (df.date >= "2008-05") ]
       month    year    date
3   February    2010    2010-02-01
6   June        2012    2012-06-01
7   June        2011    2011-06-01
8   May         2009    2009-05-01

Upvotes: 3
