KatieRose1029
KatieRose1029

Reputation: 185

Filtering Values in Pandas Dataframe or Series

I am trying to filter values from a column in a pandas Dataframe but I seem to be receiving booleans instead of actual values. I am trying to filter our data by month and year. In the code following below you will see I am only filtering out by year but I have tried month and year multiple times in different ways:

    In [1]: import requests

    In [2]: import pandas as pd # pandas

    In [3]: import datetime as dt # module for manipulating dates and times

    In [4]: url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"

    In [5]: source = requests.get(url).text

    In [6]: from io import StringIO, BytesIO

    In [7]: s = StringIO(source)

    In [8]: election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(convert_dates="coerce", convert_numeric=True)


    In [9]: election_data.head(n=3)
    Out[9]:
                Pollster Start Date   End Date Entry Date/Time (ET)  \
    0  Politico/GWU/Battleground 2012-11-04 2012-11-05  2012-11-06 08:40:26
    1           YouGov/Economist 2012-11-03 2012-11-05  2012-11-26 15:31:23
    2           Gravis Marketing 2012-11-03 2012-11-05  2012-11-06 09:22:02

       Number of Observations     Population             Mode  Obama  Romney  \
    0                  1000.0  Likely Voters       Live Phone   47.0    47.0
    1                   740.0  Likely Voters         Internet   49.0    47.0
    2                   872.0  Likely Voters  Automated Phone   48.0    48.0

       Undecided  Other                                       Pollster URL  \
    0        6.0    NaN  http://elections.huffingtonpost.com/pollster/p...
    1        3.0    NaN  http://elections.huffingtonpost.com/pollster/p...
    2        4.0    NaN  http://elections.huffingtonpost.com/pollster/p...

                                              Source URL     Partisan Affiliation  \
    0  http://www.politico.com/news/stories/1112/8338...  Nonpartisan        None
    1  http://cdn.yougov.com/cumulus_uploads/document...  Nonpartisan        None
    2  http://www.gravispolls.com/2012/11/gravis-mark...  Nonpartisan        None

       Question Text  Question Iteration
    0            NaN                   1
    1            NaN                   1
    2            NaN                   1

    In [10]: start_date = pd.Series(election_data["Start Date"])
        ...: start_date.head(n=3)
        ...:
    Out[10]:
    0   2012-11-04
    1   2012-11-03
    2   2012-11-03
    Name: Start Date, dtype: datetime64[ns]

    In [11]: filtered = start_date.map(lambda x: x.year == 2012)

    In [12]: filtered
    Out[12]:
    0       True
    1       True
    2       True
    ...
    587    False
    588    False
    589    False
    Name: Start Date, dtype: bool

Upvotes: 1

Views: 362

Answers (2)

jezrael
jezrael

Reputation: 862661

I think you need read_csv with url address first and then boolean indexing with mask created by year and month:

election_data = pd.read_csv('http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv', parse_dates=[1,2,3])

print (election_data.head(3))
                    Pollster Start Date   End Date Entry Date/Time (ET)  \
0  Politico/GWU/Battleground 2012-11-04 2012-11-05  2012-11-06 08:40:26   
1           YouGov/Economist 2012-11-03 2012-11-05  2012-11-26 15:31:23   
2           Gravis Marketing 2012-11-03 2012-11-05  2012-11-06 09:22:02   

   Number of Observations     Population             Mode  Obama  Romney  \
0                  1000.0  Likely Voters       Live Phone   47.0    47.0   
1                   740.0  Likely Voters         Internet   49.0    47.0   
2                   872.0  Likely Voters  Automated Phone   48.0    48.0   

   Undecided  Other                                       Pollster URL  \
0        6.0    NaN  http://elections.huffingtonpost.com/pollster/p...   
1        3.0    NaN  http://elections.huffingtonpost.com/pollster/p...   
2        4.0    NaN  http://elections.huffingtonpost.com/pollster/p...   

                                          Source URL     Partisan Affiliation  \
0  http://www.politico.com/news/stories/1112/8338...  Nonpartisan        None   
1  http://cdn.yougov.com/cumulus_uploads/document...  Nonpartisan        None   
2  http://www.gravispolls.com/2012/11/gravis-mark...  Nonpartisan        None   

   Question Text  Question Iteration  
0            NaN                   1  
1            NaN                   1  
2            NaN                   1  
print (election_data.dtypes)
Pollster                          object
Start Date                datetime64[ns]
End Date                  datetime64[ns]
Entry Date/Time (ET)      datetime64[ns]
Number of Observations           float64
Population                        object
Mode                              object
Obama                            float64
Romney                           float64
Undecided                        float64
Other                            float64
Pollster URL                      object
Source URL                        object
Partisan                          object
Affiliation                       object
Question Text                    float64
Question Iteration                 int64
dtype: object


election_data[election_data["Start Date"].dt.year == 2012]

election_data[(election_data["Start Date"].dt.year == 2012) & (election_data["Start Date"].dt.month== 10)]

Upvotes: 3

piRSquared
piRSquared

Reputation: 294258

you can use pandas date filtering if you make Start Date the index

get all of 2012
election_data.set_index('Start Date')['2012']

get all of Jan, 2012
election_data.set_index('Start Date')['2012-01']

get all between Jan 1, 2012 and Jan 13, 2012
election_data.set_index('Start Date')['2012-01-01':'2012-01-13]

enter image description here

Upvotes: 3

Related Questions