user4679594

Python - Subset a dataset using two column criteria

I tried to subset a dataset on two column criteria, but my code raises an error. Any idea why? This is my code:

df[(df['locations'] = 'New York City Metro Area') & (2016-09-01 < df['publication_date'] < 2016-09-30 )]

This is my error:

    f = lambda x, y: lib.ismember(x, set(values))
  File "pandas\lib.pyx", line 158, in pandas.lib.ismember (pandas\lib.c:5199)
TypeError: unhashable type: 'list'

If it helps, my data looks something like this:

df['publication_date'].head()

0    2017-01-30T04:48:11.929095Z
1           2016-11-15T05:30:03Z
2    2017-01-30T04:45:24.861067Z
3    2017-01-30T04:47:41.419255Z
4    2017-01-30T04:49:36.192148Z
Name: publication_date, dtype: object

df['locations'].head()

0      [{'name': 'Kansas City, MO'}]
1         [{'name': 'Evanston, IL'}]
2         [{'name': 'Stamford, CT'}]
3             [{'name': 'Reno, NV'}]
4    [{'name': 'Boston Metro Area'}]
Name: locations, dtype: object

Answers (1)

jezrael

Reputation: 862701

I think you can first extract the value of the 'name' key from each dict, then convert the date strings with to_datetime. Finally, use boolean indexing with between:

import pandas as pd

df = pd.DataFrame({'locations':[[{'name': 'Kansas City, MO'}], [{'name': 'Evanston, IL'}], [{'name': 'Stamford, CT'}],[{'name': 'Reno, NV'}],[{'name': 'Boston Metro Area'}]],
                   'publication_date':['2017-01-30T04:48:11.929095Z','2016-11-15T05:30:03Z','2017-01-30T04:45:24.861067Z','2017-01-30T04:47:41.419255Z','2017-01-30T04:49:36.192148Z']})
print (df)
                         locations             publication_date
0    [{'name': 'Kansas City, MO'}]  2017-01-30T04:48:11.929095Z
1       [{'name': 'Evanston, IL'}]         2016-11-15T05:30:03Z
2       [{'name': 'Stamford, CT'}]  2017-01-30T04:45:24.861067Z
3           [{'name': 'Reno, NV'}]  2017-01-30T04:47:41.419255Z
4  [{'name': 'Boston Metro Area'}]  2017-01-30T04:49:36.192148Z

print (type(df.locations.iloc[0]))
<class 'list'>


df.locations = df.locations.apply(lambda x: x[0]['name'])
df.publication_date = pd.to_datetime(df.publication_date)
print (df)
           locations           publication_date
0    Kansas City, MO 2017-01-30 04:48:11.929095
1       Evanston, IL 2016-11-15 05:30:03.000000
2       Stamford, CT 2017-01-30 04:45:24.861067
3           Reno, NV 2017-01-30 04:47:41.419255
4  Boston Metro Area 2017-01-30 04:49:36.192148

print (df[(df['locations'] == 'Boston Metro Area')  & 
          (df['publication_date'].between('2016-09-01', '2018-09-30'))])
           locations           publication_date
4  Boston Metro Area 2017-01-30 04:49:36.192148
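
As a side note on why between is used here instead of the chained comparison from the question: Python evaluates a < s < b as (a < s) and (s < b), which asks a whole Series for a single True/False and raises a ValueError. A minimal sketch of the difference, assuming made-up sample dates and explicit UTC Timestamp bounds (neither comes from the question's data):

```python
import pandas as pd

# Hypothetical sample dates, all in the same ISO 8601 format.
dates = pd.Series(pd.to_datetime([
    '2016-09-15T05:30:03Z',
    '2016-11-15T05:30:03Z',
    '2017-01-30T04:45:24Z',
], utc=True))

# Explicit timezone-aware bounds (an assumption made for this sketch).
start = pd.Timestamp('2016-09-01', tz='UTC')
end = pd.Timestamp('2016-09-30', tz='UTC')

# A chained comparison needs one True/False for the whole Series, so it raises.
try:
    start <= dates <= end
    chained_ok = True
except ValueError:
    chained_ok = False

# Two element-wise comparisons joined with & give a boolean mask instead.
explicit_mask = (dates >= start) & (dates <= end)

# between() is inclusive on both ends by default, so it builds the same mask.
between_mask = dates.between(start, end)

print(chained_ok)
print(explicit_mask.tolist())
print(between_mask.tolist())
```

Either mask can then be combined with the locations condition using &, as in the code above.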

Solution with query:

print (df.query('locations ==  "Boston Metro Area" and  "2016-09-01" < publication_date < "2018-09-30"'))
           locations           publication_date
4  Boston Metro Area 2017-01-30 04:49:36.192148

If you don't need to change the structure of the values in the locations column:

df.publication_date = pd.to_datetime(df.publication_date)
print (df)
                         locations           publication_date
0    [{'name': 'Kansas City, MO'}] 2017-01-30 04:48:11.929095
1       [{'name': 'Evanston, IL'}] 2016-11-15 05:30:03.000000
2       [{'name': 'Stamford, CT'}] 2017-01-30 04:45:24.861067
3           [{'name': 'Reno, NV'}] 2017-01-30 04:47:41.419255
4  [{'name': 'Boston Metro Area'}] 2017-01-30 04:49:36.192148

print (df[(df['locations'].apply(lambda x: x[0]['name']) == 'Boston Metro Area')  & 
          (df['publication_date'].between('2016-09-01', '2018-09-30'))])

                         locations           publication_date
4  [{'name': 'Boston Metro Area'}] 2017-01-30 04:49:36.192148
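
The lambda above only looks at the first dict in each list. If a row can hold several locations (or an empty list), a variation that checks every entry might look like the sketch below. The sample rows and the has_location helper are hypothetical and only serve as illustration:

```python
import pandas as pd

# Hypothetical rows: some with several locations, one with none.
df = pd.DataFrame({
    'locations': [
        [{'name': 'Stamford, CT'}, {'name': 'Boston Metro Area'}],
        [{'name': 'Reno, NV'}],
        [],
        [{'name': 'Boston Metro Area'}],
    ],
    'publication_date': pd.to_datetime([
        '2017-01-30T04:45:24Z',
        '2017-01-30T04:47:41Z',
        '2016-11-15T05:30:03Z',
        '2015-01-30T04:49:36Z',
    ], utc=True),
})


def has_location(entries, target):
    """Return True if any dict in the list carries the target name."""
    return any(entry.get('name') == target for entry in entries)


# First criterion: at least one location in the list matches.
location_mask = df['locations'].apply(
    lambda entries: has_location(entries, 'Boston Metro Area'))

# Second criterion: date inside the (inclusive) range.
start = pd.Timestamp('2016-09-01', tz='UTC')
end = pd.Timestamp('2018-09-30', tz='UTC')
date_mask = df['publication_date'].between(start, end)

result = df[location_mask & date_mask]
print(result)
```

The locations column keeps its original list-of-dicts structure; only the mask changes.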
