Reputation: 3
I build a Dataframe from an sql query (data for may 2014 to june 2015) and try to build 2 different data sets - train uses all data except June 2015 - test uses only data for June 2015 When I try to use : train= df[(df.month!=6) & (df.year!=2015)]
It seems that I am using OR instead of AND, because I do not get any values for month=6 (also not for 2014) and no values for year=2015 at all, so also not for other months in 2015.
I do not understand what is wrong with the code.
mycurs.execute("""SELECT day, month, year, cloudcover, moonphase, precipintensity,
precipaccumulation, preciptype2, humidity, pressure, windspeed,
uvindex, visibility, temperaturehigh, weekend_2, crimecount
FROM open.nyc_weatherhist_2""")
f=mycurs.fetchall() #returns tuple with first row (unordered list)
df=pd.DataFrame(f, columns=feature_names)
print(df)
day month year ... temperaturehigh weekend crimecount
0 28 4 2015 ... 20.85 0 56
1 14 4 2015 ... 18.25 0 103
2 13 7 2014 ... 27.44 0 89
3 4 1 2015 ... 12.94 0 99
4 21 9 2014 ... 24.15 0 66
.. ... ... ... ... ... ... ...
390 4 7 2014 ... 23.37 1 84
391 8 8 2014 ... 27.98 1 97
392 26 4 2015 ... 15.78 0 82
393 3 8 2014 ... 24.50 0 80
394 5 6 2015 ... 20.65 1 87
[395 rows x 16 columns]
train= df[(df.month!=6) & (df.year!=2015)]
print(train)
day month year ... temperaturehigh weekend crimecount
2 13 7 2014 ... 27.44 0 89
4 21 9 2014 ... 24.15 0 66
8 10 11 2014 ... 16.27 0 76
9 5 11 2014 ... 17.76 0 101
11 10 7 2014 ... 28.06 0 99
.. ... ... ... ... ... ... ...
382 10 8 2014 ... 30.51 0 119
389 21 11 2014 ... 2.65 1 110
390 4 7 2014 ... 23.37 1 84
391 8 8 2014 ... 27.98 1 97
393 3 8 2014 ... 24.50 0 80
[184 rows x 16 columns]
Upvotes: 0
Views: 56
Reputation: 492
You need to use | for OR as suspected.
>>> df = pd.DataFrame({'year':[2014,2015,2015],'month':[6,5,6]})
>>> df
year month
0 2014 6
1 2015 5
2 2015 6
>>> train = df[(df['year']!=2015) | (df['month']!=6)]
>>> train
year month
0 2014 6
1 2015 5
Upvotes: 1
Reputation: 1246
Just elaborating on Francis answer, you're looking at the condition in the wrong manner.
You want all the data except for a certain month in a certain year. So, when would the value be "right"? If either of them is not in the month or year.
So, this is exactly the OR
condition.
Another way to look at it - you have conditions X
and Y
.
You're "wrong" when both X
and Y
exist, or simply X & Y
.
So, when are you "right"? When !(X & Y)
== !X | !Y
.
It's your choice how you tackle this, but you can do something like:
train = df[(df.month != 6) | (df.year != 2015)]
or
train = df[~((df.month == 6) & (df.year == 2015))]
Which are equivalent
Upvotes: 2