Reputation: 93
As requested, here is a minimal reproducible example that generates the issue: the values not passed to .isin() are not dropped from the result's value_counts(); they just show up with a count of zero:
import pandas as pd
df_example = pd.DataFrame({'Requesting as': {0: 'Employee', 1: 'Ex- Employee', 2: 'Employee', 3: 'Employee', 4: 'Ex-Employee', 5: 'Employee', 6: 'Employee', 7: 'Employee', 8: 'Ex-Employee', 9: 'Ex-Employee', 10: 'Employee', 11: 'Employee', 12: 'Ex-Employee', 13: 'Ex-Employee', 14: 'Employee', 15: 'Employee', 16: 'Employee', 17: 'Ex-Employee', 18: 'Employee', 19: 'Employee', 20: 'Ex-Employee', 21: 'Employee', 22: 'Employee', 23: 'Ex-Employee', 24: 'Employee', 25: 'Employee', 26: 'Ex-Employee', 27: 'Employee', 28: 'Employee', 29: 'Ex-Employee', 30: 'Employee', 31: 'Employee', 32: 'Ex-Employee', 33: 'Employee', 34: 'Employee', 35: 'Ex-Employee', 36: 'Employee', 37: 'Employee', 38: 'Ex-Employee', 39: 'Employee', 40: 'Employee'}, 'Years of service': {0: -0.4, 1: -0.3, 2: -0.2, 3: 1.0, 4: 1.0, 5: 1.0, 6: 2.0, 7: 2.0, 8: 2.0, 9: 2.0, 10: 3.0, 11: 3.0, 12: 3.0, 13: 4.0, 14: 4.0, 15: 4.0, 16: 5.0, 17: 5.0, 18: 5.0, 19: 5.0, 20: 6.0, 21: 6.0, 22: 6.0, 23: 11.0, 24: 11.0, 25: 11.0, 26: 16.0, 27: 17.0, 28: 18.0, 29: 21.0, 30: 22.0, 31: 23.0, 32: 26.0, 33: 27.0, 34: 28.0, 35: 31.0, 36: 32.0, 37: 33.0, 38: 35.0, 39: 36.0, 40: 37.0}, 'yos_bins': {0: 0, 1: 0, 2: 0, 3: '0-1', 4: '0-1', 5: '0-1', 6: '1-2', 7: '1-2', 8: '1-2', 9: '1-2', 10: '2-3', 11: '2-3', 12: '2-3', 13: '3-4', 14: '3-4', 15: '3-4', 16: '4-5', 17: '4-5', 18: '4-5', 19: '4-5', 20: '5-6', 21: '5-6', 22: '5-6', 23: '10-15', 24: '10-15', 25: '10-15', 26: '15-20', 27: '15-20', 28: '15-20', 29: '20-40', 30: '20-40', 31: '20-40', 32: '20-40', 33: '20-40', 34: '20-40', 35: '20-40', 36: '20-40', 37: '20-40', 38: '20-40', 39: '20-40', 40: '20-40'}})
cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df_example['yos_bins'] = pd.cut(df_example['Years of service'], bins=cut_bins, labels=cut_labels)
print(df_example['yos_bins'].value_counts())
print(len(df_example['yos_bins']))
print(len(df_example))
test = df_example[df_example['yos_bins'].isin(['0-1', '1-2', '2-3'])]
print('test dataframe:\n',test)
print('\n')
print('test value counts of yos_bins:\n', test['yos_bins'].value_counts())
print('\n')
dic_test = test.to_dict()
print(dic_test)
print('\n')
print(test.value_counts())
I have created bins for the "Years of service" column:
cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df['yos_bins'] = pd.cut(df['Years of service'], bins=cut_bins, labels=cut_labels)
Then I applied .isin() to the 'yos_bins' column (i.e. the binned Years of service), intending to filter for a selection of its values. I want to select only 3 ranges (0-1, 1-2, 2-3 years), but apparently more ranges remain attached to the column.
To my surprise, when I apply value_counts() to the filtered column, I still get all bin values of the yos_bins column from the df dataframe (but with counts of 0).
test.yos_bins.value_counts()
The output lists every bin: the three passed to .isin() with their actual counts, and all the others with a count of 0. This was not intended; all bins except the 3 in .isin() should have been dropped. The resulting issue is that the zero-count bins also show up in sns.countplot, so I end up with undesired columns with zero counts.
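For reference, the countplot call looks roughly like this (a sketch using df_example from the snippet at the top; assumes seaborn and matplotlib are installed):
import seaborn as sns
import matplotlib.pyplot as plt

# Filter as above, then plot: the bins outside the .isin() selection
# still show up as zero-height bars.
test = df_example[df_example['yos_bins'].isin(['0-1', '1-2', '2-3'])]
sns.countplot(x='yos_bins', data=test)
plt.show()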
When I save the df with to_excel(), all "10-15" cells show a "Text Date with 2-Digit Year" warning in Excel. I do not load that file back into Python, so I am not sure whether this could be related to the problem.
Does anybody know how I can create the test dataframe so that it contains only the 3 selected yos_bins values, instead of showing all yos_bins values, some of them with zeros?
Upvotes: 2
Views: 266
Reputation: 170
An admittedly ugly solution, because numpy and pandas are awkward when it comes to this kind of element-wise "is in"; in my experience it is easiest to do the comparison manually with numpy arrays.
import numpy as np

yos_bins = np.array(df["yos_bins"])              # column values, shape (n,)
yos_bins_sel = np.array(["0-1", "1-2", "2-3"])   # values to keep, shape (m,)
mask = (yos_bins[:, None] == yos_bins_sel[None, :]).any(1)  # boolean mask, shape (n,)
df[mask]
Requesting as Years of service yos_bins
3 Employee 1.0 0-1
4 Ex-Employee 1.0 0-1
5 Employee 1.0 0-1
6 Employee 2.0 1-2
7 Employee 2.0 1-2
8 Ex-Employee 2.0 1-2
9 Ex-Employee 2.0 1-2
10 Employee 3.0 2-3
11 Employee 3.0 2-3
12 Ex-Employee 3.0 2-3
Explanation (using x as yos_bins and y as yos_bins_sel):
(x[:, None] == y[None, :]).any(1)
is the main takeaway. x[:, None] converts x from shape (n,) to (n, 1), and y[None, :] converts y from shape (m,) to (1, m). Comparing them with == broadcasts to an element-wise boolean array of shape (n, m). We want an (n,)-shaped array, so we apply .any(1), which compresses the second dimension to True if at least one of its booleans is True (i.e. if the element is in the yos_bins_sel array). You end up with a boolean array which can be used to mask the original DataFrame. Replace x with the array containing the values to be compared and y with the array that the values of x should be contained in, and you can do this for any data set.
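To make the shapes concrete, here is a small self-contained sketch of the same trick on toy arrays (the values are illustrative; x and y are just the placeholders used above):
import numpy as np

x = np.array(["0-1", "3-4", "1-2", "20-40", "2-3"])   # values to test, shape (5,)
y = np.array(["0-1", "1-2", "2-3"])                   # allowed values, shape (3,)

# (5, 1) == (1, 3) broadcasts to a (5, 3) boolean matrix;
# .any(1) collapses each row to "does this element match any allowed value?"
mask = (x[:, None] == y[None, :]).any(1)
print(mask)      # [ True False  True False  True]
print(x[mask])   # ['0-1' '1-2' '2-3']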
Upvotes: 1