Reputation: 6652
I am seeing strange behavior when counting the elements in a column of sets with the pd.Series.str.len() method:
x = pd.DataFrame({'t': ['', 'A', 'A B', 'A B C']})
x['s'] = x.t.str.split(' ').map(set)
x['s_len'] = x.s.str.len()
x['s_reduced'] = x.s - {'A'}
x['s_reduced_len'] = x.s_reduced.str.len()
print(x)
       t          s  s_len s_reduced  s_reduced_len
0                {}      1        {}              1
1      A        {A}      1        {}              0
2    A B     {B, A}      2       {B}              1
3  A B C  {C, B, A}      3    {C, B}              2
Both cells print as {}, so why is the value of x.loc[0, 's_len'] 1 while the value of x.loc[1, 's_reduced_len'] is 0?
Is this a bug that I should report, or is it odd but intended behavior?
The version of pandas is 0.20.3.
Upvotes: 0
Views: 162
Reputation: 30605
You can see the answer if you print the underlying values, i.e.
x.s_reduced.values
array([{''}, set(), {'B'}, {'C', 'B'}], dtype=object)
The first cell is actually not empty: it holds '' (the empty string), because ''.split(' ') returns [''] rather than an empty list. After the subtraction, the second cell becomes a truly empty set. Hence the difference in lengths.
len({''})
1
len(set())
0
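If the goal is for an empty string to produce an empty set, one option is to drop empty tokens before building each set. This is only a minimal sketch, not something from the question: the set comprehension and the use of map(len) in place of .str.len() are my own choices, and the column names simply follow the question.

```python
import pandas as pd

x = pd.DataFrame({'t': ['', 'A', 'A B', 'A B C']})

# Splitting an empty string on ' ' yields one empty token, not zero tokens.
print(''.split(' '))

# Assumed fix: keep only non-empty tokens when building each set.
x['s'] = x.t.map(lambda text: {tok for tok in text.split(' ') if tok})
x['s_len'] = x.s.map(len)

# Remove 'A' from every set element-wise, then count what is left.
x['s_reduced'] = x.s.map(lambda s: s - {'A'})
x['s_reduced_len'] = x.s_reduced.map(len)

print(x[['t', 's_len', 's_reduced_len']])
```

With this change, row 0 holds set() in both columns, so both of its lengths come out as 0.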
Upvotes: 3