Reputation: 507
I have a df with a column df.open. I want to check this column for strings. If there's a string, I'd like to convert it to a 1. There are already a lot of 1s and 0s in the column.
So, suppose the values in the column are as follows: <0,0,1,0,text,1,open,0,0,xyz,1>. I'd like to go through the column and turn all strings (i.e. 'text', 'open', 'xyz') into 1s.
One thing I was thinking was to convert to numeric. Then, wherever there's a NaN, convert that to a 1. But that seems silly...
Upvotes: 0
Views: 85
Reputation: 35776
In short, first create a mask and then use this mask for indexing to change values in place:
boolmask_nonumbers = np.isnan(df.open.convert_objects(convert_numeric=True))
df.open[boolmask_nonumbers] = 1
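Note: convert_objects has since been deprecated and removed from pandas. On a recent version the same mask can be built with pd.to_numeric, and .loc avoids chained-assignment warnings. A minimal sketch, assuming df has an 'open' column:
import pandas as pd

# coerce non-numeric entries to NaN, then mark them
mask_nonnumbers = pd.to_numeric(df['open'], errors='coerce').isna()
df.loc[mask_nonnumbers, 'open'] = 1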
Longer explanation:
The test object, a Series (which essentially is a column in a DataFrame):
>>> s = pd.Series([1, 0, "foo", 1])
>>> s
0 1
1 0
2 foo
3 1
dtype: object
Note that it is of type object, indicating the mixture of data types. Use pandas' built-in function convert_objects to convert values to numeric values, leaving those which are not numeric as NaNs:
>>> s_num = s.convert_objects(convert_numeric=True)
>>> s_num
0 1
1 0
2 NaN
3 1
dtype: float64
As you can see, the data type changed to float64. Now let's set all NaNs to 1, via bool indexing, making use of numpy's isnan() function:
>>> s_num[np.isnan(s_num)] = 1
There you are:
>>> s_num
0 1
1 0
2 1
3 1
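With current pandas the same coerce-then-fill idea fits in one line, using pd.to_numeric instead of the removed convert_objects and fillna instead of the manual bool indexing. A sketch on the same test Series, which gives the same 1, 0, 1, 1 result:
>>> s_fixed = pd.to_numeric(s, errors='coerce').fillna(1).astype(int)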
Edit
This approach is a little too generic for your data, because it also handles floats. The trick that eumiro uses, evaluating all of your values as booleans and then converting back to integer, yields a significant speedup.
His approach, working on a column with 10^6 items, takes about 4 seconds for 100 iterations on my machine:
>>> timeit.timeit("s=pd.Series(samples); s = s.astype(bool).astype(int)", setup="import numpy as np; import pandas as pd; import random as rd; values = [1, 0, 'lol']; samples = [rd.choice(values) for _ in xrange(10**6)]", number=100)
3.974711554180317
My approach takes almost 10 times as long:
>>> timeit.timeit("s=pd.Series(samples); s[np.isnan(s.convert_objects(convert_numeric=True))]=1", setup="import numpy as np; import pandas as pd; import random as rd; values = [1, 0, 'lol']; samples = [rd.choice(values) for _ in xrange(10**6)]", number=100)
30.0552612268267
The durations given include data creation, which by itself takes this long:
>>> timeit.timeit("s = pd.Series(samples)", setup="import numpy as np; import pandas as pd; import random as rd; values = [1, 0, 'lol']; samples = [rd.choice(values) for _ in xrange(10**6)]", number=100)
1.694933953806128
That is, with eumiro's approach, the data creation takes about as long as the processing itself.
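For reference, the snippets above are Python 2 (xrange) and use the now-removed convert_objects. A rough Python 3 adaptation of the same comparison is sketched below; absolute timings will of course differ with your machine and pandas version:
import timeit

setup = ("import pandas as pd; import random as rd; "
         "values = [1, 0, 'lol']; "
         "samples = [rd.choice(values) for _ in range(10**6)]")

# eumiro's bool-cast approach
print(timeit.timeit("s = pd.Series(samples); s = s.astype(bool).astype(int)",
                    setup=setup, number=100))

# coerce-and-fill approach (pd.to_numeric replaces convert_objects)
print(timeit.timeit("s = pd.Series(samples); s[pd.to_numeric(s, errors='coerce').isna()] = 1",
                    setup=setup, number=100))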
Upvotes: 1
Reputation: 213005
Convert all to boolean then to int:
df.open = df.open.astype(bool).astype(int)
1 and any non-empty text evaluate to True, 0 to False. True is 1, False is 0.
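For example, applied to a toy column like the one in the question (hypothetical data, just to illustrate what the cast does):
import pandas as pd

df = pd.DataFrame({'open': [0, 0, 1, 0, 'text', 1, 'open', 0, 0, 'xyz', 1]})
df.open = df.open.astype(bool).astype(int)
# df.open now holds 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1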
Upvotes: 2