porteclefs
porteclefs

Reputation: 507

Pandas: Conver all strings in column to 1

I have a df with a column df.open. I want to check this column for strings. If there's a string, I'd like to convert it to a 1. There are already a lot of 1s and 0s in the column.

So, suppose the values in the column are as follows: <0,0,1,0,text,1,open,0,0,xyz,1>. I'd like to go through the column and turn all strings (i.e. 'text', 'open', 'xyz') into 1s.

One thing I was thinking was to convert numeric. Then, where ever there's a NaN, convert that to a 1. But that seems silly...

Upvotes: 0

Views: 85

Answers (2)

Dr. Jan-Philip Gehrcke
Dr. Jan-Philip Gehrcke

Reputation: 35776

In short, first create a mask and then use this mask for indexing to change values in place:

boolmask_nonumbers = np.isnan(df.open.convert_objects(convert_numeric=True))
df.open[boolmask_nonumbers] = 1

Longer explanation:

The test object, a Series (which essentially is a column in a DataFrame):

>>> s = pd.Series([1, 0, "foo", 1])
>>> s
0      1
1      0
2    foo
3      1
dtype: object

Note that it is of type object, indicating the mixture of data types. Use pandas' built-in function convert_objects to convert values to numeric values, leaving those which are not numeric as NaNs:

>>> s_num = s.convert_objects(convert_numeric=True)
>>> s_num
0     1
1     0
2   NaN
3     1
dtype: float64

As you can see, the data type changed to float64. Now let's set all NaNs to 1, via bool indexing, making use of numpy's isnan() function:

>>> s_num[np.isnan(s_num)] = 1

There you are:

>>> s_num
0    1
1    0
2    1
3    1

Edit

This approach is a little too generic for your data, because it also handles floats. The trick that eumiro uses, to evaluate all of your values as booleans, and to then convert back to integer, serves a significant speedup.

His approach, working on a column with 10^6 items, takes about 4 seconds for 100 iterations on my machine:

>>> timeit.timeit("s=pd.Series(samples); s = s.astype(bool).astype(int)", setup="import numpy as np; import pandas as pd; import random as rd; values = [1, 0, 'lol']; samples = [rd.choice(values) for _ in xrange(10**6)]", number=100)
3.974711554180317

My approach takes almost 10 times as long:

>>> timeit.timeit("s=pd.Series(samples); s[np.isnan(s.convert_objects(convert_numeric=True))]=1", setup="import numpy as np; import pandas as pd; import random as rd; values = [1, 0, 'lol']; samples = [rd.choice(values) for _ in xrange(10**6)]", number=100)
30.0552612268267

Given durations include data creation, which takes that long:

>>> timeit.timeit("s = pd.Series(samples)", setup="import numpy as np; import pandas as pd; import random as rd; values = [1, 0, 'lol']; samples = [rd.choice(values) for _ in xrange(10**6)]", number=100)
1.694933953806128

That is, with eumiro's approach the data creation takes about as long as data processing.

Upvotes: 1

eumiro
eumiro

Reputation: 213005

Convert all to boolean then to int:

df.open = df.open.astype(bool).astype(int)

1 and any non-empty text is True, 0 is False.

True is 1, False is 0.

Upvotes: 2

Related Questions