Reputation: 4292
I have a Pandas dataframe that consists of several columns, the cells of which may or may not contain a string. As an example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':['asfe','eseg','eesg','4dsf','','hdt','gase','gex','gsges','hhbr'],
'B':['','bdb','htsdg','','rdshg','th','tjf','','',''],
'C':['hrd','jyf','sef','hdsr','','','','','hdts','aseg'],
'D':['','','hdts','afse','nfd','','htf','','',''],
'E':['','','','','jftd','','','','jfdt','']})
...which looks like:
A B C D E
0 asfe hrd
1 eseg bdb jyf
2 eesg htsdg sef hdts
3 4dsf hdsr afse
4 rdshg nfd jftd
5 hdt th
6 gase tjf htf
7 gex
8 gsges hdts jfdt
9 hhbr aseg
I want to create a column that contains a binary representation of whether the column contains a string or not; as an example, the first row would be represented as 10100.
The only way I can think of doing this is to:
This is the code I've created:
scratchdf = pd.DataFrame().reindex_like(df)
for col in df.columns.values:
scratchdf[col] = df[col].str.contains(r'\w+',regex = True).astype(int)
scratchdf['bin'] = scratchdf['A'].astype(str) + \
scratchdf['B'].astype(str) + \
scratchdf['C'].astype(str) + \
scratchdf['D'].astype(str) + \
scratchdf['E'].astype(str)
df = df.join(scratchdf['bin'])
...which produces the final dataframe:
A B C D E bin
0 asfe hrd 10100
1 eseg bdb jyf 11100
2 eesg htsdg sef hdts 11110
3 4dsf hdsr afse 10110
4 rdshg nfd jftd 01011
5 hdt th 11000
6 gase tjf htf 11010
7 gex 10000
8 gsges hdts jfdt 10101
9 hhbr aseg 10100
This works but seems a bit wasteful (especially with large dataframes). Is there a way to create the binary representation column directly, without needing to create a scratch dataframe?
Upvotes: 3
Views: 321
Reputation: 7994
Method 1
a = np.where(df != "", "1", "0").astype("|S1")
df["bin"] = np.apply_along_axis(lambda x: x.tostring().decode("utf-8"), 1, a)
Method 2
df["bin"] = np.append(
np.where(df != "", "1", "0").astype("S1"),
np.array([["\n"]]).astype("S1").repeat(df.shape[0], axis=0),
axis=1
).tostring().decode("utf-8")[:-1].split("\n")
Method 2 appends \n
to the end of the numpy array
array([[b'1', b'0', b'1', b'0', b'0', b'\n'],
[b'1', b'1', b'1', b'0', b'0', b'\n'],
[b'1', b'1', b'1', b'1', b'0', b'\n'],
...,
[b'1', b'0', b'0', b'0', b'0', b'\n'],
[b'1', b'0', b'1', b'0', b'1', b'\n'],
[b'1', b'0', b'1', b'0', b'0', b'\n']], dtype='|S1')
Then call tostring
and decode
. Remove the last "\n" and then split by "\n".
Method 3 (Using view
Reference: numpy array of chars to string)
np.ascontiguousarray(
np.where(df != "", "1", "0").astype("S1")
).view('|S5').astype(str)
(Based on jezrael's setup df = pd.concat([df] * 1000, ignore_index=True))
# method 2
%timeit np.append(np.where(df != "", "1", "0").astype("S1"), np.array([["\n"]]).astype("S1").repeat(df.shape[0], axis=0), axis=1).tostring().decode("utf-8")[:-1].split("\n")
12.3 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# method 3
%timeit np.ascontiguousarray(np.where(df != "", "1", "0").astype("S1")).view('|S5').astype(str)
12.8 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# method 1 (slower)
%timeit np.apply_along_axis(lambda x: x.tostring().decode("utf-8"), 1, np.where(df != "", "1", "0").astype("S1"))
45 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Replicated experiments of jezrael's.
In [99]: %timeit df.astype(bool).astype(int).astype(str).values.sum(axis=1)
28.9 ms ± 782 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [100]: %timeit (df != '').astype(int).astype(str).values.sum(axis=1)
29 ms ± 645 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [101]: %timeit (df != '').astype(int).astype(str).apply(''.join, axis=1)
168 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [102]: %timeit df.astype(bool).astype(int).astype(str).apply(''.join, axis=1)
173 ms ± 7.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [103]: %timeit df.astype(bool).astype(int).apply(lambda row: ''.join(str(i) for i in row), axis=1)
159 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 403
You can use the fact that empty strings correspond to False and non-empty to True. So casting the string dataframe to bool gives a dataframe as true and false. Recasting this to int converts true to 1 and false to 0, then just apply a join operation across rows:
df['binary'] = df.astype(bool).astype(int).apply(
lambda row: ''.join(str(i) for i in row), axis=1)
print(df)
result:
A B C D E binary
0 asfe hrd 10100
1 eseg bdb jyf 11100
2 eesg htsdg sef hdts 11110
3 4dsf hdsr afse 10110
4 rdshg nfd jftd 01011
5 hdt th 11000
6 gase tjf htf 11010
7 gex 10000
8 gsges hdts jfdt 10101
9 hhbr aseg 10100
edit: just realized another user posted basically the same thing (also fixed copy error)
Here's another way using generators:
def iterable_to_binary_mask(iterable):
bools = (bool(i) for i in iterable)
ints = (int(i) for i in bools)
strs = (str(i) for i in ints)
return ''.join(strs)
df['binary'] = df.apply(iterable_to_binary_mask, axis=1)
This is about 3 times slower than the type conversion approach on my machine but should use minimal memory.
Upvotes: 2
Reputation: 862901
Check empty string or convert to bool
first, then convert to int
, str
and last join
or sum
:
df['new'] = (df != '').astype(int).astype(str).apply(''.join, axis=1)
#faster alternative
df['new'] = (df != '').astype(int).astype(str).values.sum(axis=1)
print (df)
A B C D E new
0 asfe hrd 10100
1 eseg bdb jyf 11100
2 eesg htsdg sef hdts 11110
3 4dsf hdsr afse 10110
4 rdshg nfd jftd 01011
5 hdt th 11000
6 gase tjf htf 11010
7 gex 10000
8 gsges hdts jfdt 10101
9 hhbr aseg 10100
Timings:
df = pd.concat([df] * 1000, ignore_index=True)
In [99]: %timeit df.astype(bool).astype(int).astype(str).values.sum(axis=1)
10 loops, best of 3: 155 ms per loop
In [100]: %timeit (df != '').astype(int).astype(str).values.sum(axis=1)
10 loops, best of 3: 158 ms per loop
In [101]: %timeit (df != '').astype(int).astype(str).apply(''.join, axis=1)
1 loop, best of 3: 330 ms per loop
In [102]: %timeit df.astype(bool).astype(int).astype(str).apply(''.join, axis=1)
1 loop, best of 3: 326 ms per loop
In [103]: %timeit df.astype(bool).astype(int).apply(lambda row: ''.join(str(i) for i in row), axis=1)
1 loop, best of 3: 210 ms per loop
Upvotes: 2