user1718097

Reputation: 4292

Creating a binary representation of whether a string is present or not in a Pandas dataframe

I have a Pandas dataframe that consists of several columns, the cells of which may or may not contain a string. As an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':['asfe','eseg','eesg','4dsf','','hdt','gase','gex','gsges','hhbr'],
                   'B':['','bdb','htsdg','','rdshg','th','tjf','','',''],
                   'C':['hrd','jyf','sef','hdsr','','','','','hdts','aseg'],
                   'D':['','','hdts','afse','nfd','','htf','','',''],
                   'E':['','','','','jftd','','','','jfdt','']})

...which looks like:

       A      B     C     D     E
0   asfe          hrd            
1   eseg    bdb   jyf            
2   eesg  htsdg   sef  hdts      
3   4dsf         hdsr  afse      
4         rdshg         nfd  jftd
5    hdt     th                  
6   gase    tjf         htf      
7    gex                         
8  gsges         hdts        jfdt
9   hhbr         aseg            

I want to create a column that contains a binary representation of whether each column contains a string or not; for example, the first row would be represented as 10100.
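For instance, the first row ('asfe', '', 'hrd', '', '') should map as follows (a plain-Python illustration of the encoding I'm after, not the approach itself):

row = ['asfe', '', 'hrd', '', '']
''.join('1' if cell else '0' for cell in row)  # -> '10100'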

The only way I can think of doing this is to:

  1. Create a scratch dataframe
  2. Step through each column, detecting whether each cell contains any characters and representing that as 0/1
  3. Concatenate the binary outcomes into a single string
  4. Copy the resulting column from the scratch dataframe back to the original.

This is the code I've created:

scratchdf = pd.DataFrame().reindex_like(df)

for col in df.columns.values:
    scratchdf[col] = df[col].str.contains(r'\w+',regex = True).astype(int)

scratchdf['bin'] =  scratchdf['A'].astype(str) + \
                    scratchdf['B'].astype(str) + \
                    scratchdf['C'].astype(str) + \
                    scratchdf['D'].astype(str) + \
                    scratchdf['E'].astype(str)

df = df.join(scratchdf['bin'])

...which produces the final dataframe:

       A      B     C     D     E    bin
0   asfe          hrd              10100
1   eseg    bdb   jyf              11100
2   eesg  htsdg   sef  hdts        11110
3   4dsf         hdsr  afse        10110
4         rdshg         nfd  jftd  01011
5    hdt     th                    11000
6   gase    tjf         htf        11010
7    gex                           10000
8  gsges         hdts        jfdt  10101
9   hhbr         aseg              10100

This works but seems a bit wasteful (especially with large dataframes). Is there a way to create the binary representation column directly, without needing to create a scratch dataframe?

Upvotes: 3

Views: 321

Answers (3)

Tai

Reputation: 7994

Method 1

a = np.where(df != "", "1", "0").astype("|S1")
df["bin"] = np.apply_along_axis(lambda x: x.tostring().decode("utf-8"), 1, a)

Method 2

df["bin"] = np.append(
               np.where(df != "", "1", "0").astype("S1"), 
               np.array([["\n"]]).astype("S1").repeat(df.shape[0], axis=0), 
               axis=1
            ).tostring().decode("utf-8")[:-1].split("\n")

Method 2 appends a "\n" column to the numpy array, so every row ends with a newline:

array([[b'1', b'0', b'1', b'0', b'0', b'\n'],
       [b'1', b'1', b'1', b'0', b'0', b'\n'],
       [b'1', b'1', b'1', b'1', b'0', b'\n'],
       ...,
       [b'1', b'0', b'0', b'0', b'0', b'\n'],
       [b'1', b'0', b'1', b'0', b'1', b'\n'],
       [b'1', b'0', b'1', b'0', b'0', b'\n']], dtype='|S1')

Then call tostring and decode, strip the trailing "\n", and split on "\n".
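As a small illustration of that step on two made-up rows (note tostring() is just an alias of tobytes() and is deprecated in newer NumPy):

import numpy as np

a = np.array([[b'1', b'0', b'\n'],
              [b'0', b'1', b'\n']], dtype='|S1')
s = a.tostring().decode('utf-8')  # '10\n01\n'
s[:-1].split('\n')                # ['10', '01']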

Method 3 (using view; reference: numpy array of chars to string)

np.ascontiguousarray(
    np.where(df != "", "1", "0").astype("S1")
).view('|S5').astype(str)
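Note that the '|S5' width is tied to this example's five columns; a sketch of a more general form (my assumption, not part of the original answer) derives the width from the frame's shape:

n_cols = df.shape[1]  # 5 in this example
np.ascontiguousarray(
    np.where(df != "", "1", "0").astype("S1")
).view(f'|S{n_cols}').astype(str)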

Timings:

(Based on jezrael's setup df = pd.concat([df] * 1000, ignore_index=True))

# method 2
%timeit np.append(np.where(df != "", "1", "0").astype("S1"), np.array([["\n"]]).astype("S1").repeat(df.shape[0], axis=0), axis=1).tostring().decode("utf-8")[:-1].split("\n")
12.3 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# method 3
%timeit np.ascontiguousarray(np.where(df != "", "1", "0").astype("S1")).view('|S5').astype(str)
12.8 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# method 1 (slower)
%timeit np.apply_along_axis(lambda x: x.tostring().decode("utf-8"), 1, np.where(df != "", "1", "0").astype("S1"))
45 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Replication of jezrael's experiments on the same setup:

In [99]: %timeit df.astype(bool).astype(int).astype(str).values.sum(axis=1)
28.9 ms ± 782 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [100]: %timeit (df != '').astype(int).astype(str).values.sum(axis=1)
29 ms ± 645 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [101]: %timeit (df != '').astype(int).astype(str).apply(''.join, axis=1)
168 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [102]: %timeit df.astype(bool).astype(int).astype(str).apply(''.join, axis=1)
173 ms ± 7.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [103]: %timeit df.astype(bool).astype(int).apply(lambda row: ''.join(str(i) for i in row), axis=1)
159 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

evamicur

Reputation: 403

You can use the fact that empty strings evaluate to False and non-empty strings to True. Casting the string dataframe to bool gives a dataframe of True/False values; recasting that to int converts True to 1 and False to 0. Then just apply a join operation across rows:

df['binary'] = df.astype(bool).astype(int).apply(
    lambda row: ''.join(str(i) for i in row), axis=1)
print(df)

result:

       A      B     C     D     E  binary
0   asfe          hrd              10100
1   eseg    bdb   jyf              11100
2   eesg  htsdg   sef  hdts        11110
3   4dsf         hdsr  afse        10110
4         rdshg         nfd  jftd  01011
5    hdt     th                    11000
6   gase    tjf         htf        11010
7    gex                           10000
8  gsges         hdts        jfdt  10101
9   hhbr         aseg              10100

Edit: just realized another user posted basically the same thing (also fixed a copy error).

Here's another way using generators:

def iterable_to_binary_mask(iterable):
    bools = (bool(i) for i in iterable)
    ints = (int(i) for i in bools)
    strs = (str(i) for i in ints)
    return ''.join(strs)

df['binary'] = df.apply(iterable_to_binary_mask, axis=1)

This is about 3 times slower than the type conversion approach on my machine but should use minimal memory.

Upvotes: 2

jezrael

Reputation: 862901

Check for empty strings (or convert to bool first), then convert to int and str, and finally join or sum across rows:

df['new'] = (df != '').astype(int).astype(str).apply(''.join, axis=1)

#faster alternative
df['new'] = (df != '').astype(int).astype(str).values.sum(axis=1)

print (df)

       A      B     C     D     E    new
0   asfe          hrd              10100
1   eseg    bdb   jyf              11100
2   eesg  htsdg   sef  hdts        11110
3   4dsf         hdsr  afse        10110
4         rdshg         nfd  jftd  01011
5    hdt     th                    11000
6   gase    tjf         htf        11010
7    gex                           10000
8  gsges         hdts        jfdt  10101
9   hhbr         aseg              10100

Timings:

df = pd.concat([df] * 1000, ignore_index=True)

In [99]: %timeit df.astype(bool).astype(int).astype(str).values.sum(axis=1)
10 loops, best of 3: 155 ms per loop

In [100]: %timeit (df != '').astype(int).astype(str).values.sum(axis=1)
10 loops, best of 3: 158 ms per loop

In [101]: %timeit (df != '').astype(int).astype(str).apply(''.join, axis=1)
1 loop, best of 3: 330 ms per loop

In [102]: %timeit df.astype(bool).astype(int).astype(str).apply(''.join, axis=1)
1 loop, best of 3: 326 ms per loop

In [103]: %timeit df.astype(bool).astype(int).apply(lambda row: ''.join(str(i) for i in row), axis=1)
1 loop, best of 3: 210 ms per loop

Upvotes: 2
