Reputation: 13
I have a very large dataframe, and one column contains strings of fixed-length binary numbers.
I want to split every binary digit into its own column. I have working code, but it is extremely slow. My code is:
import numpy as np
import pandas as pd
#data generation
stringLength=5
stringFormat='{0:0'+str(stringLength)+'b}'
temp = [ stringFormat.format(x) for x in np.random.randint(0,high=2**stringLength, size=int(1e6))]
df=pd.DataFrame(temp,columns=['binaryString'])
#slow code below
df.attrs['Some data to preserve']=""
df,df.attrs = df.join(df['binaryString'].str.split('',expand=True).iloc[:, 1:-1].add_prefix('Bit').astype(np.uint8)), df.attrs
print(df)
Can it be made faster?
I cannot use Pandarallel because it requires the Windows Subsystem for Linux, which I cannot run from Visual Studio, but I am open to other parallelization approaches.
Upvotes: 0
Views: 61
Reputation: 1465
Starting from your code, the critical line is:
df,df.attrs = df.join(df['binaryString'].str.split('',expand=True).iloc[:, 1:-1].add_prefix('Bit').astype(np.uint8)), df.attrs
it takes: 2.01 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've tried another approach:
df.join(pd.DataFrame(df['binaryString'].map(list).to_list(), columns=['a','b','c','d','e']))
That seems promising; it takes: 468 ms ± 4.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think that working directly on the values (the underlying NumPy array) could be even faster.
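As a sketch of that NumPy route (an assumption on my part, not timed against the full 1e6-row frame): since the strings all have the same length, they can be converted to fixed-width ASCII bytes and decoded into digits in a single vectorized pass, with no per-row Python loop.

```python
import numpy as np
import pandas as pd

stringLength = 5
df = pd.DataFrame(['00101', '11111', '01010'], columns=['binaryString'])

# View the column as fixed-width byte strings (b'00101', ...)
raw = df['binaryString'].to_numpy().astype(f'S{stringLength}')

# Reinterpret the contiguous buffer as one uint8 per character,
# then subtract ord('0') to turn ASCII '0'/'1' into 0/1
bits = np.frombuffer(raw.tobytes(), dtype=np.uint8).reshape(-1, stringLength) - ord('0')

out = df.join(pd.DataFrame(bits.astype(np.uint8),
                           columns=[f'Bit{i}' for i in range(stringLength)],
                           index=df.index))
```

This stays entirely in NumPy until the final `join`, which is where I would expect the speed-up to come from; the relative gain on your data would still need to be benchmarked.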
OP edit (to name the columns automatically):
df, df.attrs = df.join(pd.DataFrame(df['binaryString'].map(list).to_list()).add_prefix('Bit').astype(np.uint8)), df.attrs
Upvotes: 1