Reputation: 13
I have a very large dataframe, and one column contains strings of fixed-length binary numbers.
I want to split every binary digit into its own column. I have working code, but it is extremely slow. My code is:
import numpy as np
import pandas as pd
#data generation
stringLength=5
stringFormat='{0:0'+str(stringLength)+'b}'
temp = [ stringFormat.format(x) for x in np.random.randint(0,high=2**stringLength, size=int(1e6))]
df=pd.DataFrame(temp,columns=['binaryString'])
#slow code below
df.attrs['Some data to preserve']=""
df,df.attrs = df.join(df['binaryString'].str.split('',expand=True).iloc[:, 1:-1].add_prefix('Bit').astype(np.uint8)), df.attrs
print(df)
Can it be made faster?
I cannot use Pandarallel because it requires the Windows Subsystem for Linux, which I cannot run from Visual Studio, but I am open to other parallelization approaches.
Upvotes: 0
Views: 61
Reputation: 1465
Starting from your code, the critical line is:
df,df.attrs = df.join(df['binaryString'].str.split('',expand=True).iloc[:, 1:-1].add_prefix('Bit').astype(np.uint8)), df.attrs
it takes: 2.01 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've tried another approach:
df.join(pd.DataFrame(df['binaryString'].map(list).to_list(), columns=['a','b','c','d','e']))
That seems promising; it takes: 468 ms ± 4.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think that working directly on the values (the underlying NumPy array) could be even faster.
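As a sketch of that NumPy route (an assumption on my part, not timed against the full 1e6-row frame): since the strings all have the same length, they can be converted to fixed-width ASCII bytes and decoded into digits in a single vectorized pass, with no per-row Python loop.

```python
import numpy as np
import pandas as pd

stringLength = 5
df = pd.DataFrame(['00101', '11111', '01010'], columns=['binaryString'])

# View the column as fixed-width byte strings (b'00101', ...)
raw = df['binaryString'].to_numpy().astype(f'S{stringLength}')

# Reinterpret the contiguous buffer as one uint8 per character,
# then subtract ord('0') to turn ASCII '0'/'1' into 0/1
bits = np.frombuffer(raw.tobytes(), dtype=np.uint8).reshape(-1, stringLength) - ord('0')

out = df.join(pd.DataFrame(bits.astype(np.uint8),
                           columns=[f'Bit{i}' for i in range(stringLength)],
                           index=df.index))
```

This stays entirely in NumPy until the final `join`, which is where I would expect the speed-up to come from; the relative gain on your data would still need to be benchmarked.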
OP edit (to name the columns automatically):
df, df.attrs = df.join(pd.DataFrame(df['binaryString'].map(list).to_list()).add_prefix('Bit').astype(np.uint8)), df.attrs
Upvotes: 1