Fontanka16
Fontanka16

Reputation: 1321

Split column in a Dask Dataframe into n number of columns

In a column in a Dask Dataframe, I have strings like this:

column_name_1 column_name_2
a^b^c j
e^f^g k^l
h^i m

I need to split these strings into columns in the same data frame, like this

column_name_1 column_name_2 column_name_1_1 column_name_1_2 column_name_1_3 column_name_2_1 column_name_2_2
a^b^c j a b c j
e^f^g k^l e f g k l
h^i m h i m

I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are tens of columns in the Dataframe that are to be left alone, so I need to be able to specify which columns to split like this.

My best effort either includes something like

df[["column_name_1_1","column_name_1_2 ","column_name_1_3"]] = df["column_name_1"].str.split('^',n=2, expand=True)

But it fails with a

ValueError: The columns in the computed data do not match the columns in the provided metadata

Upvotes: 0

Views: 993

Answers (2)

scj13
scj13

Reputation: 335

Unfortunately using dask.dataframe.Series.str.split with expand=True and an unknown number of splits is not yet supported in Dask, the following returns a NotImplementedError:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

# returns NotImplementedError
ddf['column_name_1'].str.split('^', expand=True).compute()

Usually when a pandas equivalent has not yet been implemented in Dask, map_partitions can be used to apply a Python function on each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, as provided with a meta argument. This makes using Dask for this task challenging. Relatedly, the ValueError occurs because column_name_2 requires only 1 split, and returns a Dask DataFrame with 2 columns, but Dask is expecting a DataFrame with 3 columns.

Here is one solution (building from @Fontanka16's answer) if you do know the number of splits ahead of time:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

ddf_list = []
num_split_dict = {'column_name_1': 2, 'column_name_2': 1}
for col, num_splits in num_split_dict.items():
    split_df = ddf[col].str.split('^', n=num_splits, expand=True).add_prefix(f'{col}_')
    ddf_list.append(split_df)
new_ddf = dd.concat([ddf] + ddf_list, axis=1)
new_ddf.compute()

Upvotes: 1

jezrael
jezrael

Reputation: 862641

Here are 2 solutions working without stack but with loop for selected columns names:

cols = ['column_name_1','column_name_2']
for c in cols:
    df = df.join(df[c].str.split('^',n=2, expand=True).add_prefix(f'{c}_').fillna(''))

print (df)
  column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2  \
0         a^b^c             j               a               b               c   
1         e^f^g           k^l               e               f               g   
2           h^i             m               h               i                   

  column_name_2_0 column_name_2_1  
0               j                  
1               k               l  
2               m                  

Or modify another solution:

cols = ['column_name_1','column_name_2']
dfs = [df[c].str.split('^',n=2, expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
print (df)
  column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2  \
0         a^b^c             j               a               b               c   
1         e^f^g           k^l               e               f               g   
2           h^i             m               h               i                   

  column_name_2_0 column_name_2_1  
0               j                  
1               k               l  
2               m                  

Upvotes: 1

Related Questions