Reputation: 1321
In a column in a Dask DataFrame, I have strings like this:
column_name_1 | column_name_2 |
---|---|
a^b^c | j |
e^f^g | k^l |
h^i | m |
I need to split these strings into new columns in the same DataFrame, like this:
column_name_1 | column_name_2 | column_name_1_1 | column_name_1_2 | column_name_1_3 | column_name_2_1 | column_name_2_2 |
---|---|---|---|---|---|---|
a^b^c | j | a | b | c | j | |
e^f^g | k^l | e | f | g | k | l |
h^i | m | h | i | | m | |
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are tens of columns in the DataFrame that are to be left alone, so I need to be able to specify which columns to split like this.
My best effort includes something like
df[["column_name_1_1", "column_name_1_2", "column_name_1_3"]] = df["column_name_1"].str.split('^', n=2, expand=True)
But it fails with a
ValueError: The columns in the computed data do not match the columns in the provided metadata
Upvotes: 0
Views: 993
Reputation: 335
Unfortunately, using dask.dataframe.Series.str.split with expand=True and an unknown number of splits is not yet supported in Dask; the following returns a NotImplementedError:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'],
        'column_name_2': ['j', 'k^l', 'm'],
    }),
    npartitions=2,
)
# returns NotImplementedError
ddf['column_name_1'].str.split('^', expand=True).compute()
Usually, when a pandas equivalent has not yet been implemented in Dask, map_partitions can be used to apply a Python function to each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, as provided with a meta argument. This makes using Dask for this task challenging. Relatedly, the ValueError occurs because column_name_2 requires only 1 split and returns a Dask DataFrame with 2 columns, but Dask is expecting a DataFrame with 3 columns.
Here is one solution (building from @Fontanka16's answer) if you do know the number of splits ahead of time:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'],
        'column_name_2': ['j', 'k^l', 'm'],
    }),
    npartitions=2,
)
ddf_list = []
num_split_dict = {'column_name_1': 2, 'column_name_2': 1}
for col, num_splits in num_split_dict.items():
    split_df = ddf[col].str.split('^', n=num_splits, expand=True).add_prefix(f'{col}_')
    ddf_list.append(split_df)
new_ddf = dd.concat([ddf] + ddf_list, axis=1)
new_ddf.compute()
Upvotes: 1
Reputation: 862641
Here are two solutions that work without stack, using a loop over the selected column names:
cols = ['column_name_1','column_name_2']
for c in cols:
    df = df.join(df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna(''))
print(df)
column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2 \
0 a^b^c j a b c
1 e^f^g k^l e f g
2 h^i m h i
column_name_2_0 column_name_2_1
0 j
1 k l
2 m
Or modify another solution:
cols = ['column_name_1','column_name_2']
dfs = [df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
print(df)
column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2 \
0 a^b^c j a b c
1 e^f^g k^l e f g
2 h^i m h i
column_name_2_0 column_name_2_1
0 j
1 k l
2 m
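In plain pandas (unlike Dask), the n argument can simply be omitted: split with expand=True then infers the number of output columns from the data, which addresses the "unknown number of delimiters" concern directly. A small sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({
    'column_name_1': ['a^b^c', 'e^f^g', 'h^i'],
    'column_name_2': ['j', 'k^l', 'm'],
})

cols = ['column_name_1', 'column_name_2']
# Omitting n lets pandas size the output to the longest split in each column.
dfs = [df[c].str.split('^', expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
```

This works in pandas because the whole column is in memory, so the result width is known at call time; it is exactly what Dask cannot do lazily without meta.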
Upvotes: 1