Reputation: 43
I have a pandas dataframe in which one column of text strings contains multiple comma-separated values. I want to split each field and create a new row per entry only where the number of commas is equal to 2. My entire dataframe has only values with either no. of commas =1 or 2. For example, a should become b:
In [7]: a
Out[7]:
var1 var2 var3
0 a,b,c 1 X
1 d,e,f 2 Y
2 g,h 3 Z
In [8]: b
Out[8]:
var1 var2 var3
0 a,c 1 X
1 b,c 1 X
2 d,f 2 Y
3 e,f 2 Y
4 g,h 3 Z
Upvotes: 0
Views: 148
Reputation: 8768
Here is another way using itertools.pairwise()
df.assign(var1 = df['var1'].map(lambda x: list(map(','.join,itertools.pairwise(x.split(',')))))).explode('var1')
Old answer:
df2 = df.loc[df['var1'].str.count(',').eq(2)]
s = (df2.assign(var1 = df2['var1'].str.split(','))
.explode('var1').groupby(level=0)
.agg(one = ('var1',lambda x: x.iloc[0] +','+ x.iloc[-1]),
two = ('var1',lambda x: x.iloc[1] +','+ x.iloc[-1]))
.stack().droplevel(1))
df2 = (pd.concat([df.loc[s.index].assign(var1 = s.to_numpy()),
df.loc[df['var1'].str.count(',').eq(1)]],ignore_index=True))
Upvotes: 0
Reputation: 195428
Based on your comment that var1
column has only 1 or 2 commas:
def fn(x):
x = x.split(",")
if len(x) == 2:
return [",".join(x)]
return ["{},{}".format(x[0], x[2]), "{},{}".format(x[1], x[2])]
df = df.assign(var1=df["var1"].apply(fn)).explode("var1").reset_index(drop=True)
print(df)
Prints:
var1 var2 var3
0 a,c 1 X
1 b,c 1 X
2 d,f 2 Y
3 e,f 2 Y
4 g,h 3 Z
Upvotes: 1
Reputation: 2293
I'm doing in two steps: first, transform the first column when there are two commas, introducing a tuple of strings (this is done by applying func
to the first column, each s
is a cell string contents). Then use explode
to turn those tuples into a couple of rows.
def func(s):
t = s.split(',')
return s if len(t) == 2 else (f'{t[0]},{t[2]}', f'{t[1]},{t[2]}')
df.var1 = df.var1.apply(func)
df = df.explode('var1').reset_index(drop=True)
Upvotes: 0
Reputation: 31156
import io
import itertools
df = pd.read_csv(io.StringIO(""" var1 var2 var3
0 a,b,c 1 X
1 d,e,f 2 Y
2 g,h 3 Z"""), sep="\s+")
df["var1_2"] = df["var1"].str.split(",").apply(lambda x: [",".join(list(c))
for c in itertools.combinations(x, 2)
if len(x)<=2 or list(c) != x[:2]])
df.explode("var1_2")
var1 | var2 | var3 | var1_2 |
---|---|---|---|
a,b,c | 1 | X | a,c |
a,b,c | 1 | X | b,c |
d,e,f | 2 | Y | d,f |
d,e,f | 2 | Y | e,f |
g,h | 3 | Z | g,h |
Upvotes: 1