Reputation: 782
I have a pandas DataFrame with one list-column, say:
df = pd.DataFrame({"pairs": [["A|B", "B|C", "C|D", "D|F"], ["A|D", "D|F", "F|G", "G|D"], ["C|D", "D|X"]]})
The lists in the pairs
column always contain consecutive pairs, where the 2 elements of the pairs are separated by a |
. I want to kind of "flatten" the lists in these columns, by instead of storing the pairs, now storing the elements of the pairs in the same order. So the desired DataFrame looks like:
elements
[A, B, C, D, F]
[A, D, F, G, D]
[C, D, X]
(Edited: I also would like to have elements occur multiple times in a resulting list, like D
in the second row)
This looks so simple and I can't believe there wouldn't be an efficient solution to this, but so far I have failed to find any python method that could help me
Upvotes: 4
Views: 756
Reputation: 195543
You can use set-comprehension:
df["elements"] = df["pairs"].apply(
lambda x: {ww for w in x for ww in w.split("|")}
)
print(df)
Prints:
pairs elements
0 [A|B, B|C, C|D, D|F] {B, C, D, A, F}
1 [A|D, D|F, F|G] {G, D, F, A}
2 [C|D, D|X] {X, C, D}
If you want lists:
df["elements"] = df["pairs"].apply(
lambda x: list({ww for w in x for ww in w.split("|")})
)
print(df)
pairs elements
0 [A|B, B|C, C|D, D|F] [D, F, A, C, B]
1 [A|D, D|F, F|G] [G, D, A, F]
2 [C|D, D|X] [X, D, C]
EDIT: To mantain order:
def fn(x):
seen = set()
out = []
for v in x:
for w in v.split("|"):
if not w in seen:
seen.add(w)
out.append(w)
return out
df["elements"] = df["pairs"].apply(fn)
print(df)
Prints:
pairs elements
0 [A|B, B|C, C|D, D|F] [A, B, C, D, F]
1 [A|D, D|F, F|G, G|D] [A, D, F, G]
2 [C|D, D|X] [C, D, X]
EDIT: To keep multiple elements + order:
from itertools import groupby, chain
def fn(x):
return [v for v, _ in groupby(chain.from_iterable(v.split("|") for v in x))]
df["elements"] = df["pairs"].apply(fn)
print(df)
Prints:
pairs elements
0 [A|B, B|C, C|D, D|F] [A, B, C, D, F]
1 [A|D, D|F, F|G, G|D] [A, D, F, G, D]
2 [C|D, D|X] [C, D, X]
Upvotes: 2
Reputation: 75100
Easiest would be a double explode with groupby+unique:
df['pairs'].explode().str.split("|").explode().groupby(level=0).unique()
0 [A, B, C, D, F]
1 [A, D, F, G]
2 [C, D, X]
Name: pairs, dtype: object
Assign this back:
df['elements'] = df['pairs'].explode().str.split("|").explode().groupby(level=0).unique()
EDIT:
For considering only consecutive duplicates, use:
s = df['pairs'].explode().str.split("|").explode()
out = s[s.ne(s.shift())].groupby(level=0).agg(list)
Upvotes: 2