How to flatten delimited strings in a list-column of a Pandas DataFrame?

Question

I have a pandas DataFrame with one list-column, say:

df = pd.DataFrame({"pairs": [["A|B", "B|C", "C|D", "D|F"], ["A|D", "D|F", "F|G", "G|D"], ["C|D", "D|X"]]})

The lists in the pairs column always contain consecutive pairs, where the 2 elements of the pairs are separated by a |. I want to kind of "flatten" the lists in these columns, by instead of storing the pairs, now storing the elements of the pairs in the same order. So the desired DataFrame looks like:

elements
[A, B, C, D, F]
[A, D, F, G, D]
[C, D, X]

(Edited: I also would like to have elements occur multiple times in a resulting list, like D in the second row)

This looks so simple and I can't believe there wouldn't be an efficient solution to this, but so far I have failed to find any python method that could help me

Andrej Kesely · Accepted Answer

You can use set-comprehension:

df["elements"] = df["pairs"].apply(
    lambda x: {ww for w in x for ww in w.split("|")}
)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  {B, C, D, A, F}
1       [A|D, D|F, F|G]     {G, D, F, A}
2            [C|D, D|X]        {X, C, D}

If you want lists:

df["elements"] = df["pairs"].apply(
    lambda x: list({ww for w in x for ww in w.split("|")})
)
print(df)

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [D, F, A, C, B]
1       [A|D, D|F, F|G]     [G, D, A, F]
2            [C|D, D|X]        [X, D, C]

EDIT: To mantain order:

def fn(x):
    seen = set()
    out = []
    for v in x:
        for w in v.split("|"):
            if not w in seen:
                seen.add(w)
                out.append(w)
    return out


df["elements"] = df["pairs"].apply(fn)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [A, B, C, D, F]
1  [A|D, D|F, F|G, G|D]     [A, D, F, G]
2            [C|D, D|X]        [C, D, X]

EDIT: To keep multiple elements + order:

from itertools import groupby, chain

def fn(x):
    return [v for v, _ in groupby(chain.from_iterable(v.split("|") for v in x))]

df["elements"] = df["pairs"].apply(fn)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [A, B, C, D, F]
1  [A|D, D|F, F|G, G|D]  [A, D, F, G, D]
2            [C|D, D|X]        [C, D, X]

How to flatten delimited strings in a list-column of a Pandas DataFrame?

Answers (2)

Related Questions