Peter
Peter

Reputation: 782

How to flatten delimited strings in a list-column of a Pandas DataFrame?

I have a pandas DataFrame with one list-column, say:

df = pd.DataFrame({"pairs": [["A|B", "B|C", "C|D", "D|F"], ["A|D", "D|F", "F|G", "G|D"], ["C|D", "D|X"]]})

The lists in the pairs column always contain consecutive pairs, where the 2 elements of the pairs are separated by a |. I want to kind of "flatten" the lists in these columns, by instead of storing the pairs, now storing the elements of the pairs in the same order. So the desired DataFrame looks like:

elements
[A, B, C, D, F]
[A, D, F, G, D]
[C, D, X]

(Edited: I also would like to have elements occur multiple times in a resulting list, like D in the second row)

This looks so simple and I can't believe there wouldn't be an efficient solution to this, but so far I have failed to find any python method that could help me

Upvotes: 4

Views: 756

Answers (2)

Andrej Kesely
Andrej Kesely

Reputation: 195543

You can use set-comprehension:

df["elements"] = df["pairs"].apply(
    lambda x: {ww for w in x for ww in w.split("|")}
)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  {B, C, D, A, F}
1       [A|D, D|F, F|G]     {G, D, F, A}
2            [C|D, D|X]        {X, C, D}

If you want lists:

df["elements"] = df["pairs"].apply(
    lambda x: list({ww for w in x for ww in w.split("|")})
)
print(df)
                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [D, F, A, C, B]
1       [A|D, D|F, F|G]     [G, D, A, F]
2            [C|D, D|X]        [X, D, C]

EDIT: To mantain order:

def fn(x):
    seen = set()
    out = []
    for v in x:
        for w in v.split("|"):
            if not w in seen:
                seen.add(w)
                out.append(w)
    return out


df["elements"] = df["pairs"].apply(fn)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [A, B, C, D, F]
1  [A|D, D|F, F|G, G|D]     [A, D, F, G]
2            [C|D, D|X]        [C, D, X]


EDIT: To keep multiple elements + order:

from itertools import groupby, chain

def fn(x):
    return [v for v, _ in groupby(chain.from_iterable(v.split("|") for v in x))]

df["elements"] = df["pairs"].apply(fn)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [A, B, C, D, F]
1  [A|D, D|F, F|G, G|D]  [A, D, F, G, D]
2            [C|D, D|X]        [C, D, X]

Upvotes: 2

anky
anky

Reputation: 75100

Easiest would be a double explode with groupby+unique:

df['pairs'].explode().str.split("|").explode().groupby(level=0).unique()

0    [A, B, C, D, F]
1       [A, D, F, G]
2          [C, D, X]
Name: pairs, dtype: object

Assign this back:

df['elements'] = df['pairs'].explode().str.split("|").explode().groupby(level=0).unique()

EDIT:

For considering only consecutive duplicates, use:

s = df['pairs'].explode().str.split("|").explode()
out = s[s.ne(s.shift())].groupby(level=0).agg(list)

Upvotes: 2

Related Questions