aehruesch
aehruesch

Reputation: 55

Using numpy.where (or numpy.select) on a lists within a pandas DataFrame

The following question is a simplification of this: Iterating through lists within a pandas DataFrame

I have a DataFrame which contains a column of lists:

import numpy as np
import pandas as pd

col = [["A", "B", "C", "D"], ["E", "F"]]
d = {"col" : [["A", "B", "C", "D"], ["E", "F"]]}

df = pd.DataFrame(d)

print(df)
Out[2]: 
            col
0  [A, B, C, D]
1        [E, F]

For each row I want to iterate through the list and select between the following cases:

The resulting DataFrame should look like this:

            col  0    1    2       3      4       5
0  [A, B, C, D]  A  B-A  C-B     D-C      D  np.nan
1        [E, F]  E  F-E    F  np.nan np.nan  np.nan

To get to this result I tried a nested numpy.where function:

for i in range(7):
    df[i] = pd.DataFrame(np.where(i == 0,
                                  df["col"].apply(lambda x: x[0]),
                                  np.where(i == df["col"].apply(len),
                                           df["col"].apply(lambda x: x[-1]),
                                           np.where((i > 0) & (i <= df["col"].apply(len) - 1),
                                                    df["col"].apply(lambda x: x[i]) + '-' + df["col"].apply(lambda x: x[i-1]),
                                                    np.nan
                                                    )
                                           )
                                  )
                          )
                           
print(df)

And here is my problem: I get an IndexError: list index out of range

I suppose it has something to do with the i. Even if I catch the invalid cased of i the whole nested where term is not valid. (I tried it with numpy.select too, but got the same result.)

If I replace the index i with 1 it works (of cource it will give me the wrong values, but I don't get an error), so it has to have something to do with this index, but I can't figure out how to solve the problem:

for i in range(7):
    df[i] = pd.DataFrame(np.where(i == 0,
                                  df["col"].apply(lambda x: x[0]),
                                  np.where(i == df["col"].apply(len),
                                           df["col"].apply(lambda x: x[-1]),
                                           np.where((i > 0) & (i <= df["col"].apply(len) - 1),
                                                    df["col"].apply(lambda x: x[1]) + '-' + df["col"].apply(lambda x: x[1-1]),
                                                    np.nan
                                                    )
                                           )
                                  )
                          )
                               
print(df)

            col  0    1    2       3      4       5
0  [A, B, C, D]  A  B-A  B-A     B-A      D  np.nan
1        [E, F]  E  F-E    F  np.nan np.nan  np.nan

Can you think of a solution to this or to an alternative way to get my desired DataFrame?

Upvotes: 1

Views: 1132

Answers (1)

stevepastelan
stevepastelan

Reputation: 1304

I would code the logic as a separate function:

from typing import List

def compute_event_transitions(L: List[str]) -> pd.Series: 
    if len(L) <= 1:
        return pd.Series(L)

    first = pd.Series(L[0])
    last = pd.Series(L[-1])

    s1 = pd.Series(L)
    s2 = s1.shift(+1)

    middle = (
        pd.concat([s2, s1], axis='columns')
        [1:]  # The first element has no "from" transition
        .apply(lambda s: "-".join(s.tolist()), axis='columns')        
    )

    transitions = pd.concat([first, middle, last]).reset_index(drop=True)

    return transitions

Now you can apply this calculation to each element in your dataframe:

all_transitions = df['col'].apply(compute_event_transitions)
   0    1    2    3    4
0  A  A-B  B-C  C-D    D
1  E  E-F    F  NaN  NaN

Note that it is indexed the same way as your original dataframe, so you can stitch it back to the column of lists:

pd.concat([df, all_transitions], axis='columns')
            col  0    1    2    3    4
0  [A, B, C, D]  A  A-B  B-C  C-D    D
1        [E, F]  E  E-F    F  NaN  NaN

Upvotes: 1

Related Questions