ju_gee_bear

Reputation: 33

Filter Pandas DataFrame with Nested Arrays

I have a pandas dataframe that contains arrays within some of its columns. I'd like to filter the dataframe to only contain rows that have a certain value found in the nested array for that column.

For example, I have a dataframe something like this:

label MODEL_INDEX ARRAY_VAL
ID
0    4   (11.0,  0.0)   
1   65   (11.0, 10.0)   
2   73   (11.0, 10.0)   
3   74   (10.0,  0.0)   
4   79   (11.0,  0.0)   
5   80   (10.0,  0.0)   
6   88   (11.0,  0.0) 

I'd like to keep only the rows whose array under ARRAY_VAL satisfies some variable condition, say containing 10.0, to get this:

label MODEL_INDEX ARRAY_VAL
ID  
1   65   (11.0, 10.0)   
2   73   (11.0, 10.0)   
3   74   (10.0,  0.0)    
5   80   (10.0,  0.0) 

Essentially, looking for something like:

df[df['ARRAY_VAL'] where 10.0 in df['ARRAY_VAL']]

Upvotes: 3

Views: 9819

Answers (3)

spies006

Reputation: 2927

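First, collect the index labels of the matching rows:

index = []
# .items() yields (label, value) pairs, so df.loc works even if the index is not 0..n-1
for label, row in df['ARRAY_VAL'].items():
    if 10.0 in row:
        index.append(label)

Then select the rows where 10.0 was found in df['ARRAY_VAL']: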

>>> df.loc[index]

   MODEL_INDEX ARRAY_VAL
1         65  (11, 10)
2         73  (11, 10)
3         74   (10, 0)
5         80   (10, 0)
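
The same idea can also be written as a boolean mask, which avoids having to think about positions versus labels. This is only a minimal sketch: the frame is rebuilt by hand from the question's example (that construction is an assumption, not the asker's code), and `mask` and `filtered` are illustrative names.

```python
import pandas as pd

# Rebuild the example frame from the question (assumed construction).
df = pd.DataFrame(
    {
        "MODEL_INDEX": [4, 65, 73, 74, 79, 80, 88],
        "ARRAY_VAL": [(11.0, 0.0), (11.0, 10.0), (11.0, 10.0), (10.0, 0.0),
                      (11.0, 0.0), (10.0, 0.0), (11.0, 0.0)],
    }
)
df.index.name = "ID"

# One boolean per row: True where the tuple contains 10.0.
mask = [10.0 in row for row in df["ARRAY_VAL"]]

# A list of booleans the same length as the frame selects rows by position.
filtered = df[mask]
print(filtered)
```

Because the mask lines up with the rows one-to-one, it should behave the same whatever the index looks like.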

Upvotes: 0

Niels Joaquin

Reputation: 1215

I think apply is needed since you want to test 10.0 in x for every tuple value x.

df[df['ARRAY_VAL'].apply(lambda x: 10.0 in x)]
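
To illustrate why a per-row test is wanted here: using `in` directly on a Series checks the index labels rather than the stored tuples. A small sketch, assuming a shortened three-row frame in the shape of the question's data (the frame and the variable names are made up for illustration):

```python
import pandas as pd

# Small hypothetical frame in the shape of the question's data.
df = pd.DataFrame({"ARRAY_VAL": [(11.0, 0.0), (11.0, 10.0), (10.0, 0.0)]})

# `in` on a Series looks at the index labels (0, 1, 2), not the tuples.
label_test = 10.0 in df["ARRAY_VAL"]

# apply runs the membership test once per tuple and returns a boolean Series.
row_test = df["ARRAY_VAL"].apply(lambda x: 10.0 in x)

print(label_test)
print(row_test.tolist())
```

The boolean Series from `apply` is what the outer `df[...]` uses to pick rows.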

Upvotes: 4

Adrienne

Reputation: 324

You can use .apply to search the list in each row of the data frame:

# creating the dataframe
import pandas as pd

df = pd.DataFrame({
    'model_idx': [4, 65, 73, 74, 79, 80, 88],
    'array_val': [[11, 0],
                  [11, 10],
                  [11, 10],
                  [10, 0],
                  [11, 0],
                  [10, 0],
                  [11, 0]],
})

# results is a boolean indicating whether the value is found in the list
results = df.array_val.apply(lambda a: 10 in a)

# filter the dataframe based on the boolean indicator
df_final = df[results]

The filtered data frame is:

In [41]: df_final.head()
Out[41]: 
   model_idx array_val
1         65  [11, 10]
2         73  [11, 10]
3         74   [10, 0]
5         80   [10, 0]
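
Since the question mentions "some variable condition", the same `.apply` pattern could be wrapped so the test is passed in as a function. This is just a sketch: `filter_rows` is a hypothetical helper name, not something pandas provides, and the second predicate is an invented example of a different condition.

```python
import pandas as pd


def filter_rows(df, column, predicate):
    """Keep rows where predicate(cell) is truthy (hypothetical helper)."""
    mask = df[column].apply(predicate).astype(bool)
    return df[mask]


df = pd.DataFrame({
    "model_idx": [4, 65, 73, 74, 79, 80, 88],
    "array_val": [[11, 0], [11, 10], [11, 10], [10, 0],
                  [11, 0], [10, 0], [11, 0]],
})

# Same condition as above: the list contains 10.
contains_10 = filter_rows(df, "array_val", lambda a: 10 in a)

# A different condition: the first element equals 10.
starts_with_10 = filter_rows(df, "array_val", lambda a: a[0] == 10)

print(contains_10)
print(starts_with_10)
```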

Upvotes: 9
