xavi
xavi

Reputation: 80

Return specific values of a dataframe based on a condition

I have a list like:

list_of_list = [["aa", "yy"], ["gg", "xx"]]

and a pandas dataframe like this:

   month   column1  column2
0  June     xx        aa               
1  June     gg        xx                
2  August   xx        yy         

I go through each sub_list of the list_of_lists and check if each item of the sub_list exists in any of the columns of the dataframe.

for sub_list in list_of_lists:
    print("\n")
    print("Sub_list: ", sub_list)
    for item in sub_list:
        for col in df.columns:
            if item in df[col].values:
                print(df[col][df[col] == item].to_frame())

This returns me:

Sub_list:  ['aa', 'yy']

  column2
0      aa

  column2
2      yy



Sub_list:  ["gg", "xx"]

  column1
1      gg

  column1
0      xx

  column1
2      xx

  column2
1      xx

What I want is except to the above to return also the index and the corresponding month when and only when, each sub_list of the list_of_lists is not in the same index

So for list_of_list = [["aa", "yy"]] except to the above code i want also to return

0  June   aa
2  August yy  

Any ideas?

Upvotes: 0

Views: 948

Answers (3)

Corralien
Corralien

Reputation: 120559

Use melt to flat your original dataframe and create a dataframe from your list_of_list then merge them and finally remove duplicates in two pass.

Step 1. Format your dataframes

df1 = df.melt('month', ignore_index=False).reset_index()
df2 = pd.DataFrame(list_of_list).melt(ignore_index=False)['value'].reset_index()
print(df1, df2)

# Output
# df1 (from your original df)
   index   month variable value
0      0    June  column1    xx
1      1    June  column1    gg
2      2  August  column1    xx
3      0    June  column2    aa
4      1    June  column2    xx
5      2  August  column2    yy

# df2 (from list_of_list)
   index value
0      0    aa
1      1    gg
2      0    yy
3      1    xx

Step 2. Merge your two dataframes

out = df1.merge(df2, on='value')
print(out)

# Output:
   index_x   month variable value  index_y
0        0    June  column1    xx        1
1        2  August  column1    xx        1
2        1    June  column2    xx        1
3        1    June  column1    gg        1
4        0    June  column2    aa        0
5        2  August  column2    yy        0

Step 3. Keep only right rows

out = out.drop_duplicates(subset=['index_x', 'index_y'], keep=False) \
         .drop_duplicates('value', keep=False)
print(out)

# Output:
   index_x   month variable value  index_y
4        0    June  column2    aa        0
5        2  August  column2    yy        0

Step 4. Final output

out = out.set_index('index_x').rename_axis(index=None)[['month', 'value']]
print(out)

# Output:
    month value
0    June    aa
2  August    yy

Edit Start after step 2, create a new column notin:

out['notin'] = out.groupby('index_y')['index_x'] \
                  .transform(lambda x: len(x) == len(set(x)))
print(out)

# Output:
   index_x   month variable value  index_y  notin
0        0    June  column1    xx        1  False  # For ["gg", "xx"]
1        2  August  column1    xx        1  False
2        1    June  column2    xx        1  False
3        1    June  column1    gg        1  False
4        0    June  column2    aa        0   True  # For ["aa", "yy"]
5        2  August  column2    yy        0   True

From here, what is your expect output?

Upvotes: 1

not_speshal
not_speshal

Reputation: 23166

stack the DataFrame and then loop through your lists:

df["index"] = df.index
srs = df.set_index(["index", "month"]).stack().rename("value")
for sublist in list_of_list:
    print(f"\nSub list: {sublist}")
    print(srs[srs.isin(sublist)].reset_index(2, drop=True).reset_index().drop_duplicates("index", keep=False).drop_duplicates("value", keep=False))
Inputs:
df = pd.DataFrame({"month": ["June", "June", "August"],
                   "column1": ["xx", "gg", "xx"],
                   "column2": ["aa", "xx", "yy"]})

list_of_list = [["aa", "yy"], ["gg", "xx"]]
Output:
Sub list: ['aa', 'yy']
   index   month value
0      0    June    aa
1      2  August    yy

Sub list: ['gg', 'xx']
Empty DataFrame
Columns: [index, month, value]
Index: []

Upvotes: 0

Roxy
Roxy

Reputation: 1053

Could use the any to achieve the same.

for sub_list in list_of_lists:
    print("\n")
    print("Sub_list: ", sub_list)
    for item in sub_list:
        print(df[(df == item).any(axis=1)])

Upvotes: 0

Related Questions