Cameron Cheung

Reputation: 319

Get a subset of columns by row value in pandas

I have a DataFrame of users and their ratings for movies:

userId   movie1   movie2   movie3   movie4   movie5   movie6
0        4.1      NaN      1.0      NaN      2.1      NaN  
1        3.1      1.1      3.4      1.4      NaN      NaN  
2        2.8      NaN      1.7      NaN      3.0      NaN  
3        NaN      5.0      NaN      2.3      NaN      2.1  
4        NaN      NaN      NaN      NaN      NaN      NaN  
5        2.3      NaN      2.0      4.0      NaN      NaN  

There isn't actually a userId column in the DataFrame; it's just being used as the index.

From this DataFrame, I'm trying to make another DataFrame that only contains the movies that have been rated by a specific user. For example, if I wanted to make a new DataFrame of only the movies rated by the user with userId == 0, the output would be a DataFrame with:

userId   movie1   movie3   movie5
0        4.1      1.0      2.1 
1        3.1      3.4      NaN  
2        2.8      1.7      3.0  
3        NaN      NaN      NaN  
4        NaN      NaN      NaN  
5        2.3      2.0      NaN  

I know how to iterate over the columns, but I don't know how to select the columns I want by checking a row value.
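
For reference, the sample frame above can be reconstructed with something like this (a minimal sketch; the variable name df is the one the answers below use):

import numpy as np
import pandas as pd

# Ratings matrix from the question; the index plays the role of userId.
df = pd.DataFrame(
    {
        "movie1": [4.1, 3.1, 2.8, np.nan, np.nan, 2.3],
        "movie2": [np.nan, 1.1, np.nan, 5.0, np.nan, np.nan],
        "movie3": [1.0, 3.4, 1.7, np.nan, np.nan, 2.0],
        "movie4": [np.nan, 1.4, np.nan, 2.3, np.nan, 4.0],
        "movie5": [2.1, np.nan, 3.0, np.nan, np.nan, np.nan],
        "movie6": [np.nan, np.nan, np.nan, 2.1, np.nan, np.nan],
    }
)
df.index.name = "userId"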

Upvotes: 1

Views: 503

Answers (3)

sophocles

Reputation: 13841

Use loc to pull the userId of interest into a new df, keeping only the columns that user has actually rated (dropping the NaN columns).

Then pd.concat that new df with the rows for the other userIds, restricted to the columns (movies) of the userId you selected:

user = 0  # set your userId

a = df.loc[[user]].dropna(axis=1)                # that user's row, non-NaN columns only
b = pd.concat([a, df.drop(a.index)[a.columns]])  # append the remaining users, same columns

Which prints:

b
        movie1  movie3  movie5
userId                        
0         4.10    1.00    2.10
1         3.10    3.40     NaN
2         2.80    1.70    3.00
3          NaN     NaN     NaN
4          NaN     NaN     NaN
5         2.30    2.00     NaN

Note that I have set the index to be userId as you specified.

Upvotes: 0

Shubham Sharma

Reputation: 71707

You can use the .loc accessor to select the row for the particular userId, then use notna to create a boolean mask marking the columns that do not contain NaN for that user, and finally use this mask to filter the columns:

userId = 0 # specify the userid here
df_user = df.loc[:, df.loc[userId].notna()]

Details:

>>> df.loc[userId].notna()

movie1     True
movie2    False
movie3     True
movie4    False
movie5     True
movie6    False
Name: 0, dtype: bool

>>> df.loc[:, df.loc[userId].notna()]

        movie1  movie3  movie5
userId                        
0          4.1     1.0     2.1
1          3.1     3.4     NaN
2          2.8     1.7     3.0
3          NaN     NaN     NaN
4          NaN     NaN     NaN
5          2.3     2.0     NaN

Upvotes: 2

Zero Pancakes

Reputation: 296

Another approach:

import pandas as pd

user0 = df.iloc[0,:]       #select the first row
flags = user0.notna()      #flag the non NaN values
flags = flags.tolist()     #convert to list instead of series
newdf = df.iloc[:,flags]   #return all rows, and the columns where flags are true
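
Note that iloc is positional, so this picks out userId 0 only because it happens to be the first row. A label-based sketch of the same idea (the variable name user is just for illustration) would be:

user = 0                      # target userId (index label, not position)
flags = df.loc[user].notna()  # boolean mask of that user's rated movies
newdf = df.loc[:, flags]      # all rows, only the rated columns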

Upvotes: 0
