Maria
Maria

Reputation: 1297

Select rows based on value in python pandas

I have a text file of subject #s like this (called subjects_visit1.txt):

577
610
650
770
883

and an Excel sheet that looks like this:

Low Connectivity    Subject CanEstimate Entropy     High Connectivity   Subject CanEstimate Entropy
0.0816764   870  TRUE   0.308933317     -0.0313064  668  TRUE   0.868862941
0.215038    577  TRUE   0.448918189     0.172506    600  TRUE   0.885315753
0.0596745   651  TRUE   0.47695019      0.268619    595  TRUE   0.896439952
0.267082    817  TRUE   0.500621849     -0.0507346  851  TRUE   0.907089718
0.18407 567  TRUE   0.508109648     0.182189    822  TRUE   0.915782923
0.0326328   731  TRUE   0.517241379     0.201325    623  TRUE   0.929279958
0.237822    625  TRUE   0.518493071     0.511613    622  TRUE   0.953520938
0.246291    913  TRUE   0.548079129     0.101731    850  TRUE   0.956564212
0.182494    619  TRUE   0.554281617     0.0195069   823  TRUE   0.958840854
-0.0321676  610  TRUE   0.55939053      0.0610047   632  TRUE   0.960237986
0.198884    655  TRUE   0.581442494     0.155816    770  TRUE   0.973656398
0.029618    631  TRUE   0.620796248     0.0703396   754  TRUE   1.012278949
0.205221    866  TRUE   0.630981714     0.19077 804  TRUE   1.023361826
-0.00397881 842  TRUE   0.658492788     0.115125    830  TRUE   1.033213695
0.193168    880  TRUE   0.665481783     -0.0440176  621  TRUE   1.035325469
0.0187139   838  TRUE   0.670966904     0.231593    603  TRUE   1.087118914
-0.0483586  829  TRUE   0.678253186     0.720004    732  TRUE   1.229303773
0.238947    634  TRUE   0.715214736     0.219465    746  TRUE   1.355378243

I want to make a new df with only the data from the subjects in the text file I load, but my code below hasn't been working. Is there something wrong with the data type of my subject list? or is it something else?

import pandas as pd

# load text file of subject #s
subject_list = open('subjects_visit2.txt', 'r')
lines = subject_list.read().split('\n')
subjs = list(lines)

newfile = pd.ExcelFile('amygdala_mPFC_data_pandas.xlsx')
df_ROI1 = newfile.parse("01")

# restrict to subject #s in text file 
print df_ROI1['Subject'].isin(subjs)

df_ROI1 = df_ROI1[df_ROI1['Subject'].isin(subjs)]

Upvotes: 0

Views: 257

Answers (2)

Jonathan DEKHTIAR
Jonathan DEKHTIAR

Reputation: 3536

You can use the following:

In [5]: df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})

In [6]: df
Out[6]:
   A  B
0  5  1
1  6  2
2  3  3
3  4  5

In [7]: df[df['A'].isin([3, 6])]
Out[7]:
   A  B
1  6  2
2  3  3

By the way, if you use a Notebook environment, it might be better to use :

df.head(n=5) # Gives you the first 5 rows of the dataframe
df.sample(n=5) # Gives you a random set of 5 rows of the dataframe

Edit1: What happens if you execute the following :

values_list = df_ROI1['Subject'].unique()

if "577" in values_list:
    print ("577 is in the dataframe and is a string")
elif 577 in values_list:
    print ("577 is in the dataframe and is an integer")
else:
    print ("577 is NOT in the dataframe")

Edit2:

So the mistake you did is you are giving the algorithm a string and not an integer.

Please try:

df_ROI1 = df_ROI1[df_ROI1['Subject'].isin([577])] # Without the quotes around 577
df_ROI1.head(n=5)

Upvotes: 2

TSeymour
TSeymour

Reputation: 739

You may need to pip install xlrd to make this work with .xlsx files Otherwise, save your data to .csv and use pd.read_csv()

Also, the data you posted seems to have 8 columns, but I think it's just 4, right? If not, then there is a problem with duplicate variable names that needs to get solved.

import pandas as pd

with open('subjects_visit2.txt', 'r') as infile:
    # put contents into a list without the newlines
    subject_list = infile.read().splitlines()

# convert subject_list to a list of integers
subject_list = [int(subject) for subject in subject_list]

# open data file and show 1st 5 rows
df = pd.read_excel('amygdala_mPFC_data_pandas.xlsx')
print(df.head())

# uses .query() which allows easy to read syntax.
# Note: The @ symbol allows access to objects not defined in the data frame
new_df = df.query('Subject in @subject_list')
print(new_df)

The output will look like this:

   Connectivity  Subject CanEstimate   Entropy
0      0.081676      870        True  0.308933
1      0.215038      577        True  0.448918
2      0.059674      651        True  0.476950
3      0.267082      817        True  0.500622
4      0.184070      567        True  0.508110

    Connectivity  Subject CanEstimate   Entropy
1       0.215038      577        True  0.448918
9      -0.032168      610        True  0.559391
28      0.155816      770        True  0.973656

Upvotes: 1

Related Questions