Reputation: 1297
I have a text file of subject #s like this (called subjects_visit1.txt):
577
610
650
770
883
and an Excel sheet that looks like this:
Low Connectivity Subject CanEstimate Entropy High Connectivity Subject CanEstimate Entropy
0.0816764 870 TRUE 0.308933317 -0.0313064 668 TRUE 0.868862941
0.215038 577 TRUE 0.448918189 0.172506 600 TRUE 0.885315753
0.0596745 651 TRUE 0.47695019 0.268619 595 TRUE 0.896439952
0.267082 817 TRUE 0.500621849 -0.0507346 851 TRUE 0.907089718
0.18407 567 TRUE 0.508109648 0.182189 822 TRUE 0.915782923
0.0326328 731 TRUE 0.517241379 0.201325 623 TRUE 0.929279958
0.237822 625 TRUE 0.518493071 0.511613 622 TRUE 0.953520938
0.246291 913 TRUE 0.548079129 0.101731 850 TRUE 0.956564212
0.182494 619 TRUE 0.554281617 0.0195069 823 TRUE 0.958840854
-0.0321676 610 TRUE 0.55939053 0.0610047 632 TRUE 0.960237986
0.198884 655 TRUE 0.581442494 0.155816 770 TRUE 0.973656398
0.029618 631 TRUE 0.620796248 0.0703396 754 TRUE 1.012278949
0.205221 866 TRUE 0.630981714 0.19077 804 TRUE 1.023361826
-0.00397881 842 TRUE 0.658492788 0.115125 830 TRUE 1.033213695
0.193168 880 TRUE 0.665481783 -0.0440176 621 TRUE 1.035325469
0.0187139 838 TRUE 0.670966904 0.231593 603 TRUE 1.087118914
-0.0483586 829 TRUE 0.678253186 0.720004 732 TRUE 1.229303773
0.238947 634 TRUE 0.715214736 0.219465 746 TRUE 1.355378243
I want to make a new df with only the data from the subjects in the text file I load, but my code below hasn't been working. Is there something wrong with the data type of my subject list? or is it something else?
import pandas as pd
# load text file of subject #s
subject_list = open('subjects_visit2.txt', 'r')
lines = subject_list.read().split('\n')
subjs = list(lines)
newfile = pd.ExcelFile('amygdala_mPFC_data_pandas.xlsx')
df_ROI1 = newfile.parse("01")
# restrict to subject #s in text file
print df_ROI1['Subject'].isin(subjs)
df_ROI1 = df_ROI1[df_ROI1['Subject'].isin(subjs)]
Upvotes: 0
Views: 257
Reputation: 3536
You can use the following:
In [5]: df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
In [6]: df
Out[6]:
A B
0 5 1
1 6 2
2 3 3
3 4 5
In [7]: df[df['A'].isin([3, 6])]
Out[7]:
A B
1 6 2
2 3 3
By the way, if you use a Notebook environment, it might be better to use :
df.head(n=5) # Gives you the first 5 rows of the dataframe
df.sample(n=5) # Gives you a random set of 5 rows of the dataframe
Edit1: What happens if you execute the following :
values_list = df_ROI1['Subject'].unique()
if "577" in values_list:
print ("577 is in the dataframe and is a string")
elif 577 in values_list:
print ("577 is in the dataframe and is an integer")
else:
print ("577 is NOT in the dataframe")
Edit2:
So the mistake you did is you are giving the algorithm a string and not an integer.
Please try:
df_ROI1 = df_ROI1[df_ROI1['Subject'].isin([577])] # Without the quotes around 577
df_ROI1.head(n=5)
Upvotes: 2
Reputation: 739
You may need to pip install xlrd
to make this work with .xlsx files
Otherwise, save your data to .csv and use pd.read_csv()
Also, the data you posted seems to have 8 columns, but I think it's just 4, right? If not, then there is a problem with duplicate variable names that needs to get solved.
import pandas as pd
with open('subjects_visit2.txt', 'r') as infile:
# put contents into a list without the newlines
subject_list = infile.read().splitlines()
# convert subject_list to a list of integers
subject_list = [int(subject) for subject in subject_list]
# open data file and show 1st 5 rows
df = pd.read_excel('amygdala_mPFC_data_pandas.xlsx')
print(df.head())
# uses .query() which allows easy to read syntax.
# Note: The @ symbol allows access to objects not defined in the data frame
new_df = df.query('Subject in @subject_list')
print(new_df)
The output will look like this:
Connectivity Subject CanEstimate Entropy
0 0.081676 870 True 0.308933
1 0.215038 577 True 0.448918
2 0.059674 651 True 0.476950
3 0.267082 817 True 0.500622
4 0.184070 567 True 0.508110
Connectivity Subject CanEstimate Entropy
1 0.215038 577 True 0.448918
9 -0.032168 610 True 0.559391
28 0.155816 770 True 0.973656
Upvotes: 1