chippycentra
chippycentra

Reputation: 879

Subset a data frame according to pattern matches in a file.txt pandas

I have a data frame such as

query   subject col1
A   dog ok
B   cat okl
C   cat oklp
D   frog    ok
E   cat ok
F   fox ok

and a file.txt such as:

dog
cat 

and the idea is to only keep row that have a pattern present in the file.txt. Here I should get :

query   subject col1
A   dog ok
B   cat okl
C   cat oklp
E   cat ok

I tried with :

file = open('file.txt').read()

df=[]
for row in tab['subject']:
 if row in file: 
   row.append(df)

but it does not seem to be the solution, thank you for your help.

Upvotes: 2

Views: 158

Answers (2)

Daniel Labbe
Daniel Labbe

Reputation: 2019

Considering that your data frame is called df, this answer reads the file.txt also as a dataframe and merge both data frames, resulting in the desired result - solution similar to an inner join from SQL:

>> df2 = pd.read_csv('file.txt', header= None, names=['subject'])
>> pd.merge(df, df2, on='subject')

    query   subject col1
0   A       dog     ok
1   B       cat     okl
2   C       cat     oklp
3   E       cat     ok

Upvotes: 2

Engineero
Engineero

Reputation: 12928

You can use df.loc[] to grab specific rows where a criterion is met, and isin() to check whether something in your dataframe shows up in a given list. First make a list of the entries in your file:

vals = []  # placeholder for our file contents
with open('file.txt', 'r') as a_file:
    for line in a_file:
        vals.append(line)

Next select rows where the subject column is one of the values read from your file. Assuming your dataframe is called df:

new_df = df.loc[df.subject.isin(vals)]

Side note: you almost never want to iterate over a dataframe. There are exceptions to this, but generally speaking if you find yourself iterating on a dataframe (e.g. for row in df:), you can rewrite that piece of code to take advantage of vectorized functions in Pandas which are much more efficient.

Upvotes: 3

Related Questions