lindak
lindak

Reputation: 167

Iterate column for matches in another column

I have files that look like:

chr1:92092600   G[chr2:164084669[   ENSG00000189195 ENST00000342818 BTBD8   chr2:164084669
chr1:121498879  T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2  chr9:2781522
chr2:101298260  ]chr3:196435392]A   ENSG00000163162 ENST00000295317 RNF149  chr3:196435392
chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chr9:2781522    ]chr1:121498879]T   ENSG00000080608 ENST00000490444 PUM3    chr1:121498879
chr3:196435392  A[chr2:101298260[   ENSG00000163960 ENST00000296328 UBXN7   chr2:101298260

And for every element in column 6 I would like to search column 1, and if present - print the entire line. So expected output for the first 3 elements in column 6 should look like:

chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chr9:2781522    ]chr1:121498879]T   ENSG00000080608 ENST00000490444 PUM3    chr1:121498879
chr3:196435392  A[chr2:101298260[   ENSG00000163960 ENST00000296328 UBXN7   chr2:101298260

So far I have:

import pandas as pd

pd.options.display.max_colwidth = 100
file =  open("data.txt", 'r')

chrA =[]
chrB = []
Bgenes = []

for line in file.readlines():
    chrA.append(line.split()[0])
    chrB.append(line.split()[5])
    for pos in chrB:
        if pos in chrA: 
            Bgenes.append(line)

Upvotes: 0

Views: 145

Answers (4)

Rebin
Rebin

Reputation: 526

I assumed that your data is separable by comma(you can add). The reason is your original data is having different amount of white space. here is the code and screen shot of the result which is what you want i guess.

import pandas as pd
data1 = pd.read_csv('C:/data.csv', sep=',', header=None)
data2 = pd.read_csv('C:/data.csv', sep=',', header=None)
df1=pd.DataFrame(data1) # create FIRST dataframe
df2=pd.DataFrame(data2) # create SECODN dataframe

df1.columns=['1','2','3','4','5','ID'] #assinging ID to column 6
df2.columns=['ID','2','3','4','5','6'] #assingning ID to column 1

dfMerged1=pd.merge(df1, df2, on='ID', how='inner') 
dfMerged2=pd.merge(df2, dfMerged1, on='ID', how='inner')

dfCleaned=dfMerged2.iloc[:,0:6] #what you want at the end
print(dfCleaned)

enter image description here

Upvotes: 0

Sainath Motlakunta
Sainath Motlakunta

Reputation: 945

You need to use a separate "for" loop for collecting and another loop for searching.

lines=file.readlines()
for line in lines: 
    for line2 in lines:
         if line.split()[5] ==line2.split()[0]:
             Bgenes.append(line2)

I hope this helps :)

Upvotes: 0

Alderven
Alderven

Reputation: 8260

You can also use list comprehension to find matches:

with open('data.txt', 'r') as f:
    lines = [line.split() for line in f.readlines()]

for line in lines:
    try:
        i = [x[0] for x in lines].index(line[5])
        print(' '.join(lines[i]))
    except IndexError:
        pass

Output:

chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600
chr9:2781522 ]chr1:121498879]T ENSG00000080608 ENST00000490444 PUM3 chr1:121498879
chr3:196435392 A[chr2:101298260[ ENSG00000163960 ENST00000296328 UBXN7 chr2:101298260
chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669
chr1:121498879 T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2 chr9:2781522
chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392

Upvotes: 2

Gokhan Gerdan
Gokhan Gerdan

Reputation: 1470

First put your data in a pandas DataFrame, than you can use this:

import pandas as pd

df = pd.DataFrame({"a": ["asdf", "qwer", "zxcv"],
                   "b": ["b_row_1", "b_row_2", "b_row_3"],
                   "c": ["ghjk", "qwer", "zxcv"]})

for index, row in df.iterrows():
    if row["c"] not in df["a"].tolist():
        df = df.drop(index)

The output should look like this:

      a        b     c
1  qwer  b_row_2  qwer
2  zxcv  b_row_3  zxcv

You can use something like this to read your file as a pandas DataFrame:

data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["a", "b", "c", "etc."]

Check these links:

Load data rom txt with pandas

How to iterate over rows in a datarame in pandas

Pandas dataframe drop

Upvotes: 1

Related Questions