user5511186

Reputation: 43

Compare data from 2 files

I'm just starting to learn so sorry about any confusion.

I have 2 files. File A has the list of sample names I'm interested in, and File B has the data for all samples.

File A (no headers)

sample_A
sample_XA
sample_12754
samples_75t

File B

name                  description      etc .....
sample_JA                mm           0.01         0.1     1.2      0.018  etc
sample_A                 mm           0.001        1.2     0.8      1.4    etc
sample_XA                hu           0.4          0.021   0.14     2.34   etc
samples_YYYY             RN           0.0001       3.435   1.1      0.01   etc
sample_12754             mm           0.1          0.1     0.87     0.54   etc
sample_2248333           hu           0.43         0.01    0.11     2.32   etc
samples_75t              mm           0.3          0.02    0.14     2.34   etc

I want to compare file A to file B and output the data from B but only for the sample names listed in A.

I tried this.

#!/usr/bin/env python2

import csv

count = 0

import collections
samples = collections.defaultdict(list)
with open('FILEA.txt') as f:
    sites = [l.strip() for l in f if l.strip()]

###This gives me the correct list of samples for file A.

with open('FILEB','r') as inF:
   for line in inF:
       elements = line.split()
       if sites.intersection(elements):
          count += 1

          print (elements)

## Here I get the names of all samples in file B, and only the names. I want the data that is in file B, but just for the samples in A.

Then I tried using an intersection.

#!/usr/bin/env python2

import sys
import csv
import collections

samples = collections.defaultdict(list)
with open('FILEA.txt','r') as f:
    nsamples = [l.strip() for l in f if l.strip()]

print (nsamples)

with open ('FILEB','r') as inF:
    for row in inF:
        elements = row.split()
        if nsamples.intersection(elements):
            print(row[0,:])

Still doesn't work.

What do I have to do to get the output data as follows:
name                  description      etc .....
sample_A                 mm           0.001        1.2     0.8      1.4    etc
sample_XA                hu           0.4          0.021   0.14     2.34   etc
sample_12754             mm           0.1          0.1     0.87     0.54   etc
samples_75t              mm           0.3          0.02    0.14     2.34   etc

Any ideas will be very much appreciated. Thanks.

Upvotes: 1

Views: 47

Answers (1)

Padraic Cunningham

Reputation: 180411

Make a set of the lines from filea, then split each line from fileb once and check whether its first element is in that set:

with open("filea") as f, open("fileb") as f2:
    # make a set of the lines, stripping newlines
    # so we can compare properly later, i.e. "foo\n" != "foo"
    st = set(map(str.rstrip, f))  # on Python 2, itertools.imap avoids building an intermediate list
    for line in f2:
        # split once and extract first element to compare
        if line.strip() and line.split(None, 1)[0] in st:
            print(line.rstrip())

Output:

sample_A                 mm           0.001        1.2     0.8      1.4    etc
sample_XA                hu           0.4          0.021   0.14     2.34   etc
sample_12754             mm           0.1          0.1     0.87     0.54   etc
samples_75t              mm           0.3          0.02    0.14     2.34   etc
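
The output above does not include the header row that the question asks for, because "name" is not in the set. Also note that the attempts in the question fail because a list has no `.intersection` method; only `set` (and `frozenset`) do. Below is a minimal sketch of one way to keep the header as well. The function name `filter_rows` and the `keep_header` parameter are hypothetical, and the sketch assumes the header is the first non-blank line of file B. It uses in-memory files so it runs on its own; with real files you would pass the objects returned by `open()` instead.

```python
import io


def filter_rows(names_file, data_file, keep_header=True):
    """Yield lines of data_file whose first column is listed in names_file.

    Assumption: the header, if kept, is the first non-blank line of data_file.
    """
    # A set gives fast membership tests; blank lines in the names file are ignored.
    wanted = {line.strip() for line in names_file if line.strip()}
    header_pending = keep_header
    for line in data_file:
        fields = line.split(None, 1)  # split once on whitespace; only the first column is needed
        if not fields:
            continue  # skip blank lines
        if header_pending or fields[0] in wanted:
            yield line.rstrip()
        header_pending = False


# Small in-memory stand-ins for file A and file B.
file_a = io.StringIO("sample_A\nsample_XA\n")
file_b = io.StringIO(
    "name description value\n"
    "sample_JA mm 0.01\n"
    "sample_A mm 0.001\n"
    "sample_XA hu 0.4\n"
)

result = list(filter_rows(file_a, file_b))
for row in result:
    print(row)
```

Passing `keep_header=False` gives the same behaviour as the loop above, printing only the matching sample rows.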

Upvotes: 3
