Matt

Reputation: 3557

Comparing two files and printing similar lines

I've got two text files which both have index lines. I want to compare file1 and file2 and send the similar lines to a new text file. I've been googling this for a while now and have been trying grep in various forms, but I feel I'm getting in over my head. Ultimately, I'd like to find the 'Mon-######' identifiers from file2 that appear in file1 and print the corresponding lines from file1.

(The files are much larger, I cut them down for brevity's sake)

For even greater clarity:

file1 has entries of the form:

Mon-000101  100.27242   9.608597   11.082   10.034
Mon-000102  100.18012   9.520860   12.296   12.223

file2 has entries of the form:

Mon-000101
Mon-000171

So, if an identifier from file2 (Mon-000101, for instance) is listed in file1, I want the entire line that begins with Mon-000101 printed to a separate file. Any file1 line whose identifier isn't listed in file2 can be discarded.

So if the files contained only the lines shown above, the newly produced file would have the single entry

Mon-000101  100.27242   9.608597   11.082   10.034

because that's the only identifier common to both.

Upvotes: 0

Views: 2310

Answers (6)

Roman Newaza

Reputation: 11690

$ join <(sort file1) <(sort file2) > duplicated-lines
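As a quick check, assuming bash (needed for the process substitutions) and sample files recreated from the question, this yields the single common line. Note that join collapses the runs of spaces into single-space field separators:

```shell
# Recreate the question's sample data (file names are assumed)
printf 'Mon-000101  100.27242   9.608597   11.082   10.034\n' >  file1
printf 'Mon-000102  100.18012   9.520860   12.296   12.223\n' >> file1
printf 'Mon-000101\nMon-000171\n' > file2

# join matches rows whose first field appears in both sorted inputs
join <(sort file1) <(sort file2) > duplicated-lines
cat duplicated-lines
```

join requires both inputs to be sorted on the join field, hence the two sorts.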

Upvotes: 1

iruvar

Reputation: 23364

Using grep and sed in bash. This may not perform well on very large files.

grep -f <(sed 's/^/^/' file2.txt) file1.txt
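For example, with sample files recreated from the question (the file names are assumed), this prints the matching line verbatim, original whitespace included, since grep emits the input line rather than reassembling fields. The sed step prefixes each id with ^ so it only matches at the start of a line:

```shell
# Recreate the question's sample data (hypothetical file names)
printf 'Mon-000101  100.27242   9.608597   11.082   10.034\n' >  file1.txt
printf 'Mon-000102  100.18012   9.520860   12.296   12.223\n' >> file1.txt
printf 'Mon-000101\nMon-000171\n' > file2.txt

# Treat each id in file2.txt as a line-anchored pattern against file1.txt
grep -f <(sed 's/^/^/' file2.txt) file1.txt
```

The process substitution `<( )` requires bash.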

Upvotes: 0

Burhan Khalid

Reputation: 174624

As the files may be big, how about this approach, which uses sqlite to handle the file manipulation:

import sqlite3
import csv
import os

conn = sqlite3.connect('temp.db')

c = conn.cursor()
c.execute('''CREATE TABLE master
          (id text, ra text, dec text, mean1 text, mean2 text)''')
conn.commit() # Write changes

with open('master.csv') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader) # skips header
    for row in reader:
        c.execute('INSERT INTO master VALUES (?,?,?,?,?)', row)
conn.commit() # commit once after all inserts, which is much faster

with open('filter.txt') as f, open('diff.txt','w') as out:
    writer = csv.writer(out, delimiter=',')
    writer.writerow(('NAME','RA','DEC','Mean_I1','Mean_I2'))
    for line in f:
         c.execute('SELECT * FROM master WHERE id = ?',(line.strip(),))
         row = c.fetchone()
         if row:
             writer.writerow(row)
conn.close()
os.remove('temp.db')

Upvotes: 0

gavinb

Reputation: 20028

One way to solve this (provided the files aren't too large) would be to read in file1 and store the data as a dict, where each line is keyed by the index (first column) with the remaining columns as the value. Then read file2 as a list of keys and use it to look up the matching lines in the file1 data.

A quick and dirty solution:

#!/usr/bin/env python

DATA_FILE='file1.txt'
KEY_FILE='file2.txt'

# Read the list of keys to search for (skipping the header line)
keys = []
with open(KEY_FILE) as f:
    next(f)
    for line in f:
        keys.append(line.strip())

# Read the data, keyed by the first column (skipping the header line)
data = {}
with open(DATA_FILE) as f:
    next(f)
    for line in f:
        fields = line.split()
        data[fields[0]] = fields[1:]

# Extract the entries whose key appears in the key file
extracted_data = [[k, data[k]] for k in keys if k in data]

for k, v in extracted_data:
    print(k, ' '.join(v))

There are probably more efficient ways of doing this, but it will do the job and lets you add more logic as required.
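A more compact sketch of the same dict-lookup idea, assuming (as the code above does) that each file carries a one-line header; the sample data here is reconstructed from the question:

```python
# Write sample files matching the question's format (hypothetical headers)
with open('file1.txt', 'w') as f:
    f.write('NAME RA DEC Mean_I1 Mean_I2\n')
    f.write('Mon-000101  100.27242   9.608597   11.082   10.034\n')
    f.write('Mon-000102  100.18012   9.520860   12.296   12.223\n')
with open('file2.txt', 'w') as f:
    f.write('Mon-id\nMon-000101\nMon-000171\n')

# Collect the wanted ids into a set for O(1) membership tests
with open('file2.txt') as f:
    next(f)                      # skip header
    keys = {line.strip() for line in f}

# Keep each file1 line whose first column is a wanted id
matches = []
with open('file1.txt') as f:
    next(f)                      # skip header
    for line in f:
        if line.split()[0] in keys:
            matches.append(line.rstrip())

for m in matches:
    print(m)
```

Using a set instead of a list makes each lookup constant time, which matters once the key file grows.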

Upvotes: 0

DSM

Reputation: 353129

Since from earlier questions you're at least a little familiar with pandas, how about:

import pandas as pd
df1 = pd.read_csv("file1.csv", sep=r"\s+")
df2 = pd.read_csv("file2.csv", sep=r"\s+")
merged = df1.merge(df2.rename(columns={"Mon-id": "NAME"}))
merged.to_csv("output.csv", index=False)

Some explanation (note that I've modified file2.csv so that there are more elements in common) follows.

First, read the data:

>>> import pandas as pd
>>> df1 = pd.read_csv("file1.csv", sep=r"\s+")
>>> df2 = pd.read_csv("file2.csv", sep=r"\s+")
>>> df1.head()
         NAME         RA       DEC  Mean_I1  Mean_I2
0  Mon-000101  100.27242  9.608597   11.082   10.034
1  Mon-000102  100.18012  9.520860   12.296   12.223
2  Mon-000103  100.24811  9.586362    9.429    9.010
3  Mon-000104  100.26741  9.867225   11.811   11.797
4  Mon-000105  100.21005  9.814060   12.087   12.090
>>> df2.head()
       Mon-id
0  Mon-000101
1  Mon-000121
2  Mon-000131
3  Mon-000141
4  Mon-000151

Then, we can rename the column in df2 so it matches df1's key column:

>>> df2.rename(columns={"Mon-id": "NAME"}).head()
         NAME
0  Mon-000101
1  Mon-000121
2  Mon-000131
3  Mon-000141
4  Mon-000151

and after that, merge will simply do the right thing:

>>> merged = df1.merge(df2.rename(columns={"Mon-id": "NAME"}))
>>> merged
         NAME         RA       DEC  Mean_I1  Mean_I2
0  Mon-000101  100.27242  9.608597   11.082   10.034
1  Mon-000121  100.45421  9.685027   11.805   11.777
2  Mon-000131  100.20533  9.397307 -100.000   11.764
3  Mon-000141  100.26134  9.388555 -100.000   12.571

Finally, we can write this out, telling it not to add an index column:

>>> merged.to_csv("output.csv", index=False)

producing a file which looks like

NAME,RA,DEC,Mean_I1,Mean_I2
Mon-000101,100.27242,9.608597,11.082,10.034
Mon-000121,100.45421,9.685027,11.805,11.777
Mon-000131,100.20533,9.397307,-100.0,11.764
Mon-000141,100.26134,9.388555,-100.0,12.571
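If you also want to see which ids matched and which didn't, merge's indicator parameter labels each row. A small sketch using hypothetical in-memory frames with the same column names:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames
df1 = pd.DataFrame({"NAME": ["Mon-000101", "Mon-000102"],
                    "RA": [100.27242, 100.18012]})
df2 = pd.DataFrame({"Mon-id": ["Mon-000101", "Mon-000171"]})

# An outer merge keeps unmatched rows from both sides;
# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
both = df1.merge(df2.rename(columns={"Mon-id": "NAME"}),
                 how="outer", indicator=True)
print(both)
```

Filtering on `both['_merge'] == 'both'` then recovers the matched rows, while `'right_only'` shows file2 ids missing from file1.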

Upvotes: 0

carlosdc

Reputation: 12142

Since you added the python tag, it seems you want something like this:

import csv

# Collect the wanted ids from file2 into a set
with open('file2') as f:
    wanted = {line.strip() for line in f}

with open('file1', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(10024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    next(reader)  # skip the header row
    for item in reader:
        if item[0] in wanted:
            print(item)

Upvotes: 1
