Reputation: 3557
I've got two text files which both have index lines. I want to compare file1 and file2 and send the matching lines to a new text file. I've been googling this for a while now and have been trying grep in various forms, but I feel I'm getting in over my head. What I'd ultimately like is to see which 'Mon-######' identifiers from file2 appear in file1 and print the corresponding lines from file1.
(The files are much larger; I cut them down for brevity's sake.)
For even greater clarity:
file1 has entries of the form:
Mon-000101 100.27242 9.608597 11.082 10.034
Mon-000102 100.18012 9.520860 12.296 12.223
file2 has entries of the form:
Mon-000101
Mon-000171
So, if the identifier (Mon-000101, for instance) from file2 is listed in file1, I want the entire line from file1 that begins with Mon-000101 printed into a separate file. If an identifier isn't listed in file2, its line can be discarded.
So if the files were only as large as the examples above, the newly produced file would have the single entry of
Mon-000101 100.27242 9.608597 11.082 10.034
because that's the only one common to both.
Upvotes: 0
Views: 2310
Reputation: 23364
Using grep and sed in bash: the sed call prefixes each ID in file2.txt with ^, so grep -f only matches the IDs at the start of a line in file1.txt. This may not be very performant with very large files.
grep -f <(sed 's/^/^/' file2.txt) file1.txt
Upvotes: 0
Reputation: 174624
As the files may be big, how about this approach, which uses SQLite to handle the file manipulation:
import sqlite3
import csv
import os

# Build a throwaway SQLite database to hold the master file
conn = sqlite3.connect('temp.db')
c = conn.cursor()

c.execute('''CREATE TABLE master
             (id text, ra text, dec text, mean1 text, mean2 text)''')
conn.commit()  # Write changes

# Load every row of the master file into the database
with open('master.csv') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)  # skips header
    for row in reader:
        c.execute('INSERT INTO master VALUES (?,?,?,?,?)', row)
conn.commit()

# Look up each ID from the filter file and write the matching rows out
with open('filter.txt') as f, open('diff.txt', 'w') as out:
    writer = csv.writer(out, delimiter=',')
    writer.writerow(('NAME', 'RA', 'DEC', 'Mean_I1', 'Mean_I2'))
    for line in f:
        c.execute('SELECT * FROM master WHERE id = ?', (line.strip(),))
        row = c.fetchone()
        if row:
            writer.writerow(row)

conn.close()
os.remove('temp.db')
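If the master table is really large, each SELECT ... WHERE id = ? will scan the whole table unless the id column is indexed. An optional addition, sketched here as a standalone snippet against the same temp.db and master table built above (idx_master_id is just a name I've picked), to run after the inserts and before temp.db is removed:
import sqlite3

conn = sqlite3.connect('temp.db')
c = conn.cursor()

# Index the id column so each per-key SELECT is a keyed lookup, not a full scan
c.execute('CREATE INDEX IF NOT EXISTS idx_master_id ON master (id)')
conn.commit()
conn.close()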
Upvotes: 0
Reputation: 20028
One way to solve this (provided the files aren't too large) would be to read in file1 and store the data as a dict, where each line is keyed by the index (first column) with the remaining columns as the value. Then read file2 as a list of keys, which you can then use to extract the matching lines from the file1 data.
A quick and dirty solution:
#!/usr/bin/env python

DATA_FILE = 'file1.txt'
KEY_FILE = 'file2.txt'

# Read a list of keys to search for
keys = []
lineno = 1
for line in open(KEY_FILE):
    if lineno > 1:
        keys.append(line.strip())
    lineno += 1

# Read data
data = {}
lineno = 1
for line in open(DATA_FILE):
    if lineno > 1:
        fields = line.split()
        data[fields[0]] = fields[1:]
    lineno += 1

# Extract data using keys
extracted_data = [[k, data[k]] for k in keys if k in data]
for k, v in extracted_data:
    print k, ' '.join(v)
There are probably more efficient ways of doing this, but it will do the job and lets you add more logic as required.
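For instance, a more compact sketch of the same idea using a set for the keys, assuming (as above) that both files have a one-line header; matched.txt is just a placeholder output name:
#!/usr/bin/env python

DATA_FILE = 'file1.txt'
KEY_FILE = 'file2.txt'
OUT_FILE = 'matched.txt'  # placeholder output file name

# Collect the identifiers from file2 into a set, skipping the header line
with open(KEY_FILE) as f:
    next(f)
    keys = set(line.strip() for line in f)

# Copy across only the lines of file1 whose first column is in that set
with open(DATA_FILE) as f, open(OUT_FILE, 'w') as out:
    next(f)
    for line in f:
        fields = line.split()
        if fields and fields[0] in keys:
            out.write(line)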
Upvotes: 0
Reputation: 353129
Since from earlier questions you're at least a little familiar with pandas, how about:
import pandas as pd
df1 = pd.read_csv("file1.csv", sep=r"\s+")
df2 = pd.read_csv("file2.csv", sep=r"\s+")
merged = df1.merge(df2.rename_axis({"Mon-id": "NAME"}))
merged.to_csv("merged.csv", index=False)
Some explanation follows (note that I've modified file2.csv so that there are more elements in common).
First, read the data:
>>> import pandas as pd
>>> df1 = pd.read_csv("file1.csv", sep=r"\s+")
>>> df2 = pd.read_csv("file2.csv", sep=r"\s+")
>>> df1.head()
NAME RA DEC Mean_I1 Mean_I2
0 Mon-000101 100.27242 9.608597 11.082 10.034
1 Mon-000102 100.18012 9.520860 12.296 12.223
2 Mon-000103 100.24811 9.586362 9.429 9.010
3 Mon-000104 100.26741 9.867225 11.811 11.797
4 Mon-000105 100.21005 9.814060 12.087 12.090
>>> df2.head()
Mon-id
0 Mon-000101
1 Mon-000121
2 Mon-000131
3 Mon-000141
4 Mon-000151
Then, we can rename the axis in df2:
>>> df2.rename_axis({"Mon-id": "NAME"}).head()
NAME
0 Mon-000101
1 Mon-000121
2 Mon-000131
3 Mon-000141
4 Mon-000151
and after that, merge will simply do the right thing:
>>> merged = df1.merge(df2.rename_axis({"Mon-id": "NAME"}))
>>> merged
NAME RA DEC Mean_I1 Mean_I2
0 Mon-000101 100.27242 9.608597 11.082 10.034
1 Mon-000121 100.45421 9.685027 11.805 11.777
2 Mon-000131 100.20533 9.397307 -100.000 11.764
3 Mon-000141 100.26134 9.388555 -100.000 12.571
Finally, we can write this out, telling it not to add an index column:
>>> merged.to_csv("output.csv", index=False)
producing a file which looks like
NAME,RA,DEC,Mean_I1,Mean_I2
Mon-000101,100.27242,9.608597,11.082,10.034
Mon-000121,100.45421,9.685027,11.805,11.777
Mon-000131,100.20533,9.397307,-100.0,11.764
Mon-000141,100.26134,9.388555,-100.0,12.571
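An alternative sketch, if you'd rather skip the rename step entirely: filter df1 directly with isin, using the same column names as above.
import pandas as pd

df1 = pd.read_csv("file1.csv", sep=r"\s+")
df2 = pd.read_csv("file2.csv", sep=r"\s+")

# Keep only the rows of df1 whose NAME occurs in df2's Mon-id column
merged = df1[df1["NAME"].isin(df2["Mon-id"])]
merged.to_csv("output.csv", index=False)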
Upvotes: 0
Reputation: 12142
Since you added the python tag, it seems you want something like this:
import csv

# Read the IDs from file2 into a set for fast membership tests
f = open('file2')
l = set([line.strip() for line in f.readlines()])

with open('file1', 'rb') as csvfile:
    # Let the csv module sniff the delimiter from a sample of the file
    dialect = csv.Sniffer().sniff(csvfile.read(10024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    cnt = 0
    for item in reader:
        if cnt > 0:  # skip the header row
            data = item[0]
            if data in l:
                print item
        cnt = cnt + 1
Upvotes: 1