Reputation: 75
I have an interesting problem:
file1.csv has a few hundred rows like:
Code,DTime
1,2010-12-26 17:01
2,2010-12-26 17:07
2,2010-12-26 17:15
file2.csv has about 11 million rows like:
id,D,Sym,DateTime,Bid,Ask
1375022797,D,USD,2010-12-26 17:00:15,1.311400,1.311700
1375022965,D,USD,2010-12-26 17:00:56,1.311200,1.311500
1375022984,D,USD,2010-12-26 17:00:56,1.311300,1.311600
1375023013,D,USD,2010-12-26 17:01:01,1.311200,1.311500
1375023039,D,USD,2010-12-26 17:01:02,1.311100,1.311400
1375023055,D,USD,2010-12-26 17:01:03,1.311200,1.311500
1375023063,D,USD,2010-12-26 17:01:03,1.311300,1.311600
What I'm trying to do is write a script that takes each DTime value in file1.csv, finds the first row of file2.csv whose DateTime column partially matches it, and outputs DateTime, Bid, Ask for that row. The partial match is on the first 16 characters.
Both files are sorted from oldest to newest, so if "2010-12-26 17:01" from file1.csv matched 4 entries in file2.csv, I only need to extract the first one: "2010-12-26 17:01:01"
I'm not sure how to proceed. I tried a dictionary, but the order of values is important, so I'm not sure that would work. Maybe read file1's DTime column into a list and, for each entry in that list, search the DateTime column of file2?
Thanks guys
Upvotes: 4
Views: 23762
Reputation: 50200
Unless you only need to do this once, you should really use a database. Add a column to table2 that contains DATETIME without the seconds, so that you can join on exact matches, not with LIKE.
It WILL be fast, and even faster if you index those columns. And if you can store file1.csv in the database too, you don't need iterations: You can get the entire set of results in a single select query. This is the kind of stuff SQL is made for.
PS. If you decide to pursue this approach, you can ask for help with the query.
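A minimal sketch of what that could look like, assuming SQLite through Python's standard sqlite3 module (the answer does not name a database) and written for Python 3. The table names t1 and t2, the extra Minute column, and the function name first_match_per_minute are made up for illustration. Rows are inserted in file order, so the smallest rowid within a minute is the first row of that minute.

```python
import csv
import io
import sqlite3


def first_match_per_minute(file1, file2):
    """Return (DateTime, Bid, Ask) of the first file2 row for each file1 minute."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (Code TEXT, DTime TEXT)")
    conn.execute(
        "CREATE TABLE t2 (id INTEGER, DateTime TEXT, Minute TEXT, Bid TEXT, Ask TEXT)"
    )
    conn.executemany(
        "INSERT INTO t1 VALUES (?, ?)",
        ((r["Code"], r["DTime"]) for r in csv.DictReader(file1)),
    )
    # Minute is DateTime without the seconds, so the join is an exact match.
    conn.executemany(
        "INSERT INTO t2 VALUES (?, ?, ?, ?, ?)",
        (
            (r["id"], r["DateTime"], r["DateTime"][:16], r["Bid"], r["Ask"])
            for r in csv.DictReader(file2)
        ),
    )
    conn.execute("CREATE INDEX idx_t2_minute ON t2 (Minute)")
    rows = conn.execute(
        """
        SELECT t2.DateTime, t2.Bid, t2.Ask
        FROM t1
        JOIN t2 ON t2.rowid = (
            SELECT MIN(x.rowid) FROM t2 AS x WHERE x.Minute = t1.DTime
        )
        ORDER BY t1.rowid
        """
    ).fetchall()
    conn.close()
    return rows


# Tiny in-memory stand-ins for file1.csv and file2.csv.
sample1 = io.StringIO("Code,DTime\n1,2010-12-26 17:01\n2,2010-12-26 17:07\n")
sample2 = io.StringIO(
    "id,D,Sym,DateTime,Bid,Ask\n"
    "1375022797,D,USD,2010-12-26 17:00:15,1.311400,1.311700\n"
    "1375023013,D,USD,2010-12-26 17:01:01,1.311200,1.311500\n"
    "1375023039,D,USD,2010-12-26 17:01:02,1.311100,1.311400\n"
)
for result in first_match_per_minute(sample1, sample2):
    print(result)
```

For the real files you would pass open file objects and probably use an on-disk database instead of ":memory:", so the 11 million rows only need to be loaded once.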
Upvotes: 3
Reputation: 36715
If you don't have duplicate DTime values, this should work:
import csv
file1reader = csv.reader(open("file1.csv"), delimiter=",")
file2reader = csv.reader(open("file2.csv"), delimiter=",")
header1 = file1reader.next() #header
header2 = file2reader.next() #header
for Code, DTime in file1reader:
    for id_, D, Sym, DateTime, Bid, Ask in file2reader:
        if DateTime.startswith(DTime): # found it
            print DateTime, Bid, Ask # output data
            break # break and continue where we left next time
Edit: this version parses both columns as datetimes and takes the first file2 row at or after each DTime, so it still finds a row when no entry exists for that exact minute:
import csv
from datetime import datetime
file1reader = csv.reader(open("file1.csv"), delimiter=",")
file2reader = csv.reader(open("file2.csv"), delimiter=",")
header1 = file1reader.next() #header
header2 = file2reader.next() #header
for Code, DTime in file1reader:
    DTime = datetime.strptime(DTime, "%Y-%m-%d %H:%M")
    for id_, D, Sym, DateTime, Bid, Ask in file2reader:
        DateTime = datetime.strptime(DateTime, "%Y-%m-%d %H:%M:%S")
        if DateTime>=DTime: # found it
            print DateTime, Bid, Ask # output data
            break # break and continue where we left next time
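For what it's worth, here is a hedged Python 3 sketch of the same single-pass idea that also copes with duplicate DTime values and with minutes that have no row in file2. It is a variation, not the code above: it holds on to the current file2 row instead of consuming it, and it compares the 16-character prefixes as plain strings, which works because this timestamp format sorts chronologically. The function name first_matches is made up for illustration.

```python
import csv
import io


def first_matches(file1, file2):
    """Yield (DateTime, Bid, Ask) for the first file2 row in each file1 minute.

    Both inputs must be sorted oldest to newest, as the question states.
    """
    reader1 = csv.DictReader(file1)
    reader2 = csv.DictReader(file2)
    row = next(reader2, None)  # current file2 row, not yet used up
    for wanted in reader1:
        minute = wanted["DTime"]  # e.g. "2010-12-26 17:01"
        # Skip file2 rows that are older than the wanted minute.
        while row is not None and row["DateTime"][:16] < minute:
            row = next(reader2, None)
        if row is None:
            break  # file2 is exhausted, nothing later can match
        if row["DateTime"].startswith(minute):
            yield row["DateTime"], row["Bid"], row["Ask"]
        # Otherwise there is no row for this minute; keep `row` for the next DTime.


# Tiny in-memory stand-ins for file1.csv and file2.csv.
sample1 = io.StringIO("Code,DTime\n1,2010-12-26 17:01\n2,2010-12-26 17:07\n")
sample2 = io.StringIO(
    "id,D,Sym,DateTime,Bid,Ask\n"
    "1375022797,D,USD,2010-12-26 17:00:15,1.311400,1.311700\n"
    "1375023013,D,USD,2010-12-26 17:01:01,1.311200,1.311500\n"
    "1375023039,D,USD,2010-12-26 17:01:02,1.311100,1.311400\n"
)
for date_time, bid, ask in first_matches(sample1, sample2):
    print(date_time, bid, ask)
```

Each file is read only once, so the 11 million rows never have to fit in memory.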
Upvotes: 6
Reputation: 49205
You can create a dictionary from file2, where the key is the prefix of the time you want and the value is either the first row or all the rows matching that prefix. Then it's simply a matter of doing something like:
entries = file2Dict.get(file1Entry)
if entries:
    print "First entry is %s" % entries[0]
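Building that dictionary could look roughly like this in Python 3. This is a sketch of the "value is the first row" variant, and the function name build_first_row_index is made up for illustration. Because file2 is sorted oldest to newest and setdefault never overwrites an existing key, the stored row is always the earliest one for its minute, which takes care of the ordering worry in the question.

```python
import csv
import io


def build_first_row_index(file2):
    """Map each 16-character minute prefix to the first file2 row seen for it."""
    index = {}
    for row in csv.DictReader(file2):
        # setdefault keeps an existing value, so later rows of the same
        # minute are ignored and only the first one is stored.
        index.setdefault(row["DateTime"][:16], row)
    return index


# Tiny in-memory stand-ins for file1.csv and file2.csv.
sample2 = io.StringIO(
    "id,D,Sym,DateTime,Bid,Ask\n"
    "1375022797,D,USD,2010-12-26 17:00:15,1.311400,1.311700\n"
    "1375023013,D,USD,2010-12-26 17:01:01,1.311200,1.311500\n"
    "1375023039,D,USD,2010-12-26 17:01:02,1.311100,1.311400\n"
)
sample1 = io.StringIO("Code,DTime\n1,2010-12-26 17:01\n2,2010-12-26 17:07\n")

index = build_first_row_index(sample2)
for wanted in csv.DictReader(sample1):
    row = index.get(wanted["DTime"])
    if row:
        print(row["DateTime"], row["Bid"], row["Ask"])
```

The dictionary holds at most one row per distinct minute, so it stays far smaller than the 11 million input rows.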
Upvotes: 1