Reputation: 63
I have two CSV files, both containing columns for date and time. For EACH row, I need to match both time and date in CSV 1 and CSV 2, and extract the Weather from CSV 2.
CSV 1:
Date Time Value
2017/04/20 12:00 100
2017/03/20 12:00 250
2017/03/20 12:00 300
2017/02/20 12:00 80
2017/02/20 12:00 500
CSV 2:
Date Time Weather
2017/02/20 12:00 Sunny
2017/02/20 12:00 Sunny
2017/03/20 12:00 Sunny
2017/03/20 12:00 Sunny
2017/04/20 12:00 Sunny
I don't know if it's efficient, but the first thing I did was to append the CSV files to two python lists:
list1 = []
list2 = []
for row in CSV1:
list1.append(row)
for row in CSV2:
list2.append(row)
I then for each row in list1, grab the date and type, and immediately loop through each row in list2 until the elements match.
for row in list1:
published_date = row[0]
published_time = row[1]
for rows in list2:
if published_date == rows[0] and published_time == rows[1]:
"do something with rows[2]"
break
This works, but CSV1 has 1700 rows and CSV2 has 1.000.000 rows, so this process takes 150 seconds. Is there a significantly faster way?
I know there are solutions where one element must be matched, but here it's 2, and I couldn't tweak the one-element solutions to work.
I'm new to Stack Overflow so if I did something wrong in this post, please inform me.
Upvotes: 3
Views: 461
Reputation: 25023
You need to check that a tuple (row[0], row[1])
has already be seen in another file.
The most natural data structure for this is a set
.
You first have a loop over the smaller file to setup your set, next a loop on the larger one to check its contents against your saved data.
dates_times = {(items[0], items[1]) for items in (line.split() for lines in CSV1)}
for line in CSV2:
items = line.split()
if (items[0], items[1]) in dates_times:
do_someting_with(items[2]
Upvotes: 1
Reputation: 232
I would recommend that your check out the "pandas" library in Python. It may help you with the efficiency question. I was curious and implemented the problem in Pandas and I was able to complete the promblem using some dummy data in 373 milliseconds.
You can play around the following code to gauge the frameworks use.
# Generating some random samples
import pandas as pd
date_range = pd.date_range(start='2017-04-20',periods=1700)
time_range = pd.to_datetime(pd.date_range('12:00', freq='1H', periods=1700).strftime('%H:%m'))
values = np.arange(0,1700)
weather = np.random.choice(['rain','sunny','windy'],size=1700, replace=True)
# Putting the Random Data into a DataFrame
df1 = pd.DataFrame({'Date':date_range,
'Time':time_range,
'Value':values})
df2 = pd.DataFrame({'Date':np.random.choice(date_range, size=1000000, replace=True),
'Time': np.random.choice(time_range, size=1000000, replace=True),
'Weather':np.random.choice(weather, size=1000000, replace=True)})
# Mergind the Datatogther on the Date and Time Columns
df3 = pd.merge(df1,df2,on=['Date','Time'], how='inner')
df3
Upvotes: 4
Reputation: 106543
List is inefficient at finding membership. Use dict instead in your case, since you need to also map date and time to the weather. You can read CSV2 into a dict indexed by a tuple of date and time:
weather_history = {tuple(row[:2]): row[2] for row in CSV2}
So it'll be a dict like:
{('2017/02/20', '12:00'): 'Sunny', ('2017/02/20', '12:00'): 'Sunny', ... }
So that you can then perform the lookup much more efficiently:
for row in list1:
published_date, published_time = row[:2]
if (published_date, published_time) in weather_history:
# do something with weather_history[(published_date, published_time)], which is the weather of that date and time
Upvotes: 1