Reputation: 63

Match multiple elements in multiple CSV files

I have two CSV files, both containing columns for date and time. For EACH row, I need to match both time and date in CSV 1 and CSV 2, and extract the Weather from CSV 2.

CSV 1:

    Date           Time    Value
    2017/04/20     12:00   100
    2017/03/20     12:00   250
    2017/03/20     12:00   300
    2017/02/20     12:00   80
    2017/02/20     12:00   500

CSV 2:

    Date           Time    Weather
    2017/02/20     12:00   Sunny
    2017/02/20     12:00   Sunny
    2017/03/20     12:00   Sunny
    2017/03/20     12:00   Sunny
    2017/04/20     12:00   Sunny

I don't know if it's efficient, but the first thing I did was to append the CSV files to two python lists:

list1 = []
list2 = []
for row in CSV1:
    list1.append(row)
for row in CSV2:
    list2.append(row)

I then for each row in list1, grab the date and type, and immediately loop through each row in list2 until the elements match.

for row in list1:
    published_date = row[0]
    published_time = row[1]
    for rows in list2:
        if published_date == rows[0] and published_time == rows[1]:
            "do something with rows[2]"
            break

This works, but CSV1 has 1700 rows and CSV2 has 1.000.000 rows, so this process takes 150 seconds. Is there a significantly faster way?

I know there are solutions where one element must be matched, but here it's 2, and I couldn't tweak the one-element solutions to work.

I'm new to Stack Overflow so if I did something wrong in this post, please inform me.

Upvotes: 3

Answers (3)

gboffi

Reputation: 25023

You need to check that a tuple (row[0], row[1]) has already be seen in another file.

The most natural data structure for this is a set.

You first have a loop over the smaller file to setup your set, next a loop on the larger one to check its contents against your saved data.

dates_times = {(items[0], items[1]) for items in (line.split() for lines in CSV1)}
for line in CSV2:
    items = line.split()
    if (items[0], items[1]) in dates_times:
        do_someting_with(items[2]

Upvotes: 1

Kavi Sek

Reputation: 232

I would recommend that your check out the "pandas" library in Python. It may help you with the efficiency question. I was curious and implemented the problem in Pandas and I was able to complete the promblem using some dummy data in 373 milliseconds.

You can play around the following code to gauge the frameworks use.

# Generating some random samples
import pandas as pd
date_range = pd.date_range(start='2017-04-20',periods=1700)
time_range = pd.to_datetime(pd.date_range('12:00', freq='1H', periods=1700).strftime('%H:%m'))
values = np.arange(0,1700)
weather = np.random.choice(['rain','sunny','windy'],size=1700, replace=True)

# Putting the Random Data into a DataFrame
df1 = pd.DataFrame({'Date':date_range,
                    'Time':time_range,
                   'Value':values})

df2 = pd.DataFrame({'Date':np.random.choice(date_range, size=1000000, replace=True),
                    'Time': np.random.choice(time_range, size=1000000, replace=True),
                   'Weather':np.random.choice(weather, size=1000000, replace=True)})

# Mergind the Datatogther on the Date and Time Columns
df3 = pd.merge(df1,df2,on=['Date','Time'], how='inner')
df3

Upvotes: 4

blhsing

Reputation: 106543

List is inefficient at finding membership. Use dict instead in your case, since you need to also map date and time to the weather. You can read CSV2 into a dict indexed by a tuple of date and time:

weather_history = {tuple(row[:2]): row[2] for row in CSV2}

So it'll be a dict like:

{('2017/02/20', '12:00'): 'Sunny',  ('2017/02/20', '12:00'): 'Sunny', ... }

So that you can then perform the lookup much more efficiently:

for row in list1:
    published_date, published_time = row[:2]
    if (published_date, published_time) in weather_history:
        # do something with weather_history[(published_date, published_time)], which is the weather of that date and time

Upvotes: 1

Match multiple elements in multiple CSV files

Answers (3)

Related Questions