pythonweb
pythonweb

Reputation: 1073

How to loop through mismatched data

So I have been trying to think of how this would be possible, but nothing comes to mind.

I have 2 csv files with dates and data in them. The dates, however, are mismatched. I want to be able to write a code that allows me to loop through all the data, find the matching dates, and then create a text document with the date in common and the 2 data values from the separate sheets. The other alternative is matching them myself in excel.

Any suggestions?

Ex. of what I am trying to accomplish:

Say I have this file1:

01/01/01 | 1 
01/02/01 | 2 
01/03/01 | 3 
01/04/01 | 4 
01/05/01 | 5 
01/06/01 | 6 
01/07/01 | 7 
01/08/01 | 8 

and this file2:

01/05/01 | 9  
01/06/01 | 22 
01/07/01 | 33 
01/08/01 | 44
01/09/01 | 55
01/10/01 | 66
01/11/01 | 77
01/12/01 | 88

What I would hope to have returned is (after running a loop):

01/05/01 | 5 | 9
01/06/01 | 6 | 22
01/07/01 | 7 | 33
01/08/01 | 8 | 44

Any help would be appreciated, thanks!

Upvotes: 0

Views: 119

Answers (4)

Luke Kot-Zaniewski
Luke Kot-Zaniewski

Reputation: 1161

I think pandas join logic is well suited for this task since you are effectively asking for an inner join as keyed by date.

First define a parsing function for your dates (which I added in case you needed to treat these as actual dates for further processing in python). Then load each csv file into a separate dataframe. Finally join and drop NaN values to get the inner join you seek (or just explicitly inner join by setting how parameter in join function). You can also easily drop the outptut dataframe into a csv file with the to_csv function.

If you are looking to get a little more creative with aggregating or joining then pandas is the way to go.

import pandas
parser = lambda date: pandas.datetime.strptime(date.strip(), '%y/%m/%d')
dt = pandas.read_csv('H:\\one.csv', parse_dates = [0], date_parser = parser, index_col=0, delimiter="|", names = ["date", "val"])
dt2 = pandas.read_csv('H:\\two.csv', parse_dates = [0], date_parser = parser, index_col=0, delimiter="|", names = ["date", "val2"])
dt.join(dt2).dropna().to_csv("H:\\output.csv")

This would also work but you lose all of the date-specific functionality within python:

import pandas
dt = pandas.read_csv('H:\\one.csv', index_col=0, delimiter="|", names = ["date", "val"])
dt2 = pandas.read_csv('H:\\two.csv', index_col=0, delimiter="|", names = ["date", "val2"])
dt.join(dt2, how='inner').to_csv("H:\\output.csv")

Upvotes: 1

mhawke
mhawke

Reputation: 87084

Use a defaultdict of lists to store the data from the CSV files, then find those items in the dict that have more than one value in the list.

import csv
from collections import defaultdict

merged = defaultdict(list)

for filename in 'f1.csv', 'f2.csv':
    with open(filename) as f:
        for date, value in csv.reader(f, delimiter='|'):
            merged[date].append(value)

with open('out.csv', 'w') as f:
    w = csv.writer(f, delimiter='|')
    for date in sorted(merged):
        if len(merged[date]) > 1:
            w.writerow([date] + merged[date])

This will (almost) produce the file that you require, the only difference being that the delimiter is a single | instead of | (with surrounding spaces). I also assumed that the input files are similarly delimited.

Upvotes: 1

A. J. Parr
A. J. Parr

Reputation: 8026

Using CSV module to read out both files, convert them to a dictionary which maps date => list[values], then prints them.

import csv

# Extract CSV
csv1 = []
with open('first.csv', 'r') as f1:
    csv_reader = csv.reader(f1)
    for row in csv_reader:
        csv1.append(row)

csv2 = []
with open('second.csv', 'r') as f2:
    csv_reader = csv.reader(f2)
    for row in csv_reader:
        csv2.append(row)

# Initialize dict which will map dates to values
d = {}

# Map dates to values by getting a default array and appending the values
for row in csv1:
    v = d.get(row[0], [])
    v.append(row[1])
    d[row[0]] = v

for row in csv2:
    v = d.get(row[0], [])
    v.append(row[1])
    d[row[0]] = v

# Print results
for k, v in d.items():
    values = ' | '.join(list(map(str, v)))
    print(f'{k} | {values}')

Upvotes: 0

Prune
Prune

Reputation: 77857

  • Read the first record from each file, r1 and r2
  • Extract dates d1 and d2
  • while there's remaining data in both files:
    • if d1 == d2:
      • merge records and print
      • get next line of each file (new r1 and r2)
    • else if d1 < d2
      • get next r1
    • else
      • get next r2

Briefly, you're stepping through both files, keeping a "bookmark" in each. If the records match, merge them and print; advance both files. Otherwise, advance the bookmark that's "behind" in the date.

Upvotes: 0

Related Questions