user2078890

Reputation: 21

Comparing two CSV files by non-unique values from multiple columns, outputting extra data to a CSV where matched

I have run out of search terms and have come full circle. I've read and tried a ton of questions and answers on what I thought would be a fairly simple task, but no joy.

I have two csv files.

**file1.csv** (2,000+ lines)

    product_code, colour, size, sku, more cols..., barcode
    item98, red, XL, adfd56384678, ..., null
    item99, black, L, adfgk9087467, ..., null
    item98, red, S, adfgad240568, ..., null

**file2.csv** (20,000+ lines)

    ITEM_CODE, COLOUR, SIZE, BAR_CODE
    item98, RED, XL, 090900887
    item98, RED, S, 43581034
    item97, BLUE, M, 519685371

I need to output:

**matched-result.csv** (2,000+ lines)

    product_code, colour, size, sku, more cols..., barcode
    item98, red, XL, adfd56384678, ..., 090900887
    item99, black, L, adfgk9087467, ..., null
    item98, red, S, adfgad240568, ..., 519685371

The sku and barcode are unique values which can only be identified by matching product_code, size and colour. I need the barcode in a new column at the end of file1.csv. My first successful-ish attempt was using awk.

<pre>
awk -F',' -v OFS=, 'NR==FNR{a[$1,tolower($2),$3]=$4;next}{if( b = a[$1,tolower($2),$3]){ print $1,$2,$3,$4,b }}' file1.csv file2.csv > matched-result.csv
</pre>

I struggled to output the whole of file1.csv with the results; trying `print $0,b` created a new row containing only the value of b. I would also like all of file1.csv output, with empty values for non-matches, and the header if possible. I had to align the columns for this awk approach to work, but the original fields don't align. That's no big deal to do, but I wonder if it's necessary or if there's a workaround?

After that I tried to merge file1.csv with the awk output file matched-result.csv using a python script.

    import csv
    from collections import OrderedDict

    with open('file1.csv') as f:
        r = csv.reader(f, delimiter=',')
        dict1 = {row[0]: row[1:] for row in r}

    with open('matched-result.csv') as f:
        r = csv.reader(f, delimiter=',')
        dict2 = OrderedDict((row[0], row[1:]) for row in r)

    result = OrderedDict()
    for d in (dict1, dict2):
        for key, value in d.items():  # .iteritems() is Python 2 only
            result.setdefault(key, []).extend(value)

    with open('desired-result.csv', 'w', newline='') as f:  # 'wb' is Python 2 only
        w = csv.writer(f)
        for key, value in result.items():
            w.writerow([key] + value)

The output isn't the desired result: the order is wrong, the number of records doesn't exactly match file1.csv, and there's an extra row. Also, this two-step approach seems disjointed; it feels like either tool could do the whole job on its own if done properly?

I have tried csvkit to join/merge the files, but it output 20,000+ rows, some of which were duplicates. I think it treated the product_code/ITEM_CODE columns as unique values, which they aren't. I have looked into join and grep, but they don't seem to be the answer either.

I have installed pandas and PowerShell and am willing to give them a go, but I don't know where to start with those, so clear instructions are needed there. Oh, and I'm a noob to all these languages and programs, but not completely wet behind the ears.

Hopefully I have given enough info to go on. I would have posted links to the posts on here I have read; you can take my word for it that it's 90%+ of them.

Please can replies offer code examples, and preferably not a completely new language or approach.

// UPDATE

I've voted for @zwer's answer because it more or less worked out of the box, bar a minor tweak (see the comments). But it's worth pointing out that @RavinderSingh13's awk method worked as well, with the same result. I will pursue @acushner's pandas method when/if I hear back, and update this bit.

In case anyone cares which I think is the best approach: tomayto, tomarto, not sure that works in print. Personally, Python is the easier to read. Awk is nice and short, but I think it's a regex-based language, which I find harder to understand, and it was a steeper learning curve.

Thanks to everyone who commented. I was wrestling with this for a long time and got it sorted on here pronto.

Upvotes: 1

Views: 1537

Answers (3)

acushner

Reputation: 9946

using pandas is pretty straightforward. you would just do something like this:

    import pandas as pd

    df_sku = pd.read_csv('file1.csv', index_col=False)
    df_bc = pd.read_csv('file2.csv', index_col=False)
    res = df_sku.merge(df_bc, left_on=['product_code', 'colour', 'size'], right_on=['ITEM_CODE', 'COLOUR', 'SIZE'])

give this a shot and let me know how you make out. additionally, if you're going to be doing a bunch of csv stuff, pandas is 100% worth learning.
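One caveat worth noting: the default inner merge drops any file1.csv row that has no match, while the question wants every row kept. Passing `how='left'` keeps them. Below is a minimal sketch of that variant, with hypothetical inline data standing in for the two files (swap the `io.StringIO(...)` objects for the real file paths):

```python
import io
import pandas as pd

# Inline stand-ins for file1.csv / file2.csv, mirroring the question's samples.
file1 = io.StringIO(
    "product_code,colour,size,sku,barcode\n"
    "item98,red,XL,adfd56384678,\n"
    "item99,black,L,adfgk9087467,\n"
    "item98,red,S,adfgad240568,\n"
)
file2 = io.StringIO(
    "ITEM_CODE,COLOUR,SIZE,BAR_CODE\n"
    "item98,RED,XL,090900887\n"
    "item98,RED,S,43581034\n"
    "item97,BLUE,M,519685371\n"
)

# dtype=str keeps leading zeros in barcodes intact.
df_sku = pd.read_csv(file1, dtype=str)
df_bc = pd.read_csv(file2, dtype=str)

# Normalise the colour case so 'red' matches 'RED'.
df_bc["COLOUR"] = df_bc["COLOUR"].str.lower()

# how='left' keeps every row of file1.csv, matched or not.
merged = df_sku.merge(
    df_bc,
    how="left",
    left_on=["product_code", "colour", "size"],
    right_on=["ITEM_CODE", "COLOUR", "SIZE"],
)

# Fill the barcode column from file2 and drop file2's helper columns.
merged["barcode"] = merged["BAR_CODE"]
merged = merged.drop(columns=["ITEM_CODE", "COLOUR", "SIZE", "BAR_CODE"])
print(merged.to_csv(index=False))
```

Unmatched rows carry NaN in the barcode column, which `to_csv` writes as an empty field; pass `na_rep='null'` to `to_csv` if you want the literal `null` from the question's expected output.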

Upvotes: 0

RavinderSingh13

Reputation: 133670

I believe there is some inconsistency in your output; ideally the last line of your shown output shouldn't be there if we are considering $1, $2 and $3 as the index into file2.csv.

If that is indeed a typo, then the following may help you.

    awk -F', ' 'FNR==NR{a[$1,tolower($2),$3]=$NF;next} (($1,$2,$3) in a){$NF=a[$1,$2,$3]} 1' file2.csv file1.csv

EDIT: As OP has Control-M (carriage return) characters in the Input_file(s), adding the following for that case too (note the gsub must not be wrapped in extra quotes, or the shell quoting breaks):

    awk -F', ' '{gsub(/\r/,"")} FNR==NR{a[$1,tolower($2),$3]=$NF;next} (($1,$2,$3) in a){$NF=a[$1,$2,$3]} 1' file2.csv file1.csv

Upvotes: 1

zwer

Reputation: 25809

A simple and versatile Python approach:

    import csv

    # load the second file in memory as we need it for a barcode lookup
    barcodes = {}  # will hold a map for our barcode values
    with open("file2.csv", "rU") as f:  # open file2.csv for reading
        reader = csv.reader(f)  # create a CSV reader, you may need to tweak its options
        next(reader)  # skip the header
        for row in reader:  # loop it row by row
            # store code:color:size map as a tuple in our barcodes map
            barcodes[(row[0], row[1].lower(), row[2])] = row[3]
    # now you can get a barcode from it like: barcodes[("item98", "red", "XL")]

    # open file1.csv for reading and matched-result.csv for writing
    with open("file1.csv", "rU") as f, open("matched-result.csv", "w") as out:
        reader = csv.reader(f)  # create a CSV reader, you may need to tweak its options
        header = next(reader)  # get the header
        indexes = {column: index for index, column in enumerate(header)}  # map column indexes
        writer = csv.writer(out, lineterminator="\n")  # create a CSV writer for our output
        writer.writerow(header)  # write the same header to the output file
        for row in reader:  # loop through file1.csv rows
            # create a search map entry
            search = (row[indexes["product_code"]], row[indexes["colour"]], row[indexes["size"]])
            if search in barcodes:  # we have a barcode entry
                row[indexes["barcode"]] = barcodes[search]  # update it in the current row
            writer.writerow(row)  # write the potentially modified row to our output file

If there is a possibility of file2.csv having its columns in a different order, you'll have to make sure that the mapping part (barcodes[(row[0], row[1].lower(), row[2])] = row[3]) matches the column order, though.
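One way to make that lookup independent of file2.csv's column order is to key it by header name instead of position, via csv.DictReader. A minimal sketch, with hypothetical inline data standing in for file2.csv:

```python
import csv
import io

# Inline stand-in for file2.csv; replace with open("file2.csv") in real use.
file2 = io.StringIO(
    "ITEM_CODE,COLOUR,SIZE,BAR_CODE\n"
    "item98,RED,XL,090900887\n"
    "item98,RED,S,43581034\n"
)

barcodes = {}
# DictReader maps each row to {header: value}, so column order no longer matters.
for row in csv.DictReader(file2):
    key = (row["ITEM_CODE"], row["COLOUR"].lower(), row["SIZE"])
    barcodes[key] = row["BAR_CODE"]

print(barcodes[("item98", "red", "XL")])  # → 090900887
```

The same trick applies on the writing side with csv.DictWriter, at the cost of slightly more setup.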

EDIT - Since I didn't test it before, I wrote a simple test to see if it's working:

file1.csv:

    product_code,colour,size,sku,added_col,barcode
    item98,red,XL,adfd56384678,1,null
    item99,black,L,adfgk9087467,2,null
    item98,red,S,adfgad240568,3,null

file2.csv:

    ITEM_CODE,COLOUR,SIZE,BAR_CODE
    item98,RED,XL,090900887
    item98,RED,S,43581034
    item97,BLUE,M,519685371

And it results in matched-result.csv containing:

    product_code,colour,size,sku,added_col,barcode
    item98,red,XL,adfd56384678,1,090900887
    item99,black,L,adfgk9087467,2,null
    item98,red,S,adfgad240568,3,43581034
If your CSV files follow the structure presented in the question, as these do, it should work just fine. The only problem I could potentially see in your data is that there is a space after the comma separating the values. If that's indeed the case with your data, add skipinitialspace=True to the arguments of each csv.reader() (e.g. reader = csv.reader(f, skipinitialspace=True)) to account for the extra space. As I wrote in the code, you might need to tweak the csv.reader()/csv.writer() options to match your CSV dialect.

Upvotes: 0
