Reputation: 21
I have run out of search terms and have come full circle. I've read and tried a ton of questions and answers on what I thought would be a fairly simple task, but no joy.
I have two csv files.
**file1.csv** (2,000+ lines)
<pre>
product_code, colour, size, sku, more cols..., barcode
item98, red, XL, adfd56384678, ..., null
item99, black, L, adfgk9087467, ..., null
item98, red, S, adfgad240568, ..., null
</pre>
**file2.csv** (20,000+ lines)
<pre>
ITEM_CODE, COLOUR, SIZE, BAR_CODE
item98, RED, XL, 090900887
item98, RED, S, 43581034
item97, BLUE, M, 519685371
</pre>
I need to output:
**matched-result.csv** (2,000+ lines)
<pre>
product_code, colour, size, sku, more cols..., barcode
item98, red, XL, adfd56384678, ..., 090900887
item99, black, L, adfgk9087467, ..., null
item98, red, S, adfgad240568, ..., 519685371
</pre>
The sku and barcode are unique values which can only be identified by matching product_code, size and colour. I need the barcode in a new column at the end of file1.csv. My first successful-ish attempt was using awk.
<pre>
awk -F',' -v OFS=, 'NR==FNR{a[$1,tolower($2),$3]=$4;next}{if( b = a[$1,tolower($2),$3]){ print $1,$2,$3,$4,b }}' file1.csv file2.csv > matched-result.csv
</pre>
I struggled to output the whole of file1.csv alongside the results; trying print $0,b created a new row holding only the value of b. I would also like all of file1.csv output, with empty values for non-matches, and the header if possible. I had to align the columns for this awk approach to work, but the original fields don't align. That's no big deal to do, but I wonder if it's necessary or if there's a workaround?
After that I tried to merge file1.csv with the awk output file matched-result.csv using a python script.
<pre>
import csv
from collections import OrderedDict

with open('file1.csv') as f:
    r = csv.reader(f, delimiter=',')
    dict1 = {row[0]: row[1:] for row in r}

with open('matched-result.csv') as f:
    r = csv.reader(f, delimiter=',')
    dict2 = OrderedDict((row[0], row[1:]) for row in r)

result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)

with open('desired-result.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)
</pre>
The output isn't the desired result: the order is wrong, the number of records doesn't exactly match file1.csv, and there's an extra row. Also, this two-step approach feels disjointed; it seems like either tool could do the whole job if done properly?
I have tried csvkit to join/merge the files, but it output 20,000+ rows, some of which were duplicates. I think it treated the product_code/ITEM_CODE columns as unique values, which they aren't. I have looked into join and grep but they don't seem to be the answer either.
I have installed pandas and PowerShell and am willing to give them a go, but I don't know where to start with those, so clear instructions are needed there. Oh, and I'm a noob to all these languages and programs, but not completely wet behind the ears.
Hopefully I have given enough info to go on. I would have posted links to the posts on here I have read, but you can take my word for it that it's 90%+ of them.
Please can replies offer code examples, and preferably not a completely new language or approach.
// UPDATE
I've voted for @zwer's answer because it more or less worked out of the box, bar a minor tweak (see the comments). But it's worth pointing out that @RavinderSingh13's awk method worked as well, with the same result. I will pursue @acushner's pandas method when/if I hear back, and update this bit.
In case anyone cares which I think is the best approach: tomayto, tomahto (not sure that works in print). Personally, Python is the easier to read. Awk is nice and short, but I think it's a regex-based language, which I find harder to understand, and it was a steeper learning curve.
Thanks to everyone who commented. I was wrestling with this for a long time and got it sorted on here pronto.
Upvotes: 1
Views: 1537
Reputation: 9946
using pandas is pretty straightforward. you would just do something like this:
<pre>
import pandas as pd

df_sku = pd.read_csv('file1.csv', index_col=False)
df_bc = pd.read_csv('file2.csv', index_col=False)

res = df_sku.merge(df_bc, left_on=['product_code', 'colour', 'size'], right_on=['ITEM_CODE', 'COLOUR', 'SIZE'])
</pre>
give this a shot and let me know how you make out. additionally, if you're going to be doing a bunch of csv stuff, pandas is 100% worth learning.
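A possible caveat, assuming the column names from the samples in the question: merge() defaults to an inner join, so file1.csv rows without a match in file2.csv would be dropped, and the colour values differ in case between the two files. A sketch along these lines might handle both and leave the barcode empty for non-matches:
<pre>
import pandas as pd

# a sketch, assuming the column names shown in the question's samples;
# dtype=str keeps barcodes as text so leading zeros (e.g. 090900887) survive
df_sku = pd.read_csv('file1.csv', index_col=False, skipinitialspace=True, dtype=str)
df_bc = pd.read_csv('file2.csv', index_col=False, skipinitialspace=True, dtype=str)

# normalise case so 'red' matches 'RED'
df_bc['COLOUR'] = df_bc['COLOUR'].str.lower()

# how='left' keeps every file1.csv row, even when there is no barcode match
res = df_sku.merge(df_bc, how='left',
                   left_on=['product_code', 'colour', 'size'],
                   right_on=['ITEM_CODE', 'COLOUR', 'SIZE'])

# move the matched BAR_CODE into file1's barcode column, keeping any value already there
res['barcode'] = res['BAR_CODE'].fillna(res['barcode'])
res = res.drop(columns=['ITEM_CODE', 'COLOUR', 'SIZE', 'BAR_CODE'])

res.to_csv('matched-result.csv', index=False)
</pre>
The skipinitialspace=True part is only needed if the real files have a space after each comma, as the samples in the question do.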
Upvotes: 0
Reputation: 133670
I believe there is some inconsistency in your output; ideally the last line of your shown output shouldn't be there if we are considering $1, $2 and $3 as the index of the file2.csv file.
If that is indeed a typo, then the following may help. It builds a lookup from file2.csv keyed on the item code, lower-cased colour and size, then for each matching line of file1.csv replaces the last field (the null barcode) with the stored BAR_CODE:
<pre>
awk -F", " 'FNR==NR{a[$1,tolower($2),$3]=$NF;next} (($1,$2,$3) in a){$NF=a[$1,$2,$3]} 1' file2.csv file1.csv
</pre>
EDIT: As OP has control-M characters in the Input_file(s), adding the following to handle that too:
<pre>
awk -F", " '{gsub(/\r/,"")} FNR==NR{a[$1,tolower($2),$3]=$NF;next} (($1,$2,$3) in a){$NF=a[$1,$2,$3]} 1' file2.csv file1.csv
</pre>
Upvotes: 1
Reputation: 25809
A simple and versatile Python approach:
<pre>
import csv

# load the second file in memory as we need it for a barcode lookup
barcodes = {}  # will hold a map for our barcode values
with open("file2.csv", "rU") as f:  # open file2.csv for reading
    reader = csv.reader(f)  # create a CSV reader, you may need to tweak its options
    next(reader)  # skip the header
    for row in reader:  # loop it row by row
        # store code:color:size map as a tuple in our barcodes map
        barcodes[(row[0], row[1].lower(), row[2])] = row[3]

# now you can get a barcode from it like: barcodes[("item98", "red", "XL")]

# open file1.csv for reading and matched-result.csv for writing
with open("file1.csv", "rU") as f, open("matched-result.csv", "w") as out:
    reader = csv.reader(f)  # create a CSV reader, you may need to tweak its options
    header = next(reader)  # get the header
    indexes = {column: index for index, column in enumerate(header)}  # map column indexes
    writer = csv.writer(out, lineterminator="\n")  # create a CSV writer for our output
    writer.writerow(header)  # write the same header to the output file
    for row in reader:  # loop through file1.csv rows
        # create a search map entry
        search = (row[indexes["product_code"]], row[indexes["colour"]], row[indexes["size"]])
        if search in barcodes:  # we have a barcode entry
            row[indexes["barcode"]] = barcodes[search]  # update it in the current row
        writer.writerow(row)  # write the potentially modified row to our output file
</pre>
If there is a possibility of file2.csv having columns in a different order, you'll have to make sure that the mapping part (barcodes[(row[0], row[1].lower(), row[2])] = row[3]) matches the column order, though.
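If that's a concern, the lookup could instead be keyed off file2.csv's header names, the same way the code above maps file1.csv's columns; a minimal sketch, assuming the header names shown in the question:
<pre>
import csv

barcodes = {}
with open("file2.csv", "rU") as f:
    reader = csv.reader(f)
    header = next(reader)
    # map file2.csv column names to their positions, whatever order they come in
    idx = {column: index for index, column in enumerate(header)}
    for row in reader:
        key = (row[idx["ITEM_CODE"]], row[idx["COLOUR"]].lower(), row[idx["SIZE"]])
        barcodes[key] = row[idx["BAR_CODE"]]
</pre>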
EDIT - Since I didn't test it before, I ran a simple test to see if it's working:
file1.csv:
<pre>
product_code,colour,size,sku,added_col,barcode
item98,red,XL,adfd56384678,1,null
item99,black,L,adfgk9087467,2,null
item98,red,S,adfgad240568,3,null
</pre>
file2.csv:
<pre>
ITEM_CODE,COLOUR,SIZE,BAR_CODE
item98,RED,XL,090900887
item98,RED,S,43581034
item97,BLUE,M,519685371
</pre>
And it results in matched-result.csv containing:
<pre>
product_code,colour,size,sku,added_col,barcode
item98,red,XL,adfd56384678,1,090900887
item99,black,L,adfgk9087467,2,null
item98,red,S,adfgad240568,3,43581034
</pre>
If your CSV files follow the structure presented in the question, like these do, it should work just fine. The only problem I could potentially see in your data is that there is a space after the comma that separates the values. If that's indeed the case with your data, add skipinitialspace=True to the arguments of each csv.reader() (e.g. reader = csv.reader(f, skipinitialspace=True)) to account for the extra space. As I wrote in the code, you might need to tweak the csv.reader()/csv.writer() options to match your CSV dialect.
Upvotes: 0