Reputation: 1373
I have two csv files like below.
CSV1
data13 data23 d main_data1;main_data2 data13 data23
data12 data22 d main_data1;main_data2 data12 data22
data11 data21 d main_data1;main_data2 data11 data21
data3 data4 d main_data2;main_data4 data3 data4
data52 data62 d main_data3 data51 data62
data51 data61 d main_data3 main_data3 data61
data7 data8 d main_data4 data7 data8
CSV2
id1 main_data1 a1 a2 a3
id2 main_data2 b1 b2 b3
id3 main_data3 c1 c2 c3
id4 main_data4 d1 d2 d3
id5 main_data5 e1 e2 e3
Now my question is, I know how to merge two CSV files when one of the columns is exactly the same in both the files. But my question is a little different. column 4 from CSV1 could contain column 2 from CSV2. I'd like to get a CSV file as below
FINAL_CSV
id1 main_data1 a1 a2 a3 data13
id2 main_data2 b1 b2 b3 data3
id3 main_data3 c1 c2 c3 main_data3
id4 main_data4 d1 d2 d3 data7
id5 main_data5 e1 e2 e3
where:
1. it matches the data from both the columns and gets corresponding rows from the first occurrence and write to the csv file.
2. When there's no match, it can leave the last column in FINAL_CSV blank or write 'NA' or anything of that sort.
3. When data in columns 4 and 5 of CSV1 match exactly, it returns that row instead of the first occurrence.
I'm totally lost on how to do this. Helping with a part of it is fine too. Any suggestions are highly appreciated.
PS- I know data from csv file should be separated by a comma, but for the sake of clarity, I preferred tabs, though the actual data is separated by commas.
EDIT: Actually, the 'main_data' can be in any column in CSV2, not in just column2. The same 'main_data' could also repeat in multiple rows, then I'd like to get all the corresponding rows.
Upvotes: 7
Views: 10395
Reputation: 9176
Have you considered using pandas? If you are familiar with R, then data-frames should be pretty straightforward. The following gives you what you want:
from pandas import merge, read_table
csv1 = read_table('CSV1.csv', sep=r"[;,]", header=None)
csv2 = read_table('CSV2.csv', sep=r"[,]", header=None)
print csv1
print csv2
Note that I replaced the tabs with commas and separated on semi-colons as well. The output so far should be:
0 1 2 3 4 5 6
0 data13 data23 d main_data1 main_data2 data13 data23
1 data12 data22 d main_data1 main_data2 data12 data22
2 data11 data21 d main_data1 main_data2 data11 data21
3 data3 data4 d main_data2 main_data4 data3 data4
4 data52 data62 d main_data3 NaN data51 data62
5 data51 data61 d main_data3 NaN main_data3 data61
6 data7 data8 d main_data4 NaN data7 data8
[7 rows x 7 columns]
0 1 2 3 4
0 id1 main_data1 a1 a2 a3
1 id2 main_data2 b1 b2 b3
2 id3 main_data3 c1 c2 c3
3 id4 main_data4 d1 d2 d3
4 id5 main_data5 e1 e2 e3
[5 rows x 5 columns]
Using a left-join:
kw1 = dict(how='left', \
left_on=[3,4], \
right_on=[1,1], \
suffixes=('l', 'r'))
df1 = merge(csv1, csv2, **kw1)
df1.drop_duplicates(cols=[3], inplace=True)
print df1[[0,7]]
Gives the zeroth and seventh column of the merge:
3 5
0 main_data1 data13
3 main_data2 data3
4 main_data3 data51
6 main_data4 data7
[4 rows x 2 columns]
And to give the output as you want it, do another merge (this time an outer-join) with CSV2
:
kw2 = dict(how='outer', \
left_on=[3], \
right_on=[1], \
suffixes=('l', 'r'))
df2 = merge(df1, csv2, **kw2)
print df2[[15,16,17,18,19,8]]
Output:
0 1 2 3r 4r 5
0 id1 main_data1 a1 a2 a3 data13
1 id2 main_data2 b1 b2 b3 data3
2 id3 main_data3 c1 c2 c3 data51
3 id4 main_data4 d1 d2 d3 data7
4 id5 main_data5 e1 e2 e3 NaN
You don't have to use **kw
for keyword arguments. I simply used it to make everything fit horizontally.
I let read_table
and merge
decide column names. If you assign column names yourself, you will get better looking output.
Upvotes: 3
Reputation: 15545
Since you originally asked for a Python solution to this I thought I would provide one. The simplest solution that occurred was to first load CSV1
and use it generate a mapping dictionary to use when generating the output from CSV2.
If I understand the input file correctly, only the values to the left of the ;
(if there is one) are to be considered. This can be achieved by using split(';')
and taking element zero. If there is no ;
then element zero will be the entire string. Assignment to the mapper
then just needs to follow the rules you've defined (only add if not already there, except when columns 4 & 5 match).
The code below produces your requested output:
import csv
mapper = dict()
with open('CSV1', 'r') as f1:
reader = csv.reader(f1)
for row in reader:
# Column 3 contains the match; but we only want the left-most (before semi-colon)
i = row[3].split(';')[0]
# Column 4 contains the target value for output
t = row[4]
if i not in mapper:
mapper[i] = t
elif row[3] == row[4]:
mapper[i] = t
with open('CSV2', 'r') as f2:
with open('FINAL_CSV', 'wb') as fo:
reader = csv.reader(f2)
writer = csv.writer(fo)
for row in reader:
if row[1] in mapper:
row.append( mapper[ row[1] ] )
writer.writerow(row)
The output file:
id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3
To address the 'main_data can be in any column of CSV' modification use the following code:
for row in reader:
for r in row:
if r in mapper:
row.append( mapper[ r ] )
break
writer.writerow(row)
This will search each entry in the current row of CSV2 and if there is a match (to the original mapper data) append that mapped data to the row. The row will then be written as before.
Upvotes: 2
Reputation:
A way with (g)awk .
awk -F, 'NR==FNR{a[$2]=$0;next}
{split($4,b,";");x=b[1]}
(x in a)&&!c[x]++{d[x]=$5}
($5 in a){d[$5]=$5}
END{n=asorti(a,e);for(i=1;i<=n;i++)print a[e[i]]","d[e[i]]}' CSV1 CSV2
Output
id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3,
Upvotes: 3
Reputation: 911
Since the condition for merging seems to be complicated it might be worthwhile to load the data into a database and use SQL. Using SQLite in-memory you can do this like this (assuming comma separated data)
import csv
import sqlite3
def createTable(cursor, rows, tablename):
tableCreated = False
for row in rows:
if not tableCreated:
sql = "CREATE TABLE %s(ROW INTEGER PRIMARY KEY, " + ", ".join(["c%d" % (i+1) for i in range(len(row))]) + ")"
cur.execute(sql % tablename)
tableCreated = True
sql = "INSERT INTO %s VALUES(NULL, " + ", ".join(["'" + c + "'" for c in row]) + ")"
cur.execute(sql % tablename)
conn.commit()
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for filename, tablename in [(path_to_csv1, "CSV1"), (path_to_csv2, "CSV2")]:
with open(filename, "r") as f:
reader = csv.reader(f, delimiter=',')
rows = [row for row in reader]
createTable(cur, rows, tablename)
You can then formulate your join logic in SQL. You can run queries like this:
for row in cur.execute(your_sql_statement):
print row
The following query gives the desired output:
WITH
MATCHES AS( -- get all matches
SELECT CSV2.*
, CSV1.ROW as ROW_1
, CSV1.C4 as C4_1
, CSV1.C5 as C5_1
FROM CSV2
LEFT JOIN CSV1
ON CSV1.C4 LIKE '%' || CSV2.C2 || '%'
),
EXACT AS( -- matches where CSV1.C4 = CSV1.C5
SELECT *
FROM MATCHES
WHERE C4_1 = C5_1
),
MIN_ROW AS( -- CSV1.ROW of first occurence for each CSV2.C1
SELECT C1
, min(ROW_1) as ROW_1
FROM MATCHES
WHERE C1 NOT IN (SELECT C1 FROM EXACT)
GROUP BY C1, C2, C3, C4, C5
)
-- use C4=C5 first
SELECT *
FROM EXACT
UNION
-- if match not in exact, use first occurence
SELECT MATCHES.*
FROM MIN_ROW
INNER JOIN MATCHES
ON MIN_ROW.C1 = MATCHES.C1
AND (MIN_ROW.ROW_1 = MATCHES.ROW_1 OR MIN_ROW.ROW_1 IS NULL)
ORDER BY C1
Upvotes: 2