Reputation: 1373

Merge two CSV files based on a data from a column

I have two csv files like below.

CSV1

data13      data23      d      main_data1;main_data2      data13         data23
data12      data22      d      main_data1;main_data2      data12         data22
data11      data21      d      main_data1;main_data2      data11         data21
data3       data4       d      main_data2;main_data4      data3          data4
data52      data62      d      main_data3                 data51         data62
data51      data61      d      main_data3                 main_data3     data61
data7       data8       d      main_data4                 data7          data8

CSV2

id1      main_data1      a1      a2      a3
id2      main_data2      b1      b2      b3
id3      main_data3      c1      c2      c3
id4      main_data4      d1      d2      d3
id5      main_data5      e1      e2      e3

Now my question is, I know how to merge two CSV files when one of the columns is exactly the same in both the files. But my question is a little different. column 4 from CSV1 could contain column 2 from CSV2. I'd like to get a CSV file as below

FINAL_CSV

id1      main_data1      a1      a2      a3      data13
id2      main_data2      b1      b2      b3      data3
id3      main_data3      c1      c2      c3      main_data3
id4      main_data4      d1      d2      d3      data7
id5      main_data5      e1      e2      e3

where:
1. it matches the data from both the columns and gets corresponding rows from the first occurrence and write to the csv file.
2. When there's no match, it can leave the last column in FINAL_CSV blank or write 'NA' or anything of that sort.
3. When data in columns 4 and 5 of CSV1 match exactly, it returns that row instead of the first occurrence.

I'm totally lost on how to do this. Helping with a part of it is fine too. Any suggestions are highly appreciated.
PS- I know data from csv file should be separated by a comma, but for the sake of clarity, I preferred tabs, though the actual data is separated by commas.

EDIT: Actually, the 'main_data' can be in any column in CSV2, not in just column2. The same 'main_data' could also repeat in multiple rows, then I'd like to get all the corresponding rows.

Upvotes: 7

Answers (4)

lafras

Reputation: 9176

Have you considered using pandas? If you are familiar with R, then data-frames should be pretty straightforward. The following gives you what you want:

from pandas import merge, read_table

csv1 = read_table('CSV1.csv', sep=r"[;,]", header=None)
csv2 = read_table('CSV2.csv', sep=r"[,]",  header=None)

print csv1
print csv2

Note that I replaced the tabs with commas and separated on semi-colons as well. The output so far should be:

        0       1   2           3           4           5       6
0  data13  data23   d  main_data1  main_data2      data13  data23
1  data12  data22   d  main_data1  main_data2      data12  data22
2  data11  data21   d  main_data1  main_data2      data11  data21
3   data3   data4   d  main_data2  main_data4       data3   data4
4  data52  data62   d  main_data3         NaN      data51  data62
5  data51  data61   d  main_data3         NaN  main_data3  data61
6   data7   data8   d  main_data4         NaN       data7   data8

[7 rows x 7 columns]
     0           1   2   3   4
0  id1  main_data1  a1  a2  a3
1  id2  main_data2  b1  b2  b3
2  id3  main_data3  c1  c2  c3
3  id4  main_data4  d1  d2  d3
4  id5  main_data5  e1  e2  e3

[5 rows x 5 columns]

Using a left-join:

kw1 = dict(how='left', \
          left_on=[3,4], \
          right_on=[1,1], \
          suffixes=('l', 'r'))

df1 = merge(csv1, csv2, **kw1)
df1.drop_duplicates(cols=[3], inplace=True)

print df1[[0,7]]

Gives the zeroth and seventh column of the merge:

            3       5
0  main_data1  data13
3  main_data2   data3
4  main_data3  data51
6  main_data4   data7

[4 rows x 2 columns]

And to give the output as you want it, do another merge (this time an outer-join) with CSV2:

kw2 = dict(how='outer', \
           left_on=[3], \
           right_on=[1], \
           suffixes=('l', 'r'))

df2 = merge(df1, csv2, **kw2)

print df2[[15,16,17,18,19,8]]

Output:

     0           1   2  3r  4r       5
0  id1  main_data1  a1  a2  a3  data13
1  id2  main_data2  b1  b2  b3   data3
2  id3  main_data3  c1  c2  c3  data51
3  id4  main_data4  d1  d2  d3   data7
4  id5  main_data5  e1  e2  e3     NaN

You don't have to use **kw for keyword arguments. I simply used it to make everything fit horizontally.

I let read_table and merge decide column names. If you assign column names yourself, you will get better looking output.

Upvotes: 3

mfitzp

Reputation: 15545

Since you originally asked for a Python solution to this I thought I would provide one. The simplest solution that occurred was to first load CSV1 and use it generate a mapping dictionary to use when generating the output from CSV2.

If I understand the input file correctly, only the values to the left of the ; (if there is one) are to be considered. This can be achieved by using split(';') and taking element zero. If there is no ; then element zero will be the entire string. Assignment to the mapper then just needs to follow the rules you've defined (only add if not already there, except when columns 4 & 5 match).

The code below produces your requested output:

import csv

mapper = dict()
with open('CSV1', 'r') as f1:
    reader = csv.reader(f1)
    for row in reader:
        # Column 3 contains the match; but we only want the left-most (before semi-colon)
        i = row[3].split(';')[0]
        # Column 4 contains the target value for output
        t = row[4]
        if i not in mapper:
            mapper[i] = t
        elif row[3] == row[4]:
            mapper[i] = t        

with open('CSV2', 'r') as f2:
    with open('FINAL_CSV', 'wb') as fo:
        reader = csv.reader(f2)
        writer = csv.writer(fo)
        for row in reader:
            if row[1] in mapper:
                row.append( mapper[ row[1] ] )
            writer.writerow(row)

The output file:

id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3

To address the 'main_data can be in any column of CSV' modification use the following code:

for row in reader:
    for r in row:
        if r in mapper:
            row.append( mapper[ r ] )
            break

    writer.writerow(row)

This will search each entry in the current row of CSV2 and if there is a match (to the original mapper data) append that mapped data to the row. The row will then be written as before.

Upvotes: 2

user3442743

Reputation:

A way with (g)awk .

 awk -F, 'NR==FNR{a[$2]=$0;next}
         {split($4,b,";");x=b[1]}
         (x in a)&&!c[x]++{d[x]=$5}
         ($5 in a){d[$5]=$5}
         END{n=asorti(a,e);for(i=1;i<=n;i++)print a[e[i]]","d[e[i]]}'  CSV1 CSV2

Output

id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3,

Upvotes: 3

Matthias Ossadnik

Reputation: 911

Since the condition for merging seems to be complicated it might be worthwhile to load the data into a database and use SQL. Using SQLite in-memory you can do this like this (assuming comma separated data)

import csv
import sqlite3

def createTable(cursor, rows, tablename):
    tableCreated = False
    for row in rows:
        if not tableCreated:
            sql = "CREATE TABLE %s(ROW INTEGER PRIMARY KEY, " + ", ".join(["c%d" % (i+1) for i in range(len(row))]) + ")"
            cur.execute(sql % tablename)
            tableCreated = True
        sql = "INSERT INTO %s VALUES(NULL, " + ", ".join(["'" + c + "'" for c in row]) + ")"
        cur.execute(sql % tablename)
    conn.commit()


conn = sqlite3.connect(":memory:")
cur = conn.cursor()

for filename, tablename in [(path_to_csv1, "CSV1"), (path_to_csv2, "CSV2")]:
    with open(filename, "r") as f:
        reader = csv.reader(f, delimiter=',')        
        rows = [row for row in reader]
    createTable(cur, rows, tablename)

You can then formulate your join logic in SQL. You can run queries like this:

for row in cur.execute(your_sql_statement):
    print row

The following query gives the desired output:

WITH
MATCHES AS( -- get all matches
    SELECT      CSV2.*
                , CSV1.ROW as ROW_1                 
                , CSV1.C4 as C4_1
                , CSV1.C5 as C5_1
    FROM        CSV2 
    LEFT JOIN   CSV1 
    ON          CSV1.C4 LIKE '%' || CSV2.C2 || '%'    
),
EXACT AS( -- matches where CSV1.C4 = CSV1.C5
    SELECT      *
    FROM        MATCHES
    WHERE       C4_1 = C5_1
),
MIN_ROW AS( -- CSV1.ROW of first occurence for each CSV2.C1
    SELECT      C1
                , min(ROW_1) as ROW_1
    FROM        MATCHES
    WHERE       C1 NOT IN (SELECT C1 FROM EXACT)
    GROUP BY    C1, C2, C3, C4, C5                  
)
-- use C4=C5 first
SELECT      *
FROM        EXACT
UNION
-- if match not in exact, use first occurence
SELECT      MATCHES.*
FROM        MIN_ROW
INNER JOIN  MATCHES
ON          MIN_ROW.C1 = MATCHES.C1
AND         (MIN_ROW.ROW_1 = MATCHES.ROW_1 OR MIN_ROW.ROW_1 IS NULL)
ORDER BY    C1

Upvotes: 2

Merge two CSV files based on a data from a column

Answers (4)

Related Questions