Reputation: 63

Find the similar columns in two files

I am using Python and want to write code that compares only the first two columns (indices 0 and 1) of two files. If a line in the other file has the same values in columns 0 and 1, the two lines should be merged and written to a new file.

for example:

Or example with persons:

This is the code I have now, but it does not work properly:

f1 = open('f1.txt','r')
f2 = open('f2.txt','r')
f3 = open('f12.txt','w')


f1_readlines = f1.readlines()
f1_linenum = len(f1_readlines)


f2_readlines = f2.readlines()
f2_linenum = len(f2_readlines)

i=0
while(i<f1_linenum):
    j=0
    while(j<f2_linenum):
        if(f1_readlines[0:1] == f2_readlines[0:1]):
            print(f1_readlines[i])
            f12.write(f1_readlines[i])
        j = j + 1
    i = i + 1

f1.close()
f2.close()
f12.close()

Upvotes: 1

Answers (3)

jklasdjkl

Reputation: 11

I have tried to solve it in a naive way

# Read both input files fully into memory
with open('f1.txt', 'r') as f1:
    f1_readlines = f1.readlines()
with open('f2.txt', 'r') as f2:
    f2_readlines = f2.readlines()

with open('fusion.txt', 'w') as f3:
    for f1_line in f1_readlines:
        # split() with no argument also drops the trailing newline
        arr1 = f1_line.split()
        if len(arr1) < 2:
            continue  # skip blank or incomplete lines

        for f2_line in f2_readlines:
            arr2 = f2_line.split()
            if len(arr2) < 2:
                continue  # skip blank or incomplete lines

            # Merge lines whose first two columns match
            if arr1[0] == arr2[0] and arr1[1] == arr2[1]:
                # whole f1 line, then the remaining columns of the f2 line
                output = arr1 + arr2[2:]
                f3.write(" ".join(output) + "\n")

10 10 55
77 77 77
20 22 20
11 13 11
2 23 23
56 66 55 99
78 56 56

fusion

10 10 25 55
56 66 55 55 99
78 56 56 7 56
77 77 77 77

Hope this can solve it :)

Upvotes: 1

Rory Daulton

Reputation: 22564

Here is an algorithm to do what you want with two unsorted files. It needs only one of the files in memory, and its running time is linear in the sum of the lengths of the input files. In other words, it uses little memory and little time; many other algorithms will take more space or time. (The answer by @dede seems to work well and is simpler than mine, but it uses both more memory and more time, which will be very noticeable for large input files.)

First, read through the first file, line by line, and build a dictionary from it. Each key is a tuple of the first two items in a line, and the corresponding value is the rest of the line. In your second example the resulting dictionary would be

{('Brad', 'Pitt'): 'cherries', ('Angelina', 'Jolie'): 'bred', ('Jack', 'Nicholson'): 'apples', ('Nicole', 'Kidman'): 'cucumber'}

You then create an empty output file and read through the second input file line by line. For each line, you see if the first two items are in the dictionary. If so, print your desired line to the output file. If not, do nothing with that line.

Your main use of memory is then the dictionary. You went through each input file only once and line by line, so it is fast. The main possible weakness of this approach is that the output file will be in the order that the items were in the second input file, which is the order in your second example. If you desire the order of the first input file instead, just swap the usage of the two input files.
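As a rough illustration of that swap, the sketch below indexes the second input and streams the first, so matches come out in the first input's order while the first input's value still comes first in each merged line. The function name merge_in_first_file_order, the in-memory line lists, and the sample data are all hypothetical, and each line is assumed to hold exactly three items.

```python
def merge_in_first_file_order(first_lines, second_lines):
    """Merge lines sharing their first two columns, in first-input order.

    Hypothetical helper: both arguments are iterables of text lines,
    each assumed to hold exactly three whitespace-separated items.
    """
    # Index the SECOND input by its first two columns.
    second_rest = {}
    for line in second_lines:
        col0, col1, rest = line.split()
        second_rest[(col0, col1)] = rest

    # Stream the FIRST input, so matches come out in its order.
    merged = []
    for line in first_lines:
        col0, col1, rest = line.split()
        other = second_rest.get((col0, col1))
        if other is not None:
            merged.append('{} {} {} {}'.format(col0, col1, rest, other))
    return merged


# Made-up sample data, not taken from the question.
first = ['10 10 25\n', '30 30 7\n', '20 20 9\n']
second = ['20 20 1\n', '10 10 55\n']
print(merge_in_first_file_order(first, second))
```

The trade-off is that the second file, rather than the first, is now the one held in memory.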

Here is my code for that algorithm. This version assumes that each input line has exactly three items separated by spaces or tabs. If the "third item" in a line could itself contain a space or a tab, the code would need to be slightly more complicated. Using your example input files, the results from this code are just what you wanted.

def similar_columns(filenameinput1, filenameinput2, filename_output):
    """Find the similar columns in two files.
    This assumes each line has exactly three items.
    """
    # Build a dictionary of the items in the first input file
    items_dict = {}
    with open(filenameinput1, 'r') as inputfile1:
        for line in inputfile1:
            col0, col1, oldcolrest = line.split()
            items_dict[(col0, col1)] = oldcolrest

    # Compare the items in the second input file, saving matches
    with open(filenameinput2, 'r') as inputfile2,  \
         open(filename_output, 'w') as outputfile:
        for line in inputfile2:
            col0, col1, newcolrest = line.split()
            oldcolrest = items_dict.get((col0, col1), None)
            if oldcolrest is not None:
                outputfile.write('{} {} {} {}\n'.format(
                        col0, col1, oldcolrest, newcolrest))

similar_columns('f1.txt', 'f2.txt', 'f12.txt')
similar_columns('shop1.txt', 'shop2.txt', 'total.txt')
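
If the rest of a line may itself contain spaces or tabs, one possible adjustment is to split at most twice, so everything after the second column stays together. This is only a sketch under that assumption; split_key_and_rest is a hypothetical helper and not part of the code above.

```python
def split_key_and_rest(line):
    """Split a line into ((col0, col1), rest_of_line).

    Hypothetical helper: rest_of_line keeps any inner spaces or tabs
    and is an empty string when the line has only two items.
    Returns None for blank or one-item lines.
    """
    parts = line.split(None, 2)  # split on whitespace at most twice
    if len(parts) < 2:
        return None
    rest = parts[2].rstrip() if len(parts) == 3 else ''
    return (parts[0], parts[1]), rest


# Made-up sample line with a two-word third item.
print(split_key_and_rest('Brad Pitt sour cherries\n'))
```

The two `line.split()` calls in similar_columns could be replaced with this helper, skipping any line for which it returns None.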

Upvotes: 1

dede

Reputation: 726

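What exactly is the problem with your code?

Okay... using a while loop is not the best approach. You can iterate over the lines like this:

with open('f1.txt', 'r') as f1:
    f1_readlines = f1.readlines()
for ln in f1_readlines:
    # each line already ends with a newline, so suppress print's own
    print(ln, end='')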

(EDIT) This code:

f1_readlines = [[10, 10, 10], [20, 20, 20], [30, 30, 30]]
f2_readlines = [[20, 20, 25], [30, 30, 11], [25, 25, 55], [44, 44, 58]]
f12 = []

for f1e in f1_readlines:
    for f2e in f2_readlines:
        # compare only the first two columns
        if f1e[0] == f2e[0] and f1e[1] == f2e[1]:
            # build a new merged row instead of modifying f1e in place
            f12.append(f1e + [f2e[2]])
            break  # stop at the first match for this row

for e in f12:
    print(e)

gives:

[20, 20, 20, 25]
[30, 30, 30, 11]
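
The loops above start from lists that are already in memory. To get such lists from the actual files, something like the following might work. It is a sketch that assumes every field is an integer; read_rows is a hypothetical helper, and io.StringIO stands in for an opened file such as open('f1.txt').

```python
import io


def read_rows(text_file):
    """Hypothetical helper: parse each non-blank line into a list of ints."""
    rows = []
    for line in text_file:
        fields = line.split()
        if fields:  # ignore blank lines
            rows.append([int(field) for field in fields])
    return rows


# Made-up file content chosen to match the f1_readlines list above.
sample = io.StringIO('10 10 10\n20 20 20\n\n30 30 30\n')
print(read_rows(sample))
```

With real files, `with open('f1.txt') as f1: f1_readlines = read_rows(f1)` would produce the same kind of list of lists.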

Upvotes: 1
