Kay
Kay

Reputation: 11

How to compare and merge text files with differing column numbers?

Can anyone please direct me on how to perform the following merge in Python...

Text File #1:

5     apple
1     banana
3     pear
4     kiwi

(plus a few thousand more entries)

My text file #2 looks like

apple
orange    
strawberry
banana

I want to combine the two text files such that I only add those which are in both but also keep the original numbers from the text file #1 link to the corresponding identifier. So in this example, my merge would look like this:

5     apple
1     banana

Upvotes: 0

Views: 215

Answers (2)

Kay
Kay

Reputation: 11

I apologize for not providing information on my attempts earlier (I wasn't trying to ask for codes for free, just got stuck and needed some guidance).

Essentially I had a txt document with 700,000 words in paragraph form and I wanted to count the words and cross-reference it to another document which was in list form. I got this far

fname = raw_input("Enter file name: ")
fh = open(fname)
inp = fh.read().upper()
new_fh2 = inp.replace('.','').replace(',','').replace('?','')
new_fh3 = new_fh2.replace('-','').replace('_','').replace(';','')
new_fh4 = new_fh3.replace(':','').replace('!','').replace('(','')
new_fh5 = new_fh4.replace(')','').replace('/','')
new_fh6 = new_fh5.replace('|','').replace('&','').replace('[','')
new_fh7 = new_fh6.replace(']','').replace('%','').replace('+','')
new_fh8 = new_fh7.replace('*','').replace('@','').replace('=','')
new_fh9 = new_fh8.replace('>','').replace('<','')
new_fh10 = new_fh9.replace('{','').replace('}','').replace('~','')
new_fh11 = new_fh10.replace('"','').split()
new_fh12 = sorted(set(new_fh11)) 
for word in new_fh12:
    print new_fh11.count(word), word`

At this point I was prepared to use LibreOffice Base to do my comparison using 2 tables, but even with the count function that reduced by word count from 700k to 34k, entering in data crashed the program whenever I tried to upload. So I had to try and think of a code which would allow me to compare the two txt files in python, which handles this volume of data nicely. And I really had NO idea where to even begin, although I did know of a few merge functions I just didn't know how to define the merge. I ended up doing this instead

new_fh12 = new_fh11.split()
new_fh12.sort()
for x in sorted(new_fh12):
    print x

then I took this list and put it into excel in one column, added my second list to another column, then used the countif function to count and compare the two lists.

Upvotes: 0

Rakesh Adhikesavan
Rakesh Adhikesavan

Reputation: 12826

Here is one possible approach:

Edit: taking the comment into consideration

I would first read your text file #1 into a Python Dictionary

d = dict()
with open("file1.txt") as f:
    for line in f:
       (val, key) = line.split()
       d[key] = int(val)

print d

Out: {'kiwi': 4, 1: 'banana', 3: 'pear', 4: 'kiwi', 5: 'apple', 'pear': 3, 'banana': 1, 'apple': 5}

Now, we read file 2 as a Python list

with open("file2.txt") as f:
    l = f.read().splitlines()

print l
Out: ['apple', 'orange', 'strawberry', 'banana']

Now, create another dictionary with the desired output:

d2 = {key:val for key,val in d.iteritems() if key in l}
print d2
Out: {'apple': 5, 'banana': 1}

I will leave it up to you to figure out how to write a dictionary into a text file. I would use pandas to convert it into a dataFrame and write the dataFrame as csv or a tsv. This is a workaround, there has to be a more direct way to do it.

Upvotes: 1

Related Questions