kromatix

Reputation: 17

Comparing two files for similarities, not the commonly asked question

My problem: I have two .txt files, and I would like to use one of them as a guide for filtering the second, appending the matching lines to a new .txt file (file 3).

For example: File 1: a list of names. File 2: a list of names and email addresses.

If a line in file 2 does not contain any name from file 1, drop that line; if it does, append the matching line to a new .txt file.

I have googled this question every way I could word it, and even found a web application that does exactly this, but it cannot handle files of the size I need. I have attempted to write a Python script for it (I am fairly new to programming). From what I have read, I'm sure it would be easier with something like NumPy, which I do not know. I just need a nudge in the right direction; this is slightly outside my skill set. I can write a web-scraping script using regex and other basic things like that, but this is something I need to solve quickly, and I cannot find a solution elsewhere that truly fits the problem. Every other answer to similarly worded questions deals with a single string, or shows differences rather than similarities.

This is my attempt, which is obviously incorrect:

    file1 = input("Input file 1: ")
    file2 = input("Input file 2: ")
            
    with open("file1.txt", r) as f1:
        lines1 = f1.read.splitlines()
        names = file1.split(";")[0]
        emails = file1.split(";")[1]
    with open("file2.txt", r) as f2:
        lines2 = f2.read.splitlines()
            
        newfile = open("newfile", w)
            
    for names in lines2:
        strip(line)
        newfile.write(line)

I would really appreciate some advice or a nudge in the right direction. Thank you!

File sample:

file 1:
[email protected]  
[email protected]  
[email protected]  
[email protected]  

File 2:  
1.Jack Young;[email protected]  
2.George Russel;[email protected]  
3.Susan Shields;[email protected]  
4.Mary Cartwright;[email protected]  
5.Heather Carter;[email protected]  
6.Denise Black;[email protected]  
7.Tanner Tennebaum;[email protected]  
8.John Grable;[email protected]  
9.Connor Hawk;[email protected]  

So I am looking to extract the first 4 Name;Email lines from file 2, using file 1 as the source of interesting data.


Answers (1)

Hugo G

Reputation: 16496

My assumptions:

  • file1.txt contains one email address per line and nothing else.
  • file2.txt contains one name and one email address per line, separated by a semicolon.
  • You are looking for matches between the two datasets.

    # first, read in the smaller set of entries we are interested in finding
    with open("file1.txt") as file1:
        # strip leading/trailing whitespace (e.g. newlines) from each line
        needles = [line.strip() for line in file1]
        # drop empty lines and keep the rest in a set for fast lookups
        needles = set(filter(bool, needles))

    # now open the haystack file2 and the output file3
    with open("file2.txt") as file2, open("file3.txt", "w") as file3:
        # iterate line by line so the whole file never has to sit in memory
        for line in file2:
            # strip any whitespace from the line
            line = line.strip()
            # ...and skip over empty lines
            if not line:
                continue
            # we assume each line contains exactly one semicolon that separates name from email
            name, email = line.split(";")
            # does the email match any of the ones we are looking for?
            if email in needles:
                # then format the line correctly and write it to the output file
                file3.write("{};{}\n".format(name, email))
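
The script above relies on every non-empty line in file2.txt containing exactly one semicolon; `line.split(";")` raises a `ValueError` on a line with none or with more than one. If the real data might be less clean, a more tolerant variant could look like the sketch below. The function name `filter_lines` and the example.com addresses are made up for illustration, and splitting on the last semicolon is an assumption about where the email sits.

```python
def filter_lines(needles, haystack_lines):
    """Yield the stripped lines whose last ';'-separated field is in needles."""
    # normalise the lookup set: strip whitespace, drop empty entries
    wanted = {n.strip() for n in needles if n.strip()}
    for line in haystack_lines:
        line = line.strip()
        if not line:
            continue
        # rpartition splits on the LAST semicolon and never raises;
        # sep is "" when the line contains no semicolon at all
        name, sep, email = line.rpartition(";")
        if sep and email.strip() in wanted:
            yield line


if __name__ == "__main__":
    # hypothetical data, only to show the call shape
    wanted = ["ann@example.com\n", "\n"]
    lines = ["Ann;ann@example.com\n", "broken line\n", "Cy;cy@example.com\n"]
    for hit in filter_lines(wanted, lines):
        print(hit)
```

Because it takes any iterable of lines, it can be fed open file objects directly, e.g. `filter_lines(file1, file2)`, and it still reads file2 one line at a time.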

Contents of the input files:

file1.txt

[email protected]  
[email protected]  
[email protected]  
[email protected] 

file2.txt

Jack Young;[email protected]  
George Russel;[email protected]  
Susan Shields;[email protected]  
Mary Cartwright;[email protected]  
Heather Carter;[email protected]  
Denise Black;[email protected]  
Tanner Tennebaum;[email protected]  
John Grable;[email protected]  
Connor Hawk;[email protected]  

Contents of file3.txt after running the script:

Jack Young;[email protected]
George Russel;[email protected]
Susan Shields;[email protected]
Mary Cartwright;[email protected]
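
The question's text describes file 1 as a list of names, while its sample shows email addresses. If the key really is the name, only the compared field changes. The following is a rough sketch under several assumptions that are not confirmed by the question: one name per line in file 1, matching that ignores letter case, and an optional leading counter such as `1.` in file 2 (as in the question's sample). The name `match_by_name` and the example.com addresses are hypothetical.

```python
import re

# matches an optional leading counter such as "12." at the start of the name
_COUNTER = re.compile(r"^\d+\.\s*")


def match_by_name(names, haystack_lines):
    """Return the stripped lines whose name field appears in names."""
    # case-insensitive lookup set, ignoring empty entries
    wanted = {n.strip().casefold() for n in names if n.strip()}
    matches = []
    for line in haystack_lines:
        line = line.strip()
        # the name is everything before the first semicolon
        name, sep, _email = line.partition(";")
        if not sep:
            continue  # empty line or no semicolon: skip it
        key = _COUNTER.sub("", name).strip().casefold()
        if key in wanted:
            matches.append(line)
    return matches


if __name__ == "__main__":
    names = ["Jack Young", "susan shields"]
    lines = ["1.Jack Young;jack@example.com", "2.George Russel;george@example.com"]
    print(match_by_name(names, lines))
```

Unlike the generator-style loop in the script above, this collects the matches in a list, which is fine for moderate inputs; for very large files, writing each match to the output file as it is found keeps memory use flat.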
