Unexpected result when comparing two files to report the differences between them

Question

I'm working with two text files that are similar but not identical.

File 1:

GCF_000739415.1
GCF_001263815.1
GCF_001297745.1
...

File 2:

GCA_000739415.1
GCA_001263815.1
...

I'm looking for a specific pattern in each file, which I'll call the ID. For example, the IDs in file 1 are GCF_000739415.1, GCF_001263815.1, and GCF_001297745.1, and the IDs in file 2 are GCA_000739415.1 and GCA_001263815.1. The only difference between corresponding IDs is the prefix, GCF versus GCA, which just indicates the database they come from; the numbers are the same.

However, file 2 has no counterpart of GCF_001297745.1 (that is, GCA_001297745.1 is missing), so my goal is to report which IDs are not in both files. For example: "GC*_001297745.1 is not in file 2".

With this in mind, I'm using this code:

import re

# PART 1: Read both files
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))
#print(matches_1)
for match in matches_1:
    if match not in matches_2:
        print(f"{match} is not in both files")

My unexpected result is this:

GCF_000739415.1 is not in both files
GCF_001263815.1 is not in both files
GCF_001297745.1 is not in both files

What I need is something like this:

GC*_001297745.1 is not in both files

I put * in place of the third character (F or A) because that difference doesn't matter. I'm looking for IDs that are not in both files. Any suggestion for fixing this unexpected result is welcome.

001 · Accepted Answer

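Your loop compares full strings, so an ID starting with GCF_ can never equal one starting with GCA_, and every ID is reported. You could just add capture groups to the regexes so that the GCF_ and GCA_ prefixes are not part of the results but still help in the search.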

matches_1 = set(re.findall(r"GCF_([0-9]*\.[0-9])", str(contents_1)))
matches_2 = set(re.findall(r"GCA_([0-9]*\.[0-9])", str(contents_2)))
for match in matches_1:
    if match not in matches_2:
        print(f"GC*_{match} is not in both files")

Output

GC*_001297745.1 is not in both files
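
To see why the capture groups help, here is a small sketch that uses the sample IDs from the question as inline strings instead of reading the files. The names text_1, text_2, whole, numbers_1, and numbers_2 are illustrative and not part of the original code. When a pattern contains exactly one capture group, re.findall returns only the captured text, so the database prefix drops out of the comparison:

```python
import re

# Sample IDs copied from the question, inlined so the sketch runs on its own.
text_1 = "GCF_000739415.1\nGCF_001263815.1\nGCF_001297745.1\n"
text_2 = "GCA_000739415.1\nGCA_001263815.1\n"

# No capture group: findall returns the whole match, prefix included.
whole = re.findall(r"GCF_[0-9]*\.[0-9]", text_1)

# One capture group: findall returns only the part inside the parentheses.
numbers_1 = re.findall(r"GCF_([0-9]*\.[0-9])", text_1)
numbers_2 = re.findall(r"GCA_([0-9]*\.[0-9])", text_2)

print(whole)      # full IDs, still carrying the GCF_ prefix
print(numbers_1)  # numeric parts only
print(numbers_2)  # numeric parts only
```

With the prefixes gone, "000739415.1" from file 1 compares equal to "000739415.1" from file 2, which is what the membership test needs.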

I also converted the results to sets to avoid duplicates. Because they are sets, you can also do:

for match in matches_1.symmetric_difference(matches_2):
    print(f"GC*_{match} is not in both files")

I think this produces a better result, since your for loop only finds items from contents_1 that are missing from contents_2, and not items from contents_2 that are missing from contents_1.
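
If you also want the message to say which file an ID is missing from, as in the "is not in file 2" example in the question, one possible sketch is to take the two one-sided set differences separately. The helper extract_ids and the variable names below are hypothetical, the sample text is inlined from the question rather than read from the files, and the pattern assumes the version after the dot may have more than one digit:

```python
import re


def extract_ids(text, prefix):
    """Return the set of numeric parts (e.g. '001297745.1') that follow prefix_."""
    # Assumption: one or more digits on each side of the dot.
    pattern = rf"{re.escape(prefix)}_([0-9]+\.[0-9]+)"
    return set(re.findall(pattern, text))


# Sample IDs copied from the question; real code would read the two files.
file_1_text = "GCF_000739415.1\nGCF_001263815.1\nGCF_001297745.1\n"
file_2_text = "GCA_000739415.1\nGCA_001263815.1\n"

ids_1 = extract_ids(file_1_text, "GCF")
ids_2 = extract_ids(file_2_text, "GCA")

# One-sided differences, sorted for stable output.
missing_from_2 = sorted(ids_1 - ids_2)
missing_from_1 = sorted(ids_2 - ids_1)

for number in missing_from_2:
    print(f"GC*_{number} is in file 1 but not in file 2")
for number in missing_from_1:
    print(f"GC*_{number} is in file 2 but not in file 1")
```

For the sample data this prints a single line, "GC*_001297745.1 is in file 1 but not in file 2". The union of the two differences is the same set that symmetric_difference returns; splitting it only keeps track of the direction.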
