Reputation: 432
I'm working with two text files that are similar but not the same.
File 1:
GCF_000739415.1
GCF_001263815.1
GCF_001297745.1
...
File 2:
GCA_000739415.1
GCA_001263815.1
...
I'm looking for a specific pattern to differentiate them, which I'll call the ID. For example, the IDs in file 1 are GCF_000739415.1, GCF_001263815.1 and GCF_001297745.1, and the IDs in file 2 are GCA_000739415.1 and GCA_001263815.1. The only difference between the IDs is GCF versus GCA; the prefix only reflects the database they come from, and the numbers are the same. However, file 2 has no counterpart of GCF_001297745.1 (that is, GCA_001297745.1), so my goal is to report which IDs are not in both files. For example: "GC*_001297745.1 is not in file 2".
With this in mind, I'm using this code:
import re

# PART 1: Read the files
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()
# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))
#print(matches_1)
for match in matches_1:
    if match not in matches_2:
        print(f"{match} is not in both files")
My unexpected result is this:
GCF_000739415.1 is not in both files
GCF_001263815.1 is not in both files
GCF_001297745.1 is not in both files
What I need is something like this:
GC*_001297745.1 is not in both files
I put * in place of the third character (F or A) because that difference doesn't matter. I'm looking for IDs that are not in both files; any suggestion to fix this unexpected result is welcome.
Upvotes: 0
Views: 43
Reputation: 13533
You could just add capture groups to the regexes so that the GCF_ and GCA_ prefixes are not part of the results but still help in the search.
matches_1 = set(re.findall(r"GCF_([0-9]*\.[0-9])", str(contents_1)))
matches_2 = set(re.findall(r"GCA_([0-9]*\.[0-9])", str(contents_2)))
for match in matches_1:
    if match not in matches_2:
        print(f"GC*_{match} is not in both files")
Output
GC*_001297745.1 is not in both files
I also made the results sets to avoid duplicates. With them being sets, you can:
for match in matches_1.symmetric_difference(matches_2):
    print(f"GC*_{match} is not in both files")
Which I think will produce a better result, since your for loop only finds items from contents_1 that are not in contents_2, and misses items that are in contents_2 but not in contents_1.
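A minimal sketch of that two-way comparison, which also says which file each ID is missing from. It runs on in-memory sample lines rather than the real files. The helper name extract_ids and the sample data (including the made-up GCA_000000001.2 entry) are assumptions for illustration. The pattern uses + instead of * so that multi-digit versions such as .12 are captured whole, which differs from the regexes above.

```python
import re

# Hypothetical sample data standing in for the contents of the two files.
# GCA_000000001.2 is invented so that both directions have something to report.
lines_1 = ["GCF_000739415.1\n", "GCF_001263815.1\n", "GCF_001297745.1\n"]
lines_2 = ["GCA_000739415.1\n", "GCA_001263815.1\n", "GCA_000000001.2\n"]


def extract_ids(lines, prefix):
    """Return the set of numeric ID parts (e.g. '000739415.1') that follow prefix."""
    pattern = re.compile(re.escape(prefix) + r"_([0-9]+\.[0-9]+)")
    return set(pattern.findall("".join(lines)))


ids_1 = extract_ids(lines_1, "GCF")
ids_2 = extract_ids(lines_2, "GCA")

only_in_1 = sorted(ids_1 - ids_2)  # present in file 1, missing from file 2
only_in_2 = sorted(ids_2 - ids_1)  # present in file 2, missing from file 1

for id_ in only_in_1:
    print(f"GC*_{id_} is not in file 2")
for id_ in only_in_2:
    print(f"GC*_{id_} is not in file 1")
```

The two one-way differences together contain the same IDs as symmetric_difference; splitting them just keeps track of the direction.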
Upvotes: 1
Reputation: 3624
If you look at the values being compared, each one includes the 'GCA' / 'GCF' prefix at the start, so they will never be equal:
match (starting GCF): GCF_000739415.1
matches_2 (all starting GCA): ['GCA_000739415.1', 'GCA_001263815.1']
You will need to strip the GCF/GCA prefix and then compare.
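One way to do that without a regex is sketched below, assuming each element is a bare accession such as GCF_000739415.1 (the real summary files may have more columns per line). The helper name strip_prefix and the two sample lists are hypothetical, taken from the IDs in the question.

```python
def strip_prefix(accession):
    """Drop the database prefix: 'GCF_000739415.1' -> '000739415.1'."""
    # Split once on the first underscore and keep everything after it.
    return accession.strip().split("_", 1)[1]


# Sample IDs from the question, standing in for the two files.
file_1 = ["GCF_000739415.1", "GCF_001263815.1", "GCF_001297745.1"]
file_2 = ["GCA_000739415.1", "GCA_001263815.1"]

# Compare only the numeric parts, so GCF vs GCA no longer matters.
numbers_2 = {strip_prefix(a) for a in file_2}
missing = [strip_prefix(a) for a in file_1 if strip_prefix(a) not in numbers_2]

for number in missing:
    print(f"GC*_{number} is not in file 2")
```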
Upvotes: 1