Reputation: 432
I'm working with two text files that look like this: File 1:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCA_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1 na
GCF_001263815.1 PRJNA224116 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCA_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1 na
GCF_001297745.1 PRJNA224116 SAMD00040429 BCBV00000000.1 na 837 837 Porphyromonas gingivalis strain=Ando latest Scaffold Major Full 2015/09/17 ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1 an
...
File 2:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 na
GCA_001263815.1 PRJNA276132 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCF_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1 na
So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:
GCF_000739415.1
and file 2 this one:
GCA_000739415.1
The difference is the third character: F versus A. However, sometimes the numbers differ too. The difference between the files is in the third row of data. These two files contain a lot of patterns like the one above, but there are some differences between them. My goal is to find the patterns that exist in only one file and not in the other. For example, GCF_001297745.1 is in the third data row of file 1, but its counterpart GCA_001297745.1 is not in file 2.
I'm working on some Python code:
import re

# PART 1: Open and read the text files
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))

# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXISTS IN ONE FILE")
Part 3 refers to doing a for loop that searches for each line in both files and prints the patterns that only exist in one file and not in the other one. Any idea for doing this for loop?
Upvotes: 0
Views: 104
Reputation: 4496
Perhaps you are after this?
import re
given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"
# GC[A or F]_[number; one or more digits].[number; one or more digits]
regex = r"GC[AF]_\d+\.\d+"
matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")
Prints
GCA_000739415.1 is in both files
GCA_000739415.1 is in both files
But I would recommend:
# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))
Which gives:
['GCA_000739415.1']
Note the regex matches accessions of the form GC[A or F]_[one or more digits].[one or more digits]. Let me know if this is not what you are after.
I believe you are after the symmetric difference of the sets for files 1 and 2, which is a fancy way of saying "things in A or B that are not in both".
Which can be done with iteration:
# Iteration
# A set has no duplicates, and is unordered
sym_dif = set()
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
for match in matches_2:
    if match not in matches_1:
        sym_dif.add(match)
>>> list(sym_dif)
['GCF_001297745.1', 'GCA_001297745.1']
I think your mistake was not using a set (you shouldn't have any duplicates) and using matches_1 == matches_2. The lists won't be the same. You should check whether each match is not in the other set.
Or using this set notation which is the preferred method:
>>> list(set(matches_1).symmetric_difference(set(matches_2)))
['GCF_001297745.1', 'GCA_001297745.1']
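For completeness, here is a minimal end-to-end sketch that ties this back to the two files from the question (the filenames assembly_summary_genbank.txt and assembly_summary_refseq.txt are taken from the question's code; adjust them as needed):
import re

# Read both assembly summary files as plain text
with open("assembly_summary_genbank.txt") as f_1:
    text_1 = f_1.read()
with open("assembly_summary_refseq.txt") as f_2:
    text_2 = f_2.read()

# One pattern that matches both GCA_* and GCF_* accessions
regex = r"GC[AF]_\d+\.\d+"
matches_1 = set(re.findall(regex, text_1))
matches_2 = set(re.findall(regex, text_2))

# Accessions that appear in exactly one of the two files
print(sorted(matches_1.symmetric_difference(matches_2)))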
Upvotes: 2
Reputation: 903
Looking at these files, they follow the format described in the txt file link you shared: ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt
The assembly_summary.txt files have 22 tab-delimited columns, and header rows begin with '#'.
A better approach would be to open the files as tab-separated pandas data frames, apply a function that replaces the F with an A in each accession, and then merge the two files, or simply use the isin() method to get the indices of the common elements. Here is the code:
Notebook with rationale and algorithm, with comments -
https://colab.research.google.com/drive/1jJYnDpMVCt1spRsUek7RqlMnC_2O92sq?usp=sharing
import pandas as pd

url1 = "https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt"
url2 = "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt"
df1 = pd.read_csv(url1, sep='\t', low_memory=False)
df2 = pd.read_csv(url2, sep='\t', low_memory=False)

def replaceFs(string_test):
    # Replace the third character ('F' in GCF_...) with 'A' so RefSeq
    # accessions line up with their GenBank counterparts
    list_words = list(string_test)
    list_words[2] = 'A'
    return_string = ''.join(list_words)
    return return_string

def table_reform(unformed_df):
    # Use the first data row (the '# assembly_accession ...' line) as the column header
    unformed_df = unformed_df.reset_index()
    unformed_df = unformed_df.rename(columns=unformed_df.iloc[0])
    reformed_df = unformed_df[1:]
    return reformed_df

df1 = table_reform(df1)
df2 = table_reform(df2)

# Convert GCF_* to GCA_* in the RefSeq table, then inner-join on the accession column
df2['# assembly_accession'] = df2['# assembly_accession'].apply(replaceFs)
df_combine = pd.merge(df1, df2, on=['# assembly_accession'], how='inner')
df_combine
Which shows a huge data frame of 254000 rows × 45 columns.
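If you prefer the isin() route mentioned above over the merge, a rough sketch (assuming the same df1 and df2 and the '# assembly_accession' column name from the code above) could look like this:
# Boolean mask: which converted (GCF -> GCA) accessions in df2 also appear in df1
common_mask = df2['# assembly_accession'].isin(df1['# assembly_accession'])

# Rows present in both files, and rows unique to the RefSeq file
df_common = df2[common_mask]
df_refseq_only = df2[~common_mask]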
Hope this helps! My personal view from the CS angle is that double loops with regex are much slower on such large datasets than pandas operations.
Upvotes: 1