Reputation: 129
Hi there I have 2 excel files, where on one is a table with some names (col A, excel 1) and values and the other one contains a phylogenetic tree with all the data stored in 6 columns tab separated (excel 2).
Xlsx 1:
g__Mannheimia
g__Peptococcus
g__unspecific_Proteobacteria
g__Acinetobacter
Xlsx 2 (6 columns):
Row1:
d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Salmonella s__
Row2:
d__Bacteria p__Firmicutes c__Bacilli o__Lactobacillales f__Streptococcaceae g__Streptococcus s__
Row3:
d__Bacteria p__Actinobacteriota c__Actinobacteria o__Mycobacteriales f__Mycobacteriaceae g__Mycobacterium s__
Row4:
d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Klebsiella s__
Row5:
d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Pseudomonadales f__Moraxellaceae g__Acinetobacter s__
What I'm trying to do is, search each row of xlsx 2 for the names in xlsx 1, then if an exact match is found on any of the 6 columns, copy the whole row (6 columns in total) and replace the name (1 column, should also stay like that) in xlsx 1 with this data. This can be done by hand with the search and replace function, but because it's a huge amount of data I'm trying to find another way to do it.
I'm trying to solve this with the help of pandas and already tried some stuff. At the moment I'm stuck here, because I can't even find my matching data on the second xlsx.
import sys
import pandas as pd
analysis_data_df = pd.read_excel('Mund_Biome_Raw_Data.xlsx', sheet_name='Genus', header=None, usecols='A', skiprows=3)
analysis_data = analysis_data_df.values.tolist()
# print(List1)
source_data_df = pd.read_excel('170519_taxonomy_in_qiime.xlsx', sheet_name='97_otu_taxonomy', header=None)
source_data = source_data_df.values.tolist()
# print(list_2)
matching = [s for s in source_data if any(xs in s for xs in analysis_data)]
print(matching)
Any kind of help is much appreciated!
Upvotes: 0
Views: 84
Reputation: 33
You could use an algorithm called suffix automaton and do the search efficiently (O(n) taking k on O(nlogk) as a constant being its the alphabets size), you can find more about this alg and its implementation here: https://cp-algorithms.com/string/suffix-automaton.html
Because you are not making too complicated comparison, you could use a simpler and shorter algorithm like kmp which is O(T+P) being T the length of the larger string and P the pattern length to look for: https://www.google.com.co/amp/s/www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/amp/
Or use regex, which is in python standard library: https://docs.python.org/3/library/re.html This last approach can take up more time to compute, but its a simpler implementation
Cheers :)
Upvotes: 1