RRg

Reputation: 123

Find position of a particular word in a string

I have a list of genes, and I need to check whether any gene from the list appears in the 'ArticleTitle' of an XML record. If it does, I also need the start and end position of the gene within the title.

The code below identifies the gene and reports its word index in the title. However, I need help finding the character start position and end position of the gene.

import xml.etree.ElementTree as ET

# `tree` is the ElementTree parsed earlier (not shown here)
doc = tree.getroot()
for ArticleTitle in doc.iter('ArticleTitle'):
    file1 = (ET.tostring(ArticleTitle, encoding='utf8').decode('utf8'))
    filename = file1[52:(len(file1))]
    Article= filename.split("<")[0]
    # print(Article)
    # print(type(Article))
    title= Article.split()
    gene_list = ["ABCD1","ADA","ALDOB","APC","ARSB","ATAD3B","AXIN2","BLM","BMPR1A","BRAF","BRCA1"] 
    for item in title:
        for item1 in gene_list:
            if item == item1:
                str_title= ' '.join(title)
                print(str_title)
                print("Gene Found: " + item)
                index= title.index(item)
                print("Index of the Gene :" +str(index))

                result = 0
                for char in str_title:
                    result +=1
                print(result)

Current output is:

Healthy people 2000: a call to action for ADA members.
Gene Found: ADA
Index of the Gene :8
54

Expected output is:

Healthy people 2000: a call to action for ADA members.
Gene Found: ADA
Index of the Gene :8
Gene start position: 42
Gene end position: 45

The start and end positions are character offsets, so they should count the spaces between the words too.

Upvotes: 0

Views: 182

Answers (2)

qaiser

Reputation: 2868

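You can also use the flashtext library. Passing span_info=True makes extract_keywords return the start and end offsets of each match: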

from flashtext import KeywordProcessor

kpo = KeywordProcessor(case_sensitive=True)

gene_list = ["ABCD1","ADA","ALDOB","APC","ARSB","ATAD3B","AXIN2","BLM","BMPR1A","BRAF","BRCA1"] 

for word in gene_list:
    kpo.add_keyword(word)

kpo.extract_keywords("Healthy people 2000: a call to action for ADA members.",span_info=True)
# output: [('ADA', 42, 45)]
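To get from those span tuples to the output format the question asks for, something like the following might work. It is only a sketch: `report_spans` is a hypothetical helper, not part of flashtext, and the spans are hard-coded from the output shown above so that the snippet runs without the library installed. The word index is recovered by counting the whitespace-separated words that end before the match.

```python
def report_spans(text, spans):
    """Format (keyword, start, end) tuples in the style of the question's expected output."""
    lines = [text]
    for keyword, start, end in spans:
        prefix = text[:start]
        # Number of whitespace-separated words that end before the match.
        word_index = len(prefix.split())
        if prefix and not prefix[-1].isspace():
            # The match starts mid-token, e.g. "(ADA)"; that token is not finished yet.
            word_index -= 1
        lines.append("Gene Found: " + keyword)
        lines.append("Index of the Gene :" + str(word_index))
        lines.append("Gene start position: " + str(start))
        lines.append("Gene end position: " + str(end))
    return "\n".join(lines)


title = "Healthy people 2000: a call to action for ADA members."
# Spans in the (keyword, start, end) shape shown in the answer above.
print(report_spans(title, [("ADA", 42, 45)]))
```

The end offset is exclusive, so `title[42:45]` gives back `'ADA'`.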

Upvotes: 1

mad_

Reputation: 8273

You could use a regular expression:

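import re

genes = ["ABCD1", "ADA", "ALDOB", "APC", "ARSB"]
pattern = '|'.join(genes)  # "ABCD1|ADA|ALDOB|APC|ARSB"
test_string = 'Healthy people 2000: a call to action for ADA members.'
pos = 0  # index of the current word
for word in test_string.split():
    m = re.search(pattern, word)
    if m:
        gene = m.group(0)
        # Character offset of the first occurrence of the gene in the whole string
        start = test_string.find(gene)
        end = start + len(gene)
        print((start, end, gene, pos))
    pos += 1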

Output

(42, 45, 'ADA', 8)
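One caveat with the loop above: `find` always returns the first occurrence in the whole string, so a gene that appears twice would get the same offsets both times. A possible variant, sketched here with a hypothetical `locate_genes` helper, walks the tokens with `re.finditer(r"\S+", ...)` so that each token carries its own offset. It swaps the regex alternation for an exact set lookup on each token (after trimming surrounding punctuation), which is closer to the `item == item1` comparison in the question.

```python
import re
import string


def locate_genes(text, genes):
    """Return (start, end, gene, word_index) for each token that is exactly a gene name."""
    wanted = set(genes)
    hits = []
    # Every \S+ match is one whitespace-separated token and knows where it starts.
    for word_index, token in enumerate(re.finditer(r"\S+", text)):
        raw = token.group(0)
        # Ignore punctuation glued to the token, e.g. "ADA," or "(ADA)".
        lead = len(raw) - len(raw.lstrip(string.punctuation))
        core = raw.strip(string.punctuation)
        if core in wanted:
            start = token.start() + lead
            hits.append((start, start + len(core), core, word_index))
    return hits


title = "Healthy people 2000: a call to action for ADA members."
print(locate_genes(title, ["ABCD1", "ADA", "ALDOB", "APC", "ARSB"]))
# [(42, 45, 'ADA', 8)]
```

Because the comparison is on whole tokens, a word such as "CANADA" is not reported as a hit for ADA.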

A shorter solution, which returns the character offsets of every match but not the word index, could be:

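import re

genes = ["ABCD1", "ADA", "ALDOB", "APC", "ARSB"]
pattern = '|'.join(genes)
test_string = 'Healthy people 2000: a call to action for ADA members.'

# (start, gene, end) for every match anywhere in the string
print([(m.start(), m.group(0), m.end()) for m in re.finditer(pattern, test_string)])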
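Note that a plain `'|'.join(...)` alternation also matches inside longer words, so ADA would be found in "CANADA" or "ADAM". If only whole words should count, one option is to wrap the alternation in `\b` word boundaries and escape each name. The sketch below assumes that reading of the requirement; `build_gene_pattern` is a hypothetical helper name.

```python
import re


def build_gene_pattern(genes):
    """Compile a pattern that matches any of the gene names as a whole word."""
    # re.escape guards against names that contain regex metacharacters.
    alternation = "|".join(re.escape(gene) for gene in genes)
    return re.compile(r"\b(?:" + alternation + r")\b")


pattern = build_gene_pattern(["ABCD1", "ADA", "ALDOB", "APC", "ARSB"])
text = "ADA is not the same token as CANADA or ADAM."
print([(m.start(), m.group(0), m.end()) for m in pattern.finditer(text)])
# [(0, 'ADA', 3)]
```

A hyphen still counts as a word boundary for `\b`, so a compound such as "ADA-SCID" would be reported as a match for ADA; whether that is desirable depends on the data.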

Upvotes: 1
