Martin

Reputation: 117

How to find a string inside an XML-like tag from a file in Python?

I have an RDF document that looks like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:cd="http:xyz.com#">

<rdf:Description rdf:about="http:xyz.com#">
    <cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
    <cd:owner>arun</cd:owner>
    <cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
    <cd:purpose>Research</cd:purpose>
    <cd:metadata>10</cd:metadata>
    <cd:completeness>Partial</cd:completeness>
    <cd:completeness>Yes</cd:completeness>
    <cd:inclusion_1>age</cd:inclusion_1>
    <cd:feature_1>Sex</cd:feature_1>
    <cd:target>Diagnosis</cd:target>
</rdf:Description>

</rdf:RDF> 

From the above text, I need to extract the target (i.e. only the value between the opening and closing "cd:target" tags). The desired output is 'Diagnosis'. I tried an XML parser, but it does not work because the tag names contain ':'. Is there a better solution?

Update: This is what I tried; sorry for the naive coding style.

import xml.etree.ElementTree as et

def metadataParser(metadataFile):
    with open(metadataFile, 'r') as m:
        data = m.read() 
        # Load the xml content from a string
        content = et.fromstring(data)       
        description = content.find('rdf:Description')
        target = description.find("cd:target")

    return target   


target = metadataParser('metadata.rdf')
print(target)

Upvotes: 2

Views: 1201

Answers (5)

Martin Evans

Reputation: 46759

You can create a dictionary holding the namespace mappings seen at the top:

import xml.etree.ElementTree as ET


tree = ET.parse('input.xml')
ns = {'rdf' : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'cd' : 'http:xyz.com#'}

description = tree.find('rdf:Description', ns)
target = description.find('cd:target', ns)

print(target.text)

This would display:

Diagnosis

This approach is described in the Python xml.etree.ElementTree documentation.
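As a self-contained variation on the same idea, here is a minimal sketch that parses the document from a string instead of a file and also collects the repeated cd:completeness values. The sample text is shortened from the question, and the names RDF_TEXT, NS, extract_target and extract_completeness are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question (assumption: same namespaces).
RDF_TEXT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
    <cd:completeness>Partial</cd:completeness>
    <cd:completeness>Yes</cd:completeness>
    <cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>"""

# Prefix-to-URI mapping; the URIs must match the xmlns declarations exactly.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "cd": "http:xyz.com#",
}


def extract_target(rdf_text):
    """Return the text of the first cd:target element, or None if absent."""
    root = ET.fromstring(rdf_text)
    return root.findtext("rdf:Description/cd:target", default=None, namespaces=NS)


def extract_completeness(rdf_text):
    """Return the text of every cd:completeness element, in document order."""
    root = ET.fromstring(rdf_text)
    return [e.text for e in root.findall("rdf:Description/cd:completeness", NS)]


print(extract_target(RDF_TEXT))
print(extract_completeness(RDF_TEXT))
```

Using findtext with a default avoids the AttributeError that target.text would raise when the element is missing.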

Upvotes: 0

Chris

Reputation: 361

You could use the following regex. It extracts the text from every 'cd' tag in your file, one line at a time.

import re

# Matches an opening cd tag, captures its text, then matches a closing cd tag.
pattern = r"<cd:.*>(.*)</cd:.*>"

with open("file.rdf", "r") as file:
    for line in file:
        output = re.findall(pattern, line)
        if output:
            print(output[0])

And this outputs:

DPOT-5ab247867d368
arun
ACCESS-5ab247867d370
Research
10
Partial
Yes
age
Sex
Diagnosis

Explanation of the pattern variable:

  • the first .* matches any characters after "cd:" in the opening tag, i.e. the tag name
  • (.*) is the capture group: the text between the tags, which is what we want to extract
  • the last .* does the same as the first, matching any characters in the closing tag name

Note: I have included an if statement to check whether the output (which is a list) contains any elements; lines that produce no match are skipped (for example, your RDF header elements are excluded).
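Because .* is greedy, the pattern above only behaves well when each tag sits on its own line. A stricter variant, sketched below under that caveat, uses a non-greedy capture and a backreference so the closing tag must carry the same name as the opening one. The names SAMPLE, TAG_PATTERN, cd_values and first_value are hypothetical, and this is still a regex over XML rather than a real parser.

```python
import re

# A few lines from the question's document, used as sample input.
SAMPLE = """<rdf:Description rdf:about="http:xyz.com#">
    <cd:owner>arun</cd:owner>
    <cd:purpose>Research</cd:purpose>
    <cd:target>Diagnosis</cd:target>
</rdf:Description>"""

# Group 1 is the tag name, group 2 the text; \1 forces the closing tag
# to repeat the name captured in the opening tag.
TAG_PATTERN = re.compile(r"<cd:(\w+)>(.*?)</cd:\1>")


def cd_values(text):
    """Return a list of (tag_name, value) tuples for every cd element."""
    return TAG_PATTERN.findall(text)


def first_value(text, tag):
    """Return the value of the first cd:<tag> element, or None if not found."""
    for name, value in cd_values(text):
        if name == tag:
            return value
    return None


print(first_value(SAMPLE, "target"))
```

This also handles several cd elements on one line, which the greedy pattern would merge into a single match.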

Upvotes: 1

Robᵩ

Reputation: 168626

The rdf: and cd: parts are namespace prefixes. In your search, they need to be replaced with the actual namespace URIs wrapped in curly braces, like so:

description = content.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description')
target = description.find("{http:xyz.com#}target")
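Note that find returns an element, so the value itself is in its .text attribute. A minimal sketch of the full lookup follows; the helper qname and the constants RDF_NS, CD_NS and RDF_TEXT are names invented here to keep the curly-brace syntax readable, and the sample is shortened from the question.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CD_NS = "http:xyz.com#"

# Shortened copy of the document from the question.
RDF_TEXT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
    <cd:owner>arun</cd:owner>
    <cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>"""


def qname(namespace, local_name):
    """Build the '{namespace}local_name' form that ElementTree uses for tags."""
    return f"{{{namespace}}}{local_name}"


def find_target(rdf_text):
    """Return the text of cd:target inside rdf:Description, or None if absent."""
    root = ET.fromstring(rdf_text)
    description = root.find(qname(RDF_NS, "Description"))
    if description is None:
        return None
    target = description.find(qname(CD_NS, "target"))
    return None if target is None else target.text


print(find_target(RDF_TEXT))
```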

Upvotes: 1

Keyur Potdar

Reputation: 7238

You can use the BeautifulSoup module with its XML parser (the 'xml' parser requires lxml to be installed).

from bs4 import BeautifulSoup

XML = '''
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:cd="http:xyz.com#">

<rdf:Description rdf:about="http:xyz.com#">
    <cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
    <cd:owner>arun</cd:owner>
    <cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
    <cd:purpose>Research</cd:purpose>
    <cd:metadata>10</cd:metadata>
    <cd:completeness>Partial</cd:completeness>
    <cd:completeness>Yes</cd:completeness>
    <cd:inclusion_1>age</cd:inclusion_1>
    <cd:feature_1>Sex</cd:feature_1>
    <cd:target>Diagnosis</cd:target>
</rdf:Description>

</rdf:RDF>'''

soup = BeautifulSoup(XML, 'xml')

target = soup.find('target').text
print(target)
# Diagnosis

As you can see, it's pretty easy to use.

Upvotes: 2

John Gordon

Reputation: 33335

The cd: part is a namespace prefix. Namespaces are pretty common in XML, and just about any XML parser has a way to handle them.

Otherwise, if you are just looking for a single item and you don't care about structure, you could just do a simple string search and grab everything between <cd:target> and </cd:target>, like so:

rdf = '''rdf xml document'''
open_tag = '<cd:target>'
close_tag = '</cd:target>'
start = rdf.find(open_tag)
end = rdf.find(close_tag)
value = rdf[start + len(open_tag):end]
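One caveat: str.find returns -1 when the tag is missing, and the slice above would then silently return the wrong text. A small sketch that guards against this is shown below; the function name between is hypothetical.

```python
def between(text, open_tag, close_tag):
    """Return the text between the first open_tag and the next close_tag.

    Returns None when either tag is missing.
    """
    start = text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    # Search for the closing tag only after the opening tag.
    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end]


rdf = "<cd:feature_1>Sex</cd:feature_1>\n<cd:target>Diagnosis</cd:target>"
print(between(rdf, "<cd:target>", "</cd:target>"))
```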

Upvotes: 0
