Reputation: 117
I have an RDF document that looks like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
<cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
<cd:owner>arun</cd:owner>
<cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
<cd:purpose>Research</cd:purpose>
<cd:metadata>10</cd:metadata>
<cd:completeness>Partial</cd:completeness>
<cd:completeness>Yes</cd:completeness>
<cd:inclusion_1>age</cd:inclusion_1>
<cd:feature_1>Sex</cd:feature_1>
<cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>
From the text above, I need to extract the target (i.e. only the value between the opening and closing "cd:target" tags). The desired output is 'Diagnosis'. I tried an XML parser, but it does not work because the tag names contain ':'. Is there a better solution?
Update: This is what I tried; sorry for the naive coding style.
import xml.etree.ElementTree as et
def metadataParser(metadataFile):
    with open(metadataFile, 'r') as m:
        data = m.read()
    # Load the xml content from a string
    content = et.fromstring(data)
    description = content.find('rdf:Description')
    target = description.find("cd:target")
    return target
target = metadataParser('metadata.rdf')
print(target)
Upvotes: 2
Views: 1201
Reputation: 46759
You can create a dictionary holding the namespace mappings seen at the top:
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
ns = {'rdf' : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'cd' : 'http:xyz.com#'}
description = tree.find('rdf:Description', ns)
target = description.find('cd:target', ns)
print(target.text)
This would display:
Diagnosis
This approach is described in the Python xml.etree.ElementTree documentation.
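As a self-contained sketch of the same namespace-dictionary idea, the snippet below parses a trimmed copy of the question's document from a string instead of a file. The names RDF_DOC and NS are only illustrative, and the document is shortened on purpose; findtext and findall are standard ElementTree methods.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question (illustrative only).
RDF_DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http:xyz.com#">
  <rdf:Description rdf:about="http:xyz.com#">
    <cd:completeness>Partial</cd:completeness>
    <cd:completeness>Yes</cd:completeness>
    <cd:target>Diagnosis</cd:target>
  </rdf:Description>
</rdf:RDF>"""

# Prefix -> namespace URI, copied from the xmlns declarations above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "cd": "http:xyz.com#",
}

root = ET.fromstring(RDF_DOC)

# findtext returns the element text directly, or None if nothing matches.
target = root.findtext("rdf:Description/cd:target", namespaces=NS)

# findall collects every match, which helps with repeated tags.
completeness = [el.text for el in root.findall(".//cd:completeness", NS)]

print(target)        # Diagnosis
print(completeness)  # ['Partial', 'Yes']
```

Using findtext avoids an AttributeError when the element is missing, since it returns None instead of requiring a .text lookup on a missing element.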
Upvotes: 0
Reputation: 361
You could use the following regex. It gets the data from within every 'cd' tag in your file:
import re
with open("file.rdf", "r") as file:
    for lines in file:
        pattern = "<cd:.*>(.*)</cd:.*>"
        output = re.findall(pattern, lines)
        if len(output) != 0:
            print(output[0])
And this outputs:
DPOT-5ab247867d368
arun
ACCESS-5ab247867d370
Research
10
Partial
Yes
age
Sex
Diagnosis
Explanation of the pattern variable:
.* tells the script that we want ANY characters in this position (here, the rest of the tag name).
(.*) tells the script that this is the section we want to capture.
.* in the closing tag does pretty much the same as before: it matches ANY characters.
Note: I have included an if statement to check whether the output (which is a list) contains any elements; if not, that line is excluded from the output (for example, your heading RDF elements will be excluded).
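If only the target is wanted, a narrower pattern may be safer than matching every 'cd' tag. The sketch below assumes the tag is written exactly as <cd:target> with no attributes, and uses a non-greedy capture so that two elements on one line would not be merged. The names sample and TARGET_RE are made up for illustration, and regex matching on XML remains fragile in general.

```python
import re

# Two lines taken from the question's document (illustrative sample).
sample = "<cd:owner>arun</cd:owner>\n<cd:target>Diagnosis</cd:target>\n"

# Non-greedy capture between the literal opening and closing tags.
# re.DOTALL lets the value span several lines if it ever needs to.
TARGET_RE = re.compile(r"<cd:target>(.*?)</cd:target>", re.DOTALL)

match = TARGET_RE.search(sample)
target = match.group(1) if match else None
print(target)  # Diagnosis
```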
Upvotes: 1
Reputation: 168626
The rdf: and cd: parts are namespace prefixes. In your search they need to be replaced with the actual namespace URIs, wrapped in braces, like so:
description = content.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description')
target = description.find("{http:xyz.com#}target")
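Putting those two lines into a complete function might look like the sketch below. It takes the XML as a string rather than a file name, and the names extract_target, RDF_NS, CD_NS and doc are hypothetical. Note that find returns an Element, so .text is needed to get 'Diagnosis' rather than the element object.

```python
import xml.etree.ElementTree as ET

# Namespace URIs in the {uri}tag form that ElementTree uses internally.
RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
CD_NS = "{http:xyz.com#}"


def extract_target(xml_text):
    """Return the text of cd:target, or None if it is not present."""
    root = ET.fromstring(xml_text)
    description = root.find(RDF_NS + "Description")
    if description is None:
        return None
    target = description.find(CD_NS + "target")
    # find() gives an Element; its .text attribute holds the value.
    return None if target is None else target.text


# Shortened version of the question's document (illustrative only).
doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http:xyz.com#">
  <rdf:Description rdf:about="http:xyz.com#">
    <cd:owner>arun</cd:owner>
    <cd:target>Diagnosis</cd:target>
  </rdf:Description>
</rdf:RDF>"""

print(extract_target(doc))  # Diagnosis
```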
Upvotes: 1
Reputation: 7238
You can use the BeautifulSoup module with its XML parser (the 'xml' option requires lxml to be installed).
from bs4 import BeautifulSoup
XML = '''
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
<cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
<cd:owner>arun</cd:owner>
<cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
<cd:purpose>Research</cd:purpose>
<cd:metadata>10</cd:metadata>
<cd:completeness>Partial</cd:completeness>
<cd:completeness>Yes</cd:completeness>
<cd:inclusion_1>age</cd:inclusion_1>
<cd:feature_1>Sex</cd:feature_1>
<cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>'''
soup = BeautifulSoup(XML, 'xml')
target = soup.find('target').text
print(target)
# Diagnosis
As you can see, it's pretty easy to use.
Upvotes: 2
Reputation: 33335
The cd: part is a namespace prefix. Namespaces are pretty common in XML, and just about any XML parser has a way to handle them.
Otherwise, if you are just looking for a single item and you don't care about structure, you could just do a simple string search and grab everything between <cd:target> and </cd:target>, like so:
rdf = '''rdf xml document'''
open_tag = '<cd:target>'
close_tag = '</cd:target>'
start = rdf.find(open_tag)
end = rdf.find(close_tag)
value = rdf[start + len(open_tag):end]
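One caveat with the plain string search: str.find returns -1 when a tag is missing, and the slice then silently yields the wrong text. A guarded variant could look like the sketch below; the helper name between and the sample string are made up for illustration.

```python
def between(text, open_tag, close_tag):
    """Return the text between the first open_tag and the next close_tag,
    or None if either tag is missing."""
    start = text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    # Search for the closing tag only after the opening tag.
    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end]


sample = "<cd:feature_1>Sex</cd:feature_1>\n<cd:target>Diagnosis</cd:target>"
print(between(sample, "<cd:target>", "</cd:target>"))  # Diagnosis
```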
Upvotes: 0