I have a list like this:
['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
From this I want to make sublists like:
id = ["32a45", "32a47", "32a48"]
date=["2017-01-01", "2017-01-05", "2017-01-07"]
How can I do that?
Thanks.
Edit: This was the original question. It is a broken XML file and the tags are messed up, hence I cannot use xmltree, so I am trying something else.
Upvotes: 2
Views: 143
Reputation: 18906
Parsing with ET:
import xml.etree.ElementTree as ET
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
id_ = []
date = []
for string in strings:
    tree = ET.fromstring(string + "</text>")  # corrects wrong format
    id_.append(tree.get("id"))
    date.append(tree.get("date"))
print(id_) # ['32a45', '32a47', '32a48']
print(date) # ['2017-01-01', '2017-01-05', '2017-01-07']
Update, full compact example: According to your original problem described here: How can I build an sqlite table from this xml/txt file using python?
import xml.etree.ElementTree as ET
import pandas as pd
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
cols = ["id","language","date","time","timezone"]
data = [[ET.fromstring(string+"</text>").get(col) for col in cols] for string in strings]
df = pd.DataFrame(data,columns=cols)
      id language        date   time timezone
0  32a45      ENG  2017-01-01  11:00  Eastern
1  32a47      ENG  2017-01-05   1:00  Central
2  32a48      ENG  2017-01-07   3:00  Pacific
Now you can use: df.to_sql()
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
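If pandas is not available, the same rows can be written to SQLite with only the standard library. This is a minimal sketch rather than the method above: the table name `texts` and the in-memory database are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
           '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
           '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
cols = ["id", "language", "date", "time", "timezone"]

# Close each fragment so ElementTree accepts it, then read the attributes in column order.
rows = [tuple(ET.fromstring(s + "</text>").get(col) for col in cols) for s in strings]

# ":memory:" keeps the sketch self-contained; pass a file path to persist the data.
con = sqlite3.connect(":memory:")

# "texts" is a hypothetical table name; column names are quoted to avoid keyword clashes.
column_defs = ", ".join('"{}" TEXT'.format(col) for col in cols)
con.execute("CREATE TABLE texts ({})".format(column_defs))
con.executemany("INSERT INTO texts VALUES (?, ?, ?, ?, ?)", rows)
con.commit()

stored = con.execute('SELECT "id", "date" FROM texts ORDER BY "id"').fetchall()
print(stored)  # [('32a45', '2017-01-01'), ('32a47', '2017-01-05'), ('32a48', '2017-01-07')]
```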
Upvotes: 1
Reputation: 701
An easier-to-understand way, using the re module. Here is the code:
l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
import re
ids = []
dates = []
for i in l:
    # Capture the text between the quotes that follow each attribute name.
    ids.append(re.search(r'id="(.+?)"', i, re.M|re.I).group(1))
    dates.append(re.search(r'date="(.+?)"', i, re.M|re.I).group(1))
Output:
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
Upvotes: 0
Reputation: 8378
Not as elegant as @RomanPerekhrest's solution using re, but here it goes:
def extract(lst, kwd):
    out = []
    for t in lst:
        index1 = t.index(kwd) + len(kwd) + 1
        index2 = index1 + t[index1:].index('"') + 1
        index3 = index2 + t[index2:].index('"')
        out.append(t[index2:index3])
    return out
Then
>>> extract(lst, kwd='id')
['32a45', '32a47', '32a48']
Upvotes: 0
Reputation: 1468
Besides the other, better answers, you can also parse the data manually (simpler):
for line in lines:
    id = line[line.index('"')+1:]
    line = id
    line = id[line.index('"')+1:]
    id = id[:id.index('"')]
    print('id: ' + id)
You can then append it to a new list and repeat the same process for the other values, changing only the variable name.
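Repeating that slicing for each field can be wrapped in a small helper. A sketch only: the name `value_of` and the sample `lines` list are hypothetical, and it assumes every attribute is written as name="value" with a space before the name.

```python
def value_of(line, name):
    """Return the text between the quotes that follow ' name="' in line."""
    marker = ' ' + name + '="'
    start = line.index(marker) + len(marker)  # first character of the value
    end = line.index('"', start)              # closing quote of the value
    return line[start:end]

lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids = [value_of(line, 'id') for line in lines]
dates = [value_of(line, 'date') for line in lines]
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```

Including the leading space and the `="` in the marker keeps `time` from matching inside `timezone`. `str.index` raises `ValueError` when an attribute is missing, so a line without the field fails loudly instead of returning a wrong slice.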
Upvotes: 0
Reputation: 1101
As your provided data appears to be broken/partial XML fragments, I would personally try repairing the XML and using the xml.etree module to extract the data. However, if you have the correct XML that your current list came from, it would be easier to use the xml.etree module on that data directly.
An example solution using xml.etree:
from xml.etree import ElementTree as ET
data = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
ids = []
dates = []
for element in data:
    # Append the missing closing tag to repair the fragment
    # into valid XML.
    root = ET.fromstring('{}</text>'.format(element))
    # As we have repaired the XML ourselves, the parsed root
    # is always the desired <text> element.
    ids.append(root.attrib['id'])
    dates.append(root.attrib['date'])
print(ids) # ['32a45', '32a47', '32a48']
print(dates) # ['2017-01-01', '2017-01-05', '2017-01-07']
Upvotes: 0
Reputation: 92854
Simple solution using the re.search() function:
import re
l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
ids, dates = [], []
for i in l:
    ids.append(re.search(r'id="([^"]+)"', i).group(1))
    dates.append(re.search(r'date="([^"]+)"', i).group(1))
print(ids) # ['32a45', '32a47', '32a48']
print(dates) # ['2017-01-01', '2017-01-05', '2017-01-07']
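If more than two attributes are needed, the same idea extends to `re.findall()`, which returns every name/value pair in one pass. A sketch under the assumption that all attribute values are double-quoted; the `records` name is introduced here for illustration.

```python
import re

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
     '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
     '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

# Each match is a (name, value) tuple, so dict() turns one line into one record.
records = [dict(re.findall(r'(\w+)="([^"]*)"', s)) for s in l]

ids = [r["id"] for r in records]
dates = [r["date"] for r in records]
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```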
Upvotes: 5
Reputation: 3851
ids = [i.split(' ')[1].split('=')[1].strip('"') for i in lines]
dates = [i.split(' ')[3].split('=')[1].strip('"') for i in lines]
Here lines is the list from the question. But the file looks strange; if the original file is HTML or XML, there are better ways to get the data.
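One such way that tolerates unclosed tags is the standard library's `html.parser`, which reports each start tag with its attributes and does not depend on their position in the line. A sketch, assuming the fragments are close enough to HTML for the parser to recognise them; the class name `AttrCollector` and the `lines` list are hypothetical.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect the attributes of every <text> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "text":
            self.records.append(dict(attrs))

lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

parser = AttrCollector()
for line in lines:
    parser.feed(line)
parser.close()

ids = [r["id"] for r in parser.records]
dates = [r["date"] for r in parser.records]
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```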
Upvotes: 0