Reputation: 5640
I have parsed an HTML document in Python and I am storing the contents of the body tag in a string. Below is the code:
import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)
print data
The output of the above is:
6 3
12603 235 1
37210 363 3
64618 348 2
4 4
80073 560 1
80560 504 1
80875 807 1
80917 636 1
I want to store each line in its own list. I need help doing this, as I am new to Python. Thanks, ghbhatt.
Upvotes: 0
Views: 110
Reputation:
#!/bin/python
data = """6 3
12603 235 1
37210 363 3
64618 348 2
4 4
80073 560 1
80560 504 1
80875 807 1
80917 636 1"""
lists = [line.split() for line in data.split("\n")]
print lists
Edit: data.splitlines() is probably more portable than data.split("\n").
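To illustrate the portability point, here is a minimal sketch comparing the two calls. The sample string with mixed line endings is made up for the demonstration; the page in the question may well use plain "\n" only.

```python
# Hypothetical sample with mixed "\r\n" and "\n" line endings and a trailing newline.
data = "6 3\r\n12603 235 1\n37210 363 3\r\n"

by_newline = data.split("\n")      # keeps "\r" and yields a trailing empty string
by_splitlines = data.splitlines()  # recognizes "\r\n" and drops the trailing empty piece

print(by_newline)     # ['6 3\r', '12603 235 1', '37210 363 3\r', '']
print(by_splitlines)  # ['6 3', '12603 235 1', '37210 363 3']

# line.split() strips the stray "\r", but the trailing empty string still
# produces an extra empty row when split("\n") is used.
rows_newline = [line.split() for line in by_newline]
rows_splitlines = [line.split() for line in by_splitlines]
print(rows_newline)     # [['6', '3'], ['12603', '235', '1'], ['37210', '363', '3'], []]
print(rows_splitlines)  # [['6', '3'], ['12603', '235', '1'], ['37210', '363', '3']]
```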
Upvotes: 2
Reputation: 694
I'm not sure if this is what you want:
[re.findall(r'\d+', line) for line in data.split('\n')]
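For comparison, a small sketch of how this differs from a plain split(). The sample lines are invented for illustration: findall keeps only runs of digits, which helps if stray markup is left on a line, but it also silently drops signs and decimal points.

```python
import re

# Hypothetical line with leftover markup at the end.
line = "12603 235 1<br>"
split_tokens = line.split()
digit_tokens = re.findall(r'\d+', line)
print(split_tokens)  # ['12603', '235', '1<br>']
print(digit_tokens)  # ['12603', '235', '1']

# Caveat: negative or decimal numbers are broken into bare digit runs.
tricky_tokens = re.findall(r'\d+', "-5 3.2")
print(tricky_tokens)  # ['5', '3', '2']
```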
Upvotes: 0
Reputation: 5786
Don't use regex to parse HTML; see "RegEx match open tags except XHTML self-contained tags".
Instead, there are a number of great parsers in Python:
http://www.crummy.com/software/BeautifulSoup/
Use one of those and, in general, getting a list of the contents will just be part of what the library does.
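As a rough sketch of the parser-based approach, the following uses the HTMLParser class from the standard library rather than BeautifulSoup, so it runs without installing anything. The class name BodyTextUntilHr and the sample HTML string are made up for illustration; the real page may be structured differently.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class BodyTextUntilHr(HTMLParser):
    """Collect text found inside <body>, stopping at the first <hr>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_body = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BODY> and <body> both match.
        if tag == "body":
            self.in_body = True
        elif tag == "hr":
            self.done = True

    def handle_data(self, data):
        if self.in_body and not self.done:
            self.chunks.append(data)


# Hypothetical stand-in for the page fetched in the question.
html = "<HTML><BODY>6 3\n12603 235 1\n37210 363 3\n<HR>footer</BODY></HTML>"

parser = BodyTextUntilHr()
parser.feed(html)
parser.close()

body_text = "".join(parser.chunks)
rows = [line.split() for line in body_text.splitlines()]
print(rows)  # [['6', '3'], ['12603', '235', '1'], ['37210', '363', '3']]
```

The splitting step is the same as in the other answers; only the extraction of the body text changes.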
Upvotes: 3
Reputation: 31663
Use the split method to split the string into lines and then each line into columns:
import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)
list_data = []
data_lines = data.split("\n") # Split the string to list of lines
for line in data_lines:
    row = line.split() # Split the line into numbers
    list_data.append(row)
for row in list_data:
    print row
Upvotes: 1
Reputation: 213005
l = []
for line in data.splitlines():
    l.append(line.split())
or
l = [line.split() for line in data.splitlines()]
l is now:
[['6', '3'],
['12603', '235', '1'],
['37210', '363', '3'],
['64618', '348', '2'],
['4', '4'],
['80073', '560', '1'],
['80560', '504', '1'],
['80875', '807', '1'],
['80917', '636', '1']]
This stores the data as a list of lists of strings. If you know the data contains only integers, you can do:
l = []
for line in data.splitlines():
    l.append([int(a) for a in line.split()])
or
l = []
for line in data.splitlines():
    l.append(map(int, line.split()))
or
l = [map(int, line.split()) for line in data.splitlines()]
which creates:
[[6, 3],
[12603, 235, 1],
[37210, 363, 3],
[64618, 348, 2],
[4, 4],
[80073, 560, 1],
[80560, 504, 1],
[80875, 807, 1],
[80917, 636, 1]]
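Two details are worth a sketch. On Python 3, map returns an iterator rather than a list, so it has to be wrapped in list() to get the nested lists shown above. Also, text pulled out of an HTML body often has blank lines, which would otherwise show up as empty rows. The sample data below, with its blank lines, is an assumption made for illustration.

```python
# Hypothetical sample with leading, trailing, and interior blank lines.
data = """
6 3
12603 235 1

4 4
"""

# List comprehension version; "if line.strip()" skips blank lines.
rows = [[int(tok) for tok in line.split()]
        for line in data.splitlines() if line.strip()]

# map() version; list() is required on Python 3 and harmless on Python 2.
rows_map = [list(map(int, line.split()))
            for line in data.splitlines() if line.strip()]

print(rows)      # [[6, 3], [12603, 235, 1], [4, 4]]
print(rows_map)  # [[6, 3], [12603, 235, 1], [4, 4]]
```

Note that int() raises ValueError on any token that is not an integer, so this only works if the assumption of integer-only data holds.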
Upvotes: 2