gsb

Reputation: 5640

Accessing parsed HTML data in Python using lists

I have parsed an HTML document in Python and am storing the contents of the body tag in a variable. Below is the code:

import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)
print data

The output of the code above is:

        6          3
    12603        235          1
    37210        363          3
    64618        348          2
        4          4
    80073        560          1
    80560        504          1
    80875        807          1
    80917        636          1

I want to store each line in its own list. I need help doing this, as I am new to Python. Thanks, ghbhatt.

Upvotes: 0

Views: 110

Answers (5)

user647772

#!/usr/bin/env python

data = """6          3
    12603        235          1
    37210        363          3
    64618        348          2
        4          4
    80073        560          1
    80560        504          1
    80875        807          1
    80917        636          1"""

lists = [line.split() for line in data.split("\n")]

print lists

Edit: data.splitlines() is probably more portable than data.split("\n"), since it also recognizes "\r\n" and "\r" line endings.
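
As a rough illustration of the difference between the two calls, here is a small sketch. The sample string is made up and assumes the page was served with Windows-style line endings, which may not be true for the real page.

```python
# Hypothetical sample with "\r\n" line endings and a trailing newline.
sample = "6 3\r\n12603 235 1\r\n"

by_split = sample.split("\n")       # keeps "\r" and adds a trailing ""
by_splitlines = sample.splitlines() # strips the line endings entirely

print(by_split)       # ['6 3\r', '12603 235 1\r', '']
print(by_splitlines)  # ['6 3', '12603 235 1']

# line.split() discards the stray "\r", so the visible difference in the
# final result is the extra empty row produced by split("\n").
rows_split = [line.split() for line in by_split]
rows_splitlines = [line.split() for line in by_splitlines]
print(rows_split)       # [['6', '3'], ['12603', '235', '1'], []]
print(rows_splitlines)  # [['6', '3'], ['12603', '235', '1']]
```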

Upvotes: 2

ptitpoulpe

Reputation: 694

I'm not sure if this is what you want:

[re.findall(r'\d+', line) for line in data.split('\n')]
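
One caveat with the regex approach, shown as a quick sketch: r'\d+' matches runs of digits only, so signs, decimal points and non-numeric tokens are handled differently than with split(). The line below is invented for illustration; the data in the question contains only non-negative integers, so both approaches agree there.

```python
import re

# Hypothetical line, not taken from the page in the question.
line = "  -12   3.5   abc  42"

digits_only = re.findall(r'\d+', line)  # digit runs, sign and dot dropped
tokens = line.split()                   # whitespace-separated tokens

print(digits_only)  # ['12', '3', '5', '42']
print(tokens)       # ['-12', '3.5', 'abc', '42']
```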

Upvotes: 0

Glenn

Reputation: 5786

Don't use regex to parse HTML; see the question "RegEx match open tags except XHTML self-contained tags".

Instead, there are a number of great parsers in Python:

http://www.crummy.com/software/BeautifulSoup/

http://lxml.de/

Use one of those and, in general, getting a list of the contents will just be part of what the library does.
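
To give a feel for the parser-based approach without installing anything, here is a minimal sketch that uses the standard library's html.parser as a stand-in for the libraries linked above (it is not BeautifulSoup or lxml, whose APIs differ). It is written for Python 3, where the module is called html.parser; in Python 2 it is named HTMLParser. The class name BodyTextCollector and the sample markup are made up, and the sketch assumes the numbers sit between the BODY tag and the first HR tag, as the regex in the question implies.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collect text that appears after <body> and before the first <hr>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BODY> and <body> both match.
        if tag == "body":
            self.in_body = True
        elif tag == "hr":
            self.in_body = False

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


# Hypothetical markup standing in for the downloaded page.
sample_html = "<HTML><BODY>\n    6  3\n 12603  235  1\n<HR>footer</BODY></HTML>"

collector = BodyTextCollector()
collector.feed(sample_html)
collector.close()

body_text = "".join(collector.chunks)
rows = [line.split() for line in body_text.splitlines() if line.strip()]
print(rows)  # [['6', '3'], ['12603', '235', '1']]
```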

Upvotes: 3

Mariusz Jamro

Reputation: 31663

Use the split method to split the string into lines, and then each line into columns:

import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)

list_data = []
data_lines = data.split("\n")  # Split the string into a list of lines
for line in data_lines:
    row = line.split()  # Split the line into its whitespace-separated numbers
    list_data.append(row)

for row in list_data:
    print row
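
One thing to watch for, assuming the captured text begins or ends with a newline (plausible if the BODY and HR tags sit on their own lines, but not confirmed here): the loop above would then append empty rows. A small variation that skips them, using a made-up sample in place of the downloaded data:

```python
# Hypothetical sample with a leading and a trailing newline.
data = "\n        6          3\n    12603        235          1\n"

list_data = []
for line in data.split("\n"):
    row = line.split()
    if row:  # an empty or whitespace-only line yields [], which is skipped
        list_data.append(row)

for row in list_data:
    print(row)
```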

Upvotes: 1

eumiro

Reputation: 213005

l = []
for line in data.splitlines():
    l.append(line.split())

or

l = [line.split() for line in data.splitlines()]

l is now:

[['6', '3'],
 ['12603', '235', '1'],
 ['37210', '363', '3'],
 ['64618', '348', '2'],
 ['4', '4'],
 ['80073', '560', '1'],
 ['80560', '504', '1'],
 ['80875', '807', '1'],
 ['80917', '636', '1']]

This stores the data as a list of lists of strings. If you know the values are all integers, you can convert them:

l = []
for line in data.splitlines():
    l.append([int(a) for a in line.split()])

or

l = []
for line in data.splitlines():
    l.append(map(int, line.split()))

or

l = [map(int, line.split()) for line in data.splitlines()]

which creates:

[[6, 3],
 [12603, 235, 1],
 [37210, 363, 3],
 [64618, 348, 2],
 [4, 4],
 [80073, 560, 1],
 [80560, 504, 1],
 [80875, 807, 1],
 [80917, 636, 1]]
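
A note for Python 3 users: there, map() returns a lazy iterator rather than a list, so the map-based variants above would produce a list of map objects. Wrapping the call in list() gives the same result on both versions. A short sketch with a shortened, made-up sample of the data:

```python
# Shortened sample of the data from the question.
data = "6 3\n12603 235 1"

# list(...) forces the map iterator into a real list on Python 3;
# on Python 2 it is a harmless copy.
l = [list(map(int, line.split())) for line in data.splitlines()]
print(l)  # [[6, 3], [12603, 235, 1]]
```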

Upvotes: 2
