Reputation: 73
The purpose of this code is to extract all the integers from the text and sum them.
I have been looking for ways to pull all the integers out of a line of text. I saw some solutions suggesting \D and \b, but I have just started with regular expressions and am still unfamiliar with how they fit into my code. Please help :(
import re
import urllib2

data = urllib2.urlopen("http://python-data.dr-chuck.net/regex_sum_179860.txt")
aList = []
for word in data:
    data = (str(w) for w in data)
    s = re.findall(r'[\d]+', word)
    if len(s) != 1: continue
    num = int(s[0])
    aList.append(num)
print aList
Upvotes: 6
Views: 290
Reputation: 180481
You can do it line by line: call findall with the pattern "\d+" (one or more digits) on each line and extend your output list with the results:
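import re
import urllib2

data = urllib2.urlopen("http://python-data.dr-chuck.net/regex_sum_179860.txt")
r = re.compile(r"\d+")  # raw string, so the backslash reaches the regex engine unchanged
l = []
for line in data:
    # findall returns every run of digits on the line; convert each to int
    l.extend(map(int, r.findall(line)))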
Output:
[3524, 9968, 6177, 3133, 6508, 7940, 3738, 1112, 6179, 4570, 6127, 9150, 9883, 418, 3538, 2992, 8527, 1150, 2049, 2834, 2630, 3840, 2638, 3800, 9144, 5866, 6742, 588, 6918, 7802, 8229, 7947, 8992, 1339,
2119, 846, 3820, 4070, 9356, 9708, 3238, 9380, 5572, 9491, 3038,
7434, 7771, 288, 8632, 3962, 9136, 8106, 7295, 3699, 4136, 3459, 8120,
6018, 8963, 5779, 3635, 3984, 4850, 9633, 2588, 7631, 9591, 1067,
7182, 1301, 8041, 1361, 5425, 8326, 7094, 8155, 2581, 7199, 6125, 42]
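To see why collecting every match per line matters, here is a minimal sketch on made-up lines. The sample text and the names `sample_lines` and `numbers` are illustrative assumptions, not taken from the real file, and the code is written for Python 3. A line can hold zero, one, or several numbers, so a check such as `len(s) != 1` would silently skip the lines that hold several.

```python
import re

# Hypothetical sample lines standing in for the downloaded file.
sample_lines = [
    "Why should you learn to write programs? 7746",
    "12 1929 8827",
    "no digits on this line",
    "mixed3 in words 42",
]

pattern = re.compile(r"\d+")  # one or more consecutive digits

numbers = []
for line in sample_lines:
    # findall returns a (possibly empty) list of digit strings for this line
    numbers.extend(int(match) for match in pattern.findall(line))

print(numbers)       # [7746, 12, 1929, 8827, 3, 42]
print(sum(numbers))  # 18559
```

Note that the digit embedded in "mixed3" is picked up as well, since the pattern does not require word boundaries.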
You could also use str.isdigit:
l = []
for line in data:
    l.extend(map(int, (w for w in line.split() if w.isdigit())))
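The two approaches are not always interchangeable. The sketch below (Python 3, with a made-up line) suggests where they can differ: `str.isdigit` only accepts whitespace-separated tokens made up entirely of digits, while the regex also finds digits next to punctuation or inside words.

```python
import re

# Hypothetical line: one number has trailing punctuation, one sits inside a word.
line = "rooms 12, floor 3 and code abc45"

# Tokens are "rooms", "12,", "floor", "3", "and", "code", "abc45";
# only "3" consists solely of digits.
by_isdigit = [int(w) for w in line.split() if w.isdigit()]

# The regex picks up every run of digits wherever it appears.
by_regex = [int(m) for m in re.findall(r"\d+", line)]

print(by_isdigit)  # [3]
print(by_regex)    # [12, 3, 45]
```

Which behaviour is right depends on the data; for a file where numbers may touch other characters, the regex is probably the safer choice.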
If you just want to sum the numbers, you don't need to store them at all:
print(sum(sum(map(int,(w for w in line.split() if w.isdigit()))) for line in data))
Output:
435239
Or using a regex:
print(sum(sum(map(int,r.findall(line))) for line in data))
Probably irrelevant in your case, but if you want to avoid the intermediate lists that map creates in Python 2, you could use itertools.imap:
from itertools import imap
print(sum(sum(imap(int,r.findall(line))) for line in data))
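For readers on Python 3: `itertools.imap` no longer exists there, because the built-in `map` is already lazy. `findall` still builds one list per line, though. If that matters, `re.finditer` yields matches one at a time. A minimal sketch, using a made-up list `lines` in place of the downloaded data:

```python
import re

pattern = re.compile(r"\d+")

# Hypothetical stand-in for the lines of the downloaded file.
lines = ["a 10 b 20", "nothing here", "30"]

# finditer yields match objects lazily, so no per-line list is built;
# m.group() is the matched digit string.
total = sum(int(m.group()) for line in lines for m in pattern.finditer(line))

print(total)  # 60
```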
Upvotes: 4
Reputation: 369334
You need to call read on the return value of urllib2.urlopen; that return value is not a string, but a connection object (a file-like object).
Apply re.findall to the data.
The [] around \d are not necessary.
import re
import urllib2
data = urllib2.urlopen("http://python-data.dr-chuck.net/regex_sum_179860.txt").read()
int_list = map(int, re.findall(r'\d+', data))
>>> int_list
[3524, 9968, 6177, 3133, 6508, 7940, 3738, 1112, 6179, 4570, 6127, 9150,
9883, 418, 3538, 2992, 8527, 1150, 2049, 2834, 2630, 3840, 2638, 3800,
9144, 5866, 6742, 588, 6918, 7802, 8229, 7947, 8992, 1339, 2119, 846,
3820, 4070, 9356, 9708, 3238, 9380, 5572, 9491, 3038, 7434, 7771, 288,
8632, 3962, 9136, 8106, 7295, 3699, 4136, 3459, 8120, 6018, 8963, 5779,
3635, 3984, 4850, 9633, 2588, 7631, 9591, 1067, 7182, 1301, 8041, 1361,
5425, 8326, 7094, 8155, 2581, 7199, 6125, 42]
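Searching the whole text at once should give the same numbers as going line by line, because `\d` never matches a newline and so a run of digits cannot span two lines. A small Python 3 sketch on a made-up string (`text` is a stand-in for what `.read()` returns, not the real file):

```python
import re

# Hypothetical text standing in for the result of .read()
text = "first 11 and 22\nsecond line\nthird 33\n"

# One findall over the whole text
whole = [int(n) for n in re.findall(r"\d+", text)]

# The same pattern applied to each line separately
per_line = [int(n) for line in text.splitlines() for n in re.findall(r"\d+", line)]

print(whole)              # [11, 22, 33]
print(whole == per_line)  # True
```

Reading everything at once is simpler, at the cost of holding the full text in memory.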
Upvotes: 5
Reputation: 6574
Since you mentioned you wanted to sum all integers, this will work in Python 3 (urllib2 has been split across several modules in Python 3, named urllib.request and urllib.error):
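from urllib import request
import re

data = request.urlopen("http://python-data.dr-chuck.net/regex_sum_179860.txt")
result = 0
for word in data:
    # Each item is a bytes line in Python 3; decode it instead of calling str(),
    # which would return the repr (b'...') of the bytes object.
    result += sum(int(x) for x in re.findall(r'\d+', word.decode('utf-8')))
print(result)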
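One thing to watch in Python 3 is that the response yields bytes, not str. Calling `str()` on a bytes object returns its repr, and escape sequences such as `\xc3` in that repr contain digits of their own. The sketch below uses `io.BytesIO` as a stand-in for the response object; the sample bytes and the UTF-8 encoding are assumptions.

```python
import io
import re

# io.BytesIO stands in for the response: iterating it yields bytes lines.
fake_response = io.BytesIO(b"one 100\ntwo 200 and 300\nnone here\n")

total = 0
for raw_line in fake_response:
    line = raw_line.decode("utf-8")  # bytes -> str; the encoding is an assumption
    total += sum(int(n) for n in re.findall(r"\d+", line))

print(total)  # 600

# Why decoding matters: non-ASCII bytes show up as \x.. escapes in the repr.
raw = b"caf\xc3\xa9 5"
print(re.findall(r"\d+", str(raw)))             # ['3', '9', '5']
print(re.findall(r"\d+", raw.decode("utf-8")))  # ['5']
```

For a plain ASCII file both routes give the same sum, but decoding avoids the surprise when the text is not pure ASCII.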
Upvotes: 1