Reputation: 33
I'm trying to extract the numbers from the response of http://app.lotto.pl/wyniki/?type=dl with the code below:
import requests
import re
url = 'http://app.lotto.pl/wyniki/?type=dl'
p = re.compile(r'[^\d{4}\-\d{2}\-\d{2}]\d+')
response = requests.get(url)
data = re.findall(p, response.text)
print(data)
but instead of ['7', '46', '8', '43', '9', '47']
I'm getting ['\n7', '\n46', '\n8', '\n43', '\n9', '\n47']
How can I get rid of the "\n"?
Upvotes: 2
Views: 5203
Reputation: 627103
Your regex is not appropriate because [^\d{4}\-\d{2}\-\d{2}]\d+ matches any single character other than a digit, {, }, or - (the 4 and 2 are already covered by \d), followed by 1 or more digits. In other words, you turned a sequence into a character class. That negated character class can match a newline. It can match any letter, too, and a lot more. strip will not help in those other cases; you need to fix the regular expression.
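To see why stripping alone is not enough, here is a minimal sketch that runs the original pattern on a made-up string. The sample text is an assumption for illustration only, not the real site response.

```python
import re

# The pattern from the question: a negated character class, then 1+ digits.
original = re.compile(r'[^\d{4}\-\d{2}\-\d{2}]\d+')

# Hypothetical sample text (NOT the real response from the site).
sample = "2016-03-15\n7\n46 x8"

# Each match includes the one non-digit character consumed by the class.
matches = original.findall(sample)
print(matches)

# strip() removes the newlines, but the stray letter stays in the match.
stripped = [m.strip() for m in matches]
print(stripped)
```

The newline and the letter x are both consumed by the class, so whitespace stripping only hides part of the problem.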
Use
r'(?<!-)\b\d+\b(?!-)'
See the regex and IDEONE demo
This pattern matches 1 or more digits (\d+) that are not preceded by a hyphen ((?<!-)) or a word character (\b), and are not followed by a word character (\b) or a hyphen ((?!-)).
Your code will look like:
import requests
import re
url = 'http://app.lotto.pl/wyniki/?type=dl'
p = re.compile(r'(?<!-)\b\d+\b(?!-)')
response = requests.get(url)
data = p.findall(response.text)
print(data)
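If you want to check the pattern without hitting the network, a rough offline sketch follows. The sample string is an assumption about the response shape (a date followed by one number per line); the real page may differ.

```python
import re

# Digits not adjacent to a hyphen and not glued to other word characters.
p = re.compile(r'(?<!-)\b\d+\b(?!-)')

# Hypothetical sample, assumed to resemble the response: a date, then numbers.
sample = "2016-03-15\n7\n46\n8\n43\n9\n47"

# The date parts are skipped because each one touches a hyphen.
numbers = p.findall(sample)
print(numbers)
```

The lookarounds only inspect the neighbouring characters, so the matches themselves contain digits and nothing else.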
Upvotes: 3
Reputation: 11042
You can strip the \n using the strip() function:
data = [x.strip() for x in re.findall(p, response.text)]
I am assuming that \n can appear at the beginning as well as at the end.
Upvotes: 2
Reputation: 3177
Since your numbers are strings, you can use the lstrip() string method. It removes leading whitespace, including newline and carriage return characters, from the left side of the string (hence the "l" in lstrip).
You can try something like
print([item.lstrip() for item in data])
to remove your newlines.
Or you can overwrite data with the stripped version of itself:
data=[item.lstrip() for item in data]
and then simply print(data).
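As a small sketch of how lstrip() differs from strip(), here is a comparison on a hand-written list. The list is an assumption shaped like the output in the question, with one trailing newline added for contrast. Converting with int() is shown as an alternative, since int() ignores surrounding whitespace.

```python
# Hypothetical matches, shaped like the output shown in the question.
data = ['\n7', '\n46', '8\n']

# lstrip() only removes whitespace on the left side.
left_only = [item.lstrip() for item in data]

# strip() removes whitespace on both sides.
both_sides = [item.strip() for item in data]

# int() tolerates surrounding whitespace, so no stripping is needed.
as_ints = [int(item) for item in data]

print(left_only, both_sides, as_ints)
```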
Upvotes: 0