Reputation: 18705
I'm trying to download a txt file which you can find here. Downloading the file is not a problem:
testfile = urllib.URLopener()
testfile.retrieve(_proxy_list_download_, "proxies.txt")
The problem is that the downloaded file behaves strangely. When I open it in any text editor, I can see the content and the IP addresses, but when I try to print the content to the console, it prints this:
212.3.183.210:8080; 0; 0; anonymous proxy; Italy; ; a; in); an Jose); ree download proxy IP
And when I try to extract the IP addresses from it, the result is empty:
with open('proxies.txt') as f:
    content = f.read()
ip = re.findall(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", content)
I've already tried another regex:
r'([0-9]+)(?:\.[0-9]+){3}'
This regex returned only 3-digit numbers.
Do you have any idea how to parse those IPs?
EDIT: Here is the text copied and pasted from a text editor, although in the editor everything is on one line:
# http://proxy-ip-list.com/ provides you this fresh txt proxy list to free download proxy IP
# Date: Sat, 27 Jun 2015 12:53:02 +0000
39.166.95.9:8123; 0; 0; high-anonymous; China;
178.189.92.118:3129; 16.83; 405; high-anonymous; Austria;
198.2.202.33:8090; 8.05; 884; anonymous; United States (CA, San Jose);
171.96.152.89:8080; 0; 0; anonymous; Thailand;
153.149.104.76:80; 0; 0; anonymous; Japan (Tokyo);
106.187.52.191:80; 0; 0; anonymous proxy; Japan;
194.187.214.204:80; 0.91; 6374; anonymous proxy; Finland;
59.78.160.247:8080; 0; 0; anonymous; China (Shanghai);
61.156.3.166:80; 1.12; 1449; anonymous proxy; China (Jinan);
221.238.140.164:8080; 1.39; 257; anonymous; China (Tianjin);
117.178.157.107:8123; 8.44; 847; high-anonymous; China;
39.166.205.95:8123; 0; 0; high-anonymous; China;
117.163.216.8:8123; 4.21; 1577; high-anonymous; China;
189.31.143.250:3128; 0; 0; high-anonymous; Brazil;
183.89.84.82:8080; 0; 0; anonymous proxy; Thailand;
183.88.41.42:8080; 0; 0; anonymous; Thailand;
212.3.183.210:8080; 0; 0; anonymous proxy; Italy;
Upvotes: 1
Views: 1359
Reputation: 174696
You need to remove the anchors ^ and $. Without re.MULTILINE they match only at the start and end of the whole string, and in any case no line consists of just an IP address.
ip = re.findall( r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", content )
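To make the difference visible, here is a minimal sketch that runs both patterns over a few lines copied from the question. The sample string and the variable names anchored and ips are only illustrative; with the real file you would use the content read from proxies.txt instead.

```python
import re

# Sample lines copied from the question's proxy list (illustrative only).
content = (
    "# Date: Sat, 27 Jun 2015 12:53:02 +0000\n"
    "39.166.95.9:8123; 0; 0; high-anonymous; China;\n"
    "178.189.92.118:3129; 16.83; 405; high-anonymous; Austria;\n"
)

# Anchored pattern: ^ and $ require the whole string to be a single IP address.
anchored = re.findall(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", content)

# Word-boundary pattern: matches an IP address anywhere in the text.
ips = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", content)

print(anchored)  # []
print(ips)       # ['39.166.95.9', '178.189.92.118']
```

Note that the port numbers and values such as 16.83 are not picked up, because the pattern requires four dot-separated groups of digits.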
Your second regex,
r'([0-9]+)(?:\.[0-9]+){3}'
returns only the first number of each address because only that part is inside a capturing group. When a pattern contains capturing groups, re.findall returns the captured groups instead of the full matches; it returns the full matches only when the pattern has no groups. Removing the capturing group (or making it non-capturing) gives the desired output:
r'\b[0-9]+(?:\.[0-9]+){3}\b'
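The effect of the capturing group on re.findall can be checked with a short sketch like the one below. The sample line is taken from the question's list; the variable names grouped and whole are just placeholders for this illustration.

```python
import re

# One line from the question's proxy list.
line = "198.2.202.33:8090; 8.05; 884; anonymous; United States (CA, San Jose);"

# With a capturing group, findall returns only what the group captured,
# which here is the first number of the address.
grouped = re.findall(r"([0-9]+)(?:\.[0-9]+){3}", line)

# Without a capturing group, findall returns the whole match.
whole = re.findall(r"\b[0-9]+(?:\.[0-9]+){3}\b", line)

print(grouped)  # ['198']
print(whole)    # ['198.2.202.33']
```

The (?:...) group in the second pattern is non-capturing, so it does not change what re.findall returns.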
Upvotes: 4