Reputation: 619
I am trying to extract URLs from a .txt file using regex (all the URLs end with .jpeg). This is my regex:
import re
output = re.findall('(http)(.*?)(jpeg)', text)
but my output looks like this:
('http', '://d1spq65clhrg1f.cloudfront.net/uploads/image_request/image/182/182382/182382534/cloudsight.', 'jpeg')
How can I avoid having the commas dividing the matches?
Upvotes: 0
Views: 573
Reputation: 397
import re
with open("urls.txt") as f:
    # One capturing group, so findall returns each URL as a single string
    urls = re.findall(r'(https?://.*?\.jpeg)', f.read())
print(urls)
Upvotes: 0
Reputation: 61
Try this:
import re
output = re.findall('(http.*?jpeg)', text)
Output:
['http://d1spq65clhrg1f.cloudfront.net/uploads/image_request/image/182/182382/182382534/cloudsight.jpeg']
This makes re.findall capture a single group, "http.*?jpeg", instead of the three groups in your regex, so each match comes back as one string rather than a tuple of three parts.
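To illustrate how the number of capturing groups changes what re.findall returns, here is a minimal sketch. The sample text and the example.com URLs are made up for the demonstration, and the stricter pattern with `\S+?` and an escaped dot is one possible variant, not something from the question.

```python
import re

# Hypothetical sample text; the URLs are placeholders for illustration only.
text = "see http://example.com/a.jpeg and https://example.com/b.jpeg here"

# Three capturing groups: findall returns one tuple per match.
three_groups = re.findall(r'(http)(.*?)(jpeg)', text)

# One capturing group: findall returns that group's text as a plain string.
one_group = re.findall(r'(http.*?jpeg)', text)

# No capturing groups: findall returns the whole match as a string.
# \S+? keeps a match from running across whitespace, and \. matches a literal dot.
no_group = re.findall(r'https?://\S+?\.jpeg', text)

# A non-capturing group (?:...) groups without adding a tuple element.
non_capturing = re.findall(r'(?:https|http)://\S+?\.jpeg', text)

print(three_groups)
print(one_group)
print(no_group)
print(non_capturing)
```

The commas in the question's output are just the tuple separators between the three groups; with one group or none, each match is a single string.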
Upvotes: 1
Reputation: 6088
output = re.findall(r'https?:.*?\.jpeg', text)
Example
import re
text=" asdd adf sdf sf http://d1spq65clhrg1f.cloudfront.net/uploads/image_request/image/182/182382/182382534/cloudsight.jpeg asfd ads f ads asdfadfasf asd asdf asdf asdf as"
output = re.findall(r'https?:.*?\.jpeg', text)
print(output)
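Output:
['http://d1spq65clhrg1f.cloudfront.net/uploads/image_request/image/182/182382/182382534/cloudsight.jpeg']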
Upvotes: 0
Reputation: 708
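I am not sure if this is what you are looking for, but you can join the three groups of each match back into one string:
import re
# findall returns a list of tuples here, so join each tuple separately
output = ["".join(parts) for parts in re.findall('(http)(.*?)(jpeg)', text)]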
Upvotes: 0