I have a csv file that contains 10,000s of rows. Each row has 8 columns. One of those columns contains text similar to this:
this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row: http://yetanotherdomain.net
this is a row: https://hereisadomain.org | some_text
I'm currently accessing the data in this column this way:
for row in csv_reader:
    the_url = row[3]
    # this regex is used to find the hrefs
    href_regex = re.findall('(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)
Output from the print statement:
http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text
How do I obtain only the URLs?
http://somedomain.com
http://someanotherdomain.com
http://yetanotherdomain.net
https://hereisadomain.org
Upvotes: 0
Views: 36
Reputation: 26163
Just change your pattern to:
\b(?:http|ftp)s?://\S+
Instead of matching anything with .*, match only non-whitespace characters with \S+. You might want to add a word boundary before your non-capturing group, too.
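A minimal sketch of this pattern applied to the sample rows from the question. The names URL_PATTERN, rows, and extract_urls are made up for illustration; in the real script the text would come from row[3] of the CSV reader.

```python
import re

# Pattern from this answer: word boundary, scheme, then non-whitespace characters.
URL_PATTERN = re.compile(r'\b(?:http|ftp)s?://\S+')

# Sample column values copied from the question (stand-ins for row[3]).
rows = [
    'this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text',
    'this is a row: http://yetanotherdomain.net',
    'this is a row: https://hereisadomain.org | some_text',
]

def extract_urls(text):
    """Return every URL-like match found in one column value."""
    return URL_PATTERN.findall(text)

# Flatten the per-row matches into a single list and print one URL per line.
urls = [url for text in rows for url in extract_urls(text)]
for url in urls:
    print(url)
```

Because \S+ stops at the first whitespace character, each match ends before the " | " separator, so the loop prints the four URLs the question asks for.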
Upvotes: 2
Reputation: 371019
Instead of repeating any character at the end
'(?:http|ftp)s?://.*'
                  ^
repeat any character except a space, to ensure that the pattern will stop matching at the end of a URL:
'(?:http|ftp)s?://[^ ]*'
                  ^^^^
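A short sketch of how the [^ ]* version behaves, assuming the URLs in the column are always followed by a literal space or the end of the string. The tab-separated string is a hypothetical input, not taken from the question; it is only there to show where [^ ]* and \S+ differ.

```python
import re

# Pattern from this answer: stop at the first literal space.
SPACE_PATTERN = re.compile(r'(?:http|ftp)s?://[^ ]*')
# For comparison only: stop at any whitespace character.
WHITESPACE_PATTERN = re.compile(r'(?:http|ftp)s?://\S+')

# First sample row from the question.
sample = ('this is a row: http://somedomain.com | some_text | '
          'http://someanotherdomain.com | some_more_text')
space_matches = SPACE_PATTERN.findall(sample)
print(space_matches)

# Hypothetical input where a tab, not a space, follows the URL.
tabbed = 'http://somedomain.com\tsome_text'
tab_with_space_pattern = SPACE_PATTERN.findall(tabbed)
tab_with_whitespace_pattern = WHITESPACE_PATTERN.findall(tabbed)
print(tab_with_space_pattern)       # the tab and trailing text are included
print(tab_with_whitespace_pattern)  # the match stops at the tab
```

For the data shown in the question both patterns give the same result; the difference only matters if a tab or newline can follow a URL.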
Upvotes: 1