Life is complex

Reputation: 15629

Only output matching regex pattern

I have a CSV file that contains tens of thousands of rows. Each row has 8 columns. One of those columns contains text similar to this:

this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row:   http://yetanotherdomain.net
this is a row:   https://hereisadomain.org | some_text

I'm currently accessing the data in this column this way:

for row in csv_reader:
    the_url = row[3]

    # this regex is used to find the hrefs
    href_regex = re.findall(r'(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)

Output from the print statement:

http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text

How do I obtain only the URLs?

http://somedomain.com
http://someanotherdomain.com 
http://yetanotherdomain.net
https://hereisadomain.org

Upvotes: 0

Views: 36

Answers (2)

Paolo

Reputation: 26163

Just change your pattern to:

\b(?:http|ftp)s?://\S+

Instead of matching any character with .*, match only non-whitespace characters with \S+, so the match stops at the first space after each URL. You might want to add a word boundary (\b) before your non-capturing group, too.
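
As a minimal sketch of that change applied to the sample rows from the question: the `rows` list below is a hypothetical stand-in for the values read from the CSV column, and compiling the pattern once outside the loop is an optional choice, not something the question requires.

```python
import re

# Hypothetical stand-in for the values read from row[3] of the CSV.
rows = [
    "this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row:   http://yetanotherdomain.net",
    "this is a row:   https://hereisadomain.org | some_text",
]

# \S+ stops at the first whitespace character, so each URL is matched separately.
url_pattern = re.compile(r"\b(?:http|ftp)s?://\S+")

urls = []
for text in rows:
    # findall returns the full match here because the group is non-capturing.
    urls.extend(url_pattern.findall(text))

for url in urls:
    print(url)
```

With these sample rows, the loop should print the four URLs on separate lines, without the trailing ` | some_text` parts.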


Upvotes: 2

CertainPerformance

Reputation: 371019

Instead of repeating any character at the end

'(?:http|ftp)s?://.*'
                  ^

repeat any character except a space, to ensure that the pattern will stop matching at the end of a URL:

'(?:http|ftp)s?://[^ ]*'
                  ^^^^
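
Note that `[^ ]` excludes only the space character, so a tab or newline directly after a URL would still be consumed, whereas `\S` stops at any whitespace. For the sample data in the question, which uses spaces, both behave the same. A small sketch of the difference on a made-up tab-separated string (the `text` value is an assumption for illustration, not data from the question):

```python
import re

# Made-up input where a tab, not a space, follows the URL.
text = "http://somedomain.com\tsome_text"

# [^ ]* excludes only the space character, so it runs through the tab.
space_only = re.findall(r"(?:http|ftp)s?://[^ ]*", text)

# \S+ excludes every whitespace character, so it stops at the tab.
any_whitespace = re.findall(r"(?:http|ftp)s?://\S+", text)

print(space_only)
print(any_whitespace)
```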

Upvotes: 1
