re: matching 'a href' tag

Question

I have a simple program that reads a file from stdin and outputs only the host (example: returning only HOST).

Except that when I run cat sample.html | python program.py, it currently outputs href="google.com

I want it to drop the href=" part and output only google.com, but when I tried removing that part from my regex, the result got even worse. Thoughts?

import re
import sys

s = sys.stdin.read()
lines = s.split('\n')

match = re.search(r'href=[\'"]?([^\'" >]+)', s) #here
if match:
    print match.group(0)

Thank you.

hwnd · Accepted Answer

That is because you print group(0), which is the entire matched text (including the href=" prefix). You want group(1), which holds only what the parentheses in your pattern captured.

if match:
    print match.group(1)
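
To see the difference side by side, here is a minimal sketch using the same pattern as in the question. The sample string is a made-up stand-in for sample.html, the names HREF_RE, sample, whole, target and all_targets are only for illustration, and print() is written in the form that also runs on Python 3.

```python
import re

# Same pattern as in the question; the parentheses form capture group 1.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

# Hypothetical input standing in for the contents of sample.html.
sample = '<a href="google.com">HOST</a> <a href=\'example.org\'>OTHER</a>'

match = HREF_RE.search(sample)
if match:
    whole = match.group(0)   # entire matched text, href=" prefix included
    target = match.group(1)  # only the part captured by the parentheses
    print(whole)
    print(target)

# With a single capture group, findall returns the captured text
# of every match instead of only the first one.
all_targets = HREF_RE.findall(sample)
print(all_targets)
```

If the file contains more than one link, re.search only ever reports the first, so the findall call at the end may be closer to what you want.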
