Reputation: 21
I want to extract the Twitter handle from Twitter URLs like these:
1.)https://www.twitter.com/sachin
2.)https://www.twitter.com/@sachin
3.)https://www.twitter.com/@sachin
4.)https://www.twitter.com/sachin?lang=en
The expected output is sachin in every case.
I am using this regex:
import re
match = re.search(r'^(?:.*twitter\.com/@?)(\w{1,15})(?:$|/.*$|,)',twitter_url)
handle = match.group(1)
URL types 1, 2 and 3 give the expected result, but URL type 4 does not match and raises this error:
AttributeError: 'NoneType' object has no attribute 'group'
Upvotes: 1
Views: 1111
Reputation: 14233
Why not use urllib.parse?
from urllib.parse import urlparse # or urlsplit

urls = ['https://www.twitter.com/sachin', 'https://www.twitter.com/@sachin',
        'https://www.twitter.com/@sachin', 'https://www.twitter.com/sachin?lang=en']

for url in urls:
    # .path drops the query string; lstrip removes the leading '/' and optional '@'
    print(urlparse(url).path.lstrip('/@'))
output
sachin
sachin
sachin
sachin
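If the URLs can also carry extra path segments, the same idea can be wrapped in a small helper. This is only a sketch: the name extract_handle and the choice to return None for an empty path are my assumptions, not part of the question.

```python
from urllib.parse import urlparse

def extract_handle(url):
    """Hypothetical helper: return the first path segment without a leading '@'."""
    # urlparse keeps the query string (?lang=en) out of .path
    path = urlparse(url).path
    # Keep only non-empty segments so '/sachin/' and '/sachin' behave the same
    segments = [part for part in path.split('/') if part]
    if not segments:
        return None  # assumption: no path means no handle
    return segments[0].lstrip('@')

print(extract_handle('https://www.twitter.com/sachin?lang=en'))
```

Unlike the one-liner, this takes only the first segment, so anything after the handle in the path is ignored.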
Upvotes: 0
Reputation: 163207
The pattern does not match the 4th example because (\w{1,15}) matches sachin and the next character is ?, while the pattern only allows the end of the string, a /, or a , at that point.
You could optionally match the ? and the rest of the line, or specify all allowed characters using a character class [?/,]:
^.*?\btwitter\.com/@?(\w{1,15})(?:[?/,].*)?$
The pattern matches:
^ Start of string
.*? Match any char except a newline, as few as possible (or use \S*? if there can be no spaces)
\btwitter\.com/@? Match twitter.com/ and an optional @
(\w{1,15}) Capture 1-15 word characters in group 1
(?:[?/,].*)? Optionally match either ? or / or , and the rest of the line
$ End of string
For example:
import re
twitter_urls = [
"https://www.twitter.com/sachin",
"https://www.twitter.com/@sachin",
"https://www.twitter.com/@sachin",
"https://www.twitter.com/sachin?lang=en"
]
for twitter_url in twitter_urls:
    match = re.search(r'^.*?\btwitter\.com/@?(\w{1,15})(?:[?/,].*)?$', twitter_url)
    if match:  # re.search returns None when nothing matches
        print(match.group(1))
Output
sachin
sachin
sachin
sachin
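To avoid the AttributeError from the question entirely, the match check can live inside a function. A minimal sketch, where the names HANDLE_RE and get_handle are mine and returning None on no match is an assumed convention:

```python
import re

# Same pattern as above, compiled once for reuse
HANDLE_RE = re.compile(r'^.*?\btwitter\.com/@?(\w{1,15})(?:[?/,].*)?$')

def get_handle(twitter_url):
    """Hypothetical helper: return the handle, or None if the URL does not match."""
    match = HANDLE_RE.search(twitter_url)
    # Guard against None before calling .group()
    return match.group(1) if match else None

print(get_handle('https://www.twitter.com/sachin?lang=en'))
print(get_handle('https://example.com/sachin'))
```

The caller then only has to test the return value for None instead of handling an exception.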
Upvotes: 1
Reputation: 626690
You can use
r'/@?(\w+)[^/]*$'
See the regex demo.
Details:
/ - a / char
@? - an optional @ char
(\w+) - Group 1: any one or more letters, digits or _ chars
[^/]* - zero or more chars other than /
$ - end of string.
A sample usage with re.search:
match = re.search(r'/@?(\w+)[^/]*$', twitter_url)
if match: # Check if there is a match
    print(match.group(1))
else:
    print("No match") # Action upon no match
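Applied to the four URLs from the question, this could look like the sketch below. The variable names are mine, and collecting None for a non-matching URL is an assumption.

```python
import re

urls = [
    'https://www.twitter.com/sachin',
    'https://www.twitter.com/@sachin',
    'https://www.twitter.com/@sachin',
    'https://www.twitter.com/sachin?lang=en',
]

pattern = re.compile(r'/@?(\w+)[^/]*$')

handles = []
for url in urls:
    m = pattern.search(url)
    # Store None when the URL has no matching last path segment
    handles.append(m.group(1) if m else None)

print(handles)
```

Note that this pattern anchors on the last path segment, so for a URL such as https://www.twitter.com/sachin/status/1 it captures 1 rather than sachin.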
Upvotes: 1