remy boys

Reputation: 2948

Extract a matching substring in a python string

I'm trying to extract a substring from a large string that matches my pattern.

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'

pattern = 'dumbweb.com'

Here I'm trying to find the string that matches the pattern:

import re

theLink = re.findall(pattern, text)
print(theLink)  # output: ['dumbweb.com']

But I only get back the exact text I searched for. What I want is the full whitespace-delimited token that contains the match.

Desired output:

theLink  # www.dumbweb.com/Dumbo

I tried searching for a similar question but couldn't phrase it right. I also read through the Python regex documentation and still couldn't get the result I'm looking for.

Upvotes: 1

Views: 2228

Answers (5)

Saravanan

Reputation: 911

Your pattern should be

pattern = r"www\.dumbweb\.com\S*"

This matches the link starting at www.dumbweb.com and continuing up to (but not including) the next whitespace character.
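As a minimal sketch of how this pattern behaves, assuming the `text` string from the question and `re.findall` as the search call (the variable name `links` is only illustrative):

```python
import re

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'

# Raw string, so the backslashes reach the regex engine unchanged.
# \S* means "zero or more non-whitespace characters" (same as [^\s]*).
pattern = r"www\.dumbweb\.com\S*"

links = re.findall(pattern, text)
print(links)  # ['www.dumbweb.com/Dumbo']
```

Note that this version hard-codes the `www.` prefix, so it would not pick up the same domain written without it.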

Upvotes: 1

kelyen

Reputation: 242

Probably not the cleanest solution:

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'

pattern = 'dumbweb.com'

for word in text.split():
    if pattern in word:  # find() > 0 would miss a word that starts with the pattern
        print(word)

Upvotes: 1

Jacek Błocki

Reputation: 563

Try this:

re.search(r'dumbweb\.com\S*', text).group()
# matches your string followed by any run of non-whitespace characters: 'dumbweb.com/Dumbo'
# prefix the pattern with \S* to also capture the leading 'www.'

Upvotes: 1

You could try this:

[^ ]*dumbweb\.com[^ ]*

Note that in a regex, an unescaped . matches any character (except a newline, by default). Use \. to match only a literal period.
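To illustrate why the escaping matters, here is a small sketch. The `sample` string is hypothetical and was made up to contain a near-miss (`dumbwebXcom`):

```python
import re

# Hypothetical sample text: "dumbwebXcom" is NOT the domain we want.
sample = 'see www.dumbwebXcom/fake and www.dumbweb.com/Dumbo'

# Unescaped dot: "." matches the "X" as well as a real period.
unescaped = re.findall(r'[^ ]*dumbweb.com[^ ]*', sample)

# Escaped dot: only a literal period is accepted.
escaped = re.findall(r'[^ ]*dumbweb\.com[^ ]*', sample)

print(unescaped)  # ['www.dumbwebXcom/fake', 'www.dumbweb.com/Dumbo']
print(escaped)    # ['www.dumbweb.com/Dumbo']
```

One caveat: `[^ ]` excludes only the space character, so tabs and newlines would still be treated as part of the token. `\S` is the stricter choice if that matters.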

Upvotes: 1

anubhava

Reputation: 784998

You may consider this approach:

import re
text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'
pattern = 'dumbweb.com'

rex = re.compile(r'\b' + r'\S*' + re.escape(pattern) + r'\S*')
print(rex.findall(text))

Output:

['www.dumbweb.com/Dumbo']

Explanation:

  • re.compile(...): compiles a given string regex pattern
  • r'\b': Word boundary
  • r'\S*': Match 0 or more non-whitespace characters
  • re.escape(pattern): Perform regex escape of the given string
  • r'\S*': Match 0 or more non-whitespace characters
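The pieces above could be wrapped in a small reusable helper. This is only a sketch: the function name `find_tokens_containing` is hypothetical, and it builds the same regex as shown above.

```python
import re

def find_tokens_containing(needle, text):
    """Return the non-whitespace runs in `text` that contain `needle` literally."""
    # Word boundary, any non-whitespace prefix, the escaped literal, any suffix.
    rex = re.compile(r'\b' + r'\S*' + re.escape(needle) + r'\S*')
    return rex.findall(text)

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'

first = find_tokens_containing('dumbweb.com', text)
second = find_tokens_containing('.com', text)

print(first)   # ['www.dumbweb.com/Dumbo']
print(second)  # ['www.dumbweb.com/Dumbo', 'www.otherLinks.com...']
```

Because of re.escape, the search string is always treated as literal text, so callers do not need to escape periods or other regex metacharacters themselves.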

Upvotes: 4
