Reputation: 93783
Sorry, I know this is probably a duplicate but having searched for 'python regular expression match between' I haven't found anything that answers my question!
The document I'm searching (which, to be clear, is a long HTML page) has a whole bunch of strings in it (inside a JavaScript function) that look like this:
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};
I want to extract the links (i.e. everything between quotes within these strings) - e.g. /Hidden/SidebySideYellow/dei1=1204970159862
To get the links, I know I need to start with:
re.findall(regexp, doc_string)
But what should regexp be?
Upvotes: 0
Views: 2854
Reputation: 342303
Use a few simple splits
>>> s="link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
>>> s.split("'")
['link: ', '/Hidden/SidebySideGreen/dei1=1204970159862', '};']
>>> for i in s.split("'"):
...     if "/" in i:
...         print(i)
...
/Hidden/SidebySideGreen/dei1=1204970159862
>>>
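The same idea extends to the whole document in one step: after splitting on the quote character, the quoted pieces sit at the odd indexes. This is only a minimal sketch, assuming every single quote in the text belongs to a link and the quotes are balanced; `doc` is a stand-in for the real page text.

```python
# Stand-in for the real page text (assumption: no other single quotes occur).
doc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

# After splitting on ', the pieces alternate: outside, inside, outside, ...
links = doc.split("'")[1::2]

for link in links:
    print(link)
```

If the page contains apostrophes anywhere else, the alternation breaks, and a regular expression is likely the safer choice.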
Upvotes: 0
Reputation: 387557
The answer to your question depends on what the rest of the string may look like. If the strings all have the form link: '<URL>'};
then you can do it very simply with basic string slicing:
myString = "link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
print( myString[7:-3] )
(If you have a single string containing multiple lines like that, you can split it into lines first.)
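For the multi-line case, the slicing can be applied to each line in turn. A rough sketch, assuming every line has exactly the form link: '<URL>'}; with no leading or trailing whitespace; the sample data is taken from the question.

```python
myDoc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

# Drop the 7-character prefix "link: '" and the 3-character suffix "'};"
# from every line.
links = [line[7:-3] for line in myDoc.splitlines()]
print(links)
```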
If it is a bit more complex though, using a regular expression is fine. One example that just looks for the URL inside the quotes would be:
import re
myDoc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""
print( re.findall( "'([^']+)'", myDoc ) )
Depending on what the whole string looks like, you might have to include the link: prefix as well:
print( re.findall( "link: '([^']+)'", myDoc ) )
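To see why the prefix can matter, here is a small sketch on a made-up document that also contains an unrelated quoted string. The `title` line is invented for illustration and does not come from the question.

```python
import re

# Hypothetical document: the first line is an invented, unrelated quoted string.
myDoc = """title: 'Side by side'};
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

# Without the prefix, every quoted string is returned.
everything = re.findall(r"'([^']+)'", myDoc)

# With the prefix, only quoted strings that directly follow "link: " are returned.
only_links = re.findall(r"link: '([^']+)'", myDoc)

print(everything)
print(only_links)
```

The first call returns three strings, including 'Side by side'; the second returns only the two links.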
Upvotes: 3
Reputation: 2417
I'd start with:
regexp = "'([^']+)'"
Then check whether it works. If the only condition is that the string sits on one line between single quotes, it should be good as it is.
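One thing worth checking: `[^']` also matches a newline, so a stray apostrophe earlier in the page can pair up with the opening quote of a link on a later line. A sketch of that failure and one possible fix (excluding `\n` from the character class); the JavaScript comment in `doc` is invented for illustration.

```python
import re

# Hypothetical input: a comment containing an apostrophe precedes the link line.
doc = "// don't edit\nlink: '/Hidden/SidebySideGreen/dei1=1204970159862'};"

# The original pattern lets a match run across the line break.
loose = re.findall(r"'([^']+)'", doc)

# Excluding newlines keeps every match on a single line.
strict = re.findall(r"'([^'\n]+)'", doc)

print(loose)
print(strict)
```

Here `loose` captures the text between the apostrophe in "don't" and the opening quote of the link, while `strict` captures the link itself.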
Upvotes: 1