Thomas

Reputation: 4719

regex to extract part of filename

I want to extract part of a filename that is contained in an XML string.

Sample

<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG"  valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG"  valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG"  valign="top"/>
</assets>

I want to match and retrieve the 560PEgnR portion from all entries, regardless of the filename

So far I have

/assets/(.*)/*"

But it doesn't do what I want

Any help appreciated

Thanks

Upvotes: 2

Views: 1265

Answers (5)

Acorn

Reputation: 50497

Properly parsing the XML and avoiding the unnecessary use of regex:

from lxml import etree

xml = """
<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG"  valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG"  valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG"  valign="top"/>
</assets>
"""

xmltree = etree.fromstring(xml)

for media in xmltree.iterfind(".//media"):
    path = media.get('img')
    print(path.split('/')[-2])  # second-to-last path segment, i.e. the folder name

Gives:

560PEgnR
560PEgnR
560PEgnR
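If lxml is not installed, roughly the same thing can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration: the sample is shortened to two entries, and the names root and folders are made up for the example.

```python
import xml.etree.ElementTree as ET

xml = """
<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG"  valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG"  valign="top"/>
</assets>
"""

# strip() removes the blank line before the root element
root = ET.fromstring(xml.strip())

# Take the img attribute of every <media> element and keep the
# second-to-last path segment (the folder name).
folders = [media.get("img").split("/")[-2] for media in root.iter("media")]
print(folders)
```

This assumes every media element has an img attribute; a missing attribute would make get() return None and the split would fail.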

Upvotes: 1

ghostdog74

Reputation: 342363

A non-regex approach

>>> string="""
... <assets>  
... <media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG"  valign="top"/>
... <media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG"  valign="top"/>
... <media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG"  valign="top"/>
... </assets>                                                                                  
... """           

>>> for line in string.split("\n"):
...     if "/assets/" in line:
...         print(line.split("/assets/")[-1].split("/")[0])
...
560PEgnR
560PEgnR
560PEgnR

Upvotes: 2

hsz

Reputation: 152216

You should try with:

/assets/(.*?)/.*

.* is greedy, but adding ? makes it lazy, so the group stops at the first /.
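A quick way to see the difference, sketched in Python with the re module (the variable names line, greedy and lazy are only for illustration):

```python
import re

line = '<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG"  valign="top"/>'

# Greedy: .* grabs as much as it can, so the group runs up to the LAST "/"
# on the line (the one in the closing "/>").
greedy = re.search(r'/assets/(.*)/', line).group(1)

# Lazy: .*? grabs as little as it can, so the group ends at the FIRST "/".
lazy = re.search(r'/assets/(.*?)/', line).group(1)

print(repr(greedy))
print(repr(lazy))
```

The lazy version yields just the folder name, while the greedy one drags in the filename and the rest of the tag.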

Upvotes: 3

alex

Reputation: 490233

Alternatively...

/assets/([^/]+)/
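Where the + sits relative to the parentheses matters when the goal is to capture the folder name. A small Python sketch of the two placements (path, last_char and whole are illustrative names):

```python
import re

path = "/assets/560PEgnR/kVvNKfOX7w9tf7.JPG"

# Quantifier outside the group: the group is repeated once per character,
# and only the last repetition is kept as the captured value.
last_char = re.search(r"/assets/([^/])+/", path).group(1)

# Quantifier inside the group: the whole run of non-slash characters
# is captured in one go.
whole = re.search(r"/assets/([^/]+)/", path).group(1)

print(last_char)
print(whole)
```

Both patterns match the same text overall; they differ only in what ends up in group 1.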

Upvotes: 4

Stephan

Reputation: 7388

There are several alternatives. Your mistake is that .* also matches '/', so the group runs past the folder name. Either make it non-greedy (as hsz proposed above) or exclude '/' from the matching group, like this: /assets/([^/]*).*
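As a rough sketch of the second option applied to the whole sample string in Python: with re.findall the trailing .* can be dropped, since only the group is returned. The names names and unique are made up for the example.

```python
import re

xml = """
<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG"  valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG"  valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG"  valign="top"/>
</assets>
"""

# With a single capturing group, findall returns the group's text for
# every match in the string: one folder name per media entry.
names = re.findall(r"/assets/([^/]*)", xml)

# Collapse duplicates if only the distinct folder names are needed.
unique = sorted(set(names))

print(names)
print(unique)
```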

Upvotes: 2
