Mahhos

Reputation: 101

Extract the file path from text

I have a text file that contains various kinds of paths and directories as well as some URLs. I am trying to extract the directory part of each path, excluding URLs and Windows drive paths (C:\).

txt = r'''
\Files\System\ado\

C:\Dir\me\match1\poq!"&\file.txt

http://example/uploads/ssh/

{drive of encrypted files}\FreezedByWizard.README.TXT

%Program Files%\Common Files\System\ado\

/home/user/web/other.longextension
'''

The correct output:

\Files\System\ado\

{drive of encrypted files}\

%Program Files%\Common Files\System\ado\

/home/user/web/

I have tried various regexes, including the ones below, but could not get the correct results.

import re

pattern = re.compile(r'(?:/[^/]+)*',re.I)
# pattern = re.compile(r'\b(\\.+\\|\/.+\/|\%.+\%)(?:[^\/]|\\\/)+?\b',re.I) 
# this one for example prints all subdirectories not the main one!

matches = re.findall(pattern,txt)
print(matches)

Upvotes: 1

Views: 2121

Answers (3)

koks der drache

Reputation: 1849

Since readability counts, I'd suggest not writing your own regular expression and instead using os.path.dirname and urllib.parse.urlparse. The latter reports a scheme for both URLs and file paths starting with C:\, so both can be filtered out.

from os.path import dirname, join
from urllib.parse import urlparse

txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''

result = [dirname(line) for line in txt.split("\n") if line and not urlparse(line).scheme]

The result (on Windows, where os.path treats backslashes as path separators) is:

\Files\System\ado
{drive of encrypted files}
%Program Files%\Common Files\System\ado
/home/user/web

If the trailing (back)slashes are required you can easily add them by using os.path.join.

result = [join(dirname(line), '') for line in txt.split("\n") if line and not urlparse(line).scheme]

Now result contains the following entries:

\Files\System\ado\
{drive of encrypted files}\
%Program Files%\Common Files\System\ado\
/home/user/web\
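Note that os.path.dirname only splits on backslashes when running on Windows. A minimal, portable sketch of the same idea is shown below; it assumes that using ntpath directly (the Windows flavour of os.path, available on every platform) is acceptable, and the helper name parent_dirs is hypothetical, not part of the original answer. It also skips blank lines explicitly.

```python
import ntpath
from urllib.parse import urlparse

txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''

def parent_dirs(text):
    """Return the parent directory of every line that has no URL-like scheme."""
    result = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and anything urlparse assigns a scheme to
        # ("http" for the URL, "c" for the drive-letter path).
        if not line or urlparse(line).scheme:
            continue
        # ntpath applies Windows rules on every OS: both \ and / are separators.
        # Joining with '' appends a trailing backslash.
        result.append(ntpath.join(ntpath.dirname(line), ''))
    return result

print(*parent_dirs(txt), sep='\n')
```

As in the listing above, the POSIX-style entry ends in a backslash, because ntpath.join always appends the Windows separator.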

Upvotes: 2

vs97

Reputation: 5859

I noticed that the lines you want all start with \, {, % or /. Maybe something as simple as this will work for you?

^(?:\\|\{|\%|\/).+(?:\\|/)

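^                start at the beginning of a line (requires re.M)
(?:\\|\{|\%|\/)  non-capturing group listing the allowed first characters: \ { % /
.+               match one or more characters, as many as possible (greedy)
(?:\\|/)         end the match at the last \ or / on the line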

pattern = re.compile(r'^(?:\\|\{|\%|\/).+(?:\\|/)', re.M)
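For completeness, here is a small self-contained sketch that runs this pattern against the sample text from the question. The variable names are only illustrative; the pattern itself is unchanged.

```python
import re

txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''

# re.M makes ^ match at the start of every line, not just the start of the text.
pattern = re.compile(r'^(?:\\|\{|\%|\/).+(?:\\|/)', re.M)

# The pattern has no capturing groups, so findall returns the full matches.
matches = pattern.findall(txt)
print(*matches, sep='\n')
```

Because .+ is greedy, each match runs to the last \ or / on its line, which drops the file name and keeps the directory. The C:\ and http:// lines are skipped simply because they do not start with one of the four allowed characters.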

Upvotes: 5

Xukrao

Reputation: 8634

I'm not sure if I've fully understood your intentions about what exactly you're looking to capture, but the code below should produce your desired output for the examples that you gave.

import re

pattern = re.compile(r'(?:^\\.+\\)|(?:^%.+%\\.+\\)|(?:^{.+}\\(?:.+\\)?)|(?:^/.+/)', re.I | re.M)
matches = re.findall(pattern, txt)
print(*matches, sep='\n')

prints as output:

\Files\System\ado\
{drive of encrypted files}\
%Program Files%\Common Files\System\ado\
/home/user/web/

Clarification of the used regex pattern can be found here.
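As a rough annotation of the pattern, the sketch below rewrites it with re.VERBOSE so that each of the four alternatives carries a comment. The comments are my reading of the pattern, not the answer author's wording, and the name verbose_pattern is hypothetical.

```python
import re

txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''

# Same alternatives as the one-line pattern, spread out with re.X (verbose mode),
# where whitespace and #-comments inside the pattern are ignored.
verbose_pattern = re.compile(r'''
    (?:^\\.+\\)            # line starts with a backslash: keep up to the last backslash
  | (?:^%.+%\\.+\\)        # percent-delimited prefix, then at least one more directory
  | (?:^{.+}\\(?:.+\\)?)   # brace-delimited prefix, optionally followed by directories
  | (?:^/.+/)              # line starts with a slash: keep up to the last slash
''', re.I | re.M | re.X)

matches = verbose_pattern.findall(txt)
print(*matches, sep='\n')
```

Each alternative is anchored with ^, so with re.M only lines beginning with \, %, { or / can match; the drive-letter path and the URL never do.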

Upvotes: 2
