Reputation: 101
I have a text file that contains various kinds of paths and directories as well as some URLs. I am trying to extract the directory part of each path, excluding URLs and Windows paths that start with a drive letter (C:\).
txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''
The correct output:
\Files\System\ado\
{drive of encrypted files}\
%Program Files%\Common Files\System\ado\
/home/user/web/
I have tried various regexes including these ones but I could not get the correct results.
import re
pattern = re.compile(r'(?:/[^/]+)*',re.I)
# pattern = re.compile(r'\b(\\.+\\|\/.+\/|\%.+\%)(?:[^\/]|\\\/)+?\b',re.I)
# this one for example prints all subdirectories not the main one!
matches = re.findall(pattern,txt)
print(matches)
Upvotes: 1
Views: 2121
Reputation: 1849
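Since readability counts, I'd suggest not writing your own regular expression and using os.path.dirname and urllib.parse.urlparse instead. urlparse reports a non-empty scheme both for URLs and for file paths starting with C:\ (the drive letter is parsed as the scheme), so one check filters out both.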
from os.path import dirname, join
from urllib.parse import urlparse
txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''
# skip blank lines, URLs and drive-letter paths (anything urlparse gives a scheme)
result = [dirname(line) for line in txt.splitlines() if line and not urlparse(line).scheme]
The result on Windows, where os.path treats both \ and / as separators, is:
\Files\System\ado
{drive of encrypted files}
%Program Files%\Common Files\System\ado
/home/user/web
If the trailing (back)slashes are required you can easily add them by using os.path.join.
# join(path, '') appends the platform separator; blank lines are skipped
result = [join(dirname(line), '') for line in txt.splitlines() if line and not urlparse(line).scheme]
Now result contains the following entries:
\Files\System\ado\
{drive of encrypted files}\
%Program Files%\Common Files\System\ado\
/home/user/web\
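Two details of the approach above are worth a sketch. The output shown depends on os.path being the Windows implementation, and os.path.join always appends a backslash there, which is why the last entry ends in a backslash rather than a forward slash. Assuming it is acceptable to import ntpath directly (the module behind os.path on Windows), the following variant should behave the same on any platform and keeps whichever separator the line already used. The helper name parent_dir is made up for this illustration.

```python
import ntpath
from urllib.parse import urlparse

txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''

def parent_dir(line):
    """Hypothetical helper: directory part of `line`, keeping its own separator."""
    # ntpath accepts both \ and / as separators, regardless of the host OS.
    head = ntpath.dirname(line)
    # Whatever followed the directory part in the original line is its separator.
    rest = line[len(head):]
    sep = rest[:1] if rest[:1] in ('\\', '/') else ''
    return head + sep

# Skip blank lines, URLs and drive-letter paths (urlparse gives them a scheme).
result = [parent_dir(line) for line in txt.splitlines()
          if line and not urlparse(line).scheme]
print(*result, sep='\n')
```

On the sample text this yields the four directories from the question, with /home/user/web/ ending in a forward slash.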
Upvotes: 2
Reputation: 5859
I noticed that the lines you want all start with \, {, % or /. Maybe something as simple as this will work for you?
^(?:\\|\{|\%|\/).+(?:\\|/)
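^ start at the beginning of a line (needs re.M for multi-line text)
(?:\\|\{|\%|\/) non-capturing group listing the allowed first characters
.+ match one or more characters (greedy, stays on the same line)
(?:\\|/) end the match at the last \ or / on the line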
pattern = re.compile(r'^(?:\\|\{|\%|\/).+(?:\\|/)', re.M)
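To check the pattern against the sample from the question, a short run like the one below may help. It also tries an equivalent spelling with character classes; the name compact is only an illustration and not part of the answer above.

```python
import re

txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''

# Pattern from the answer: allowed first character, then everything
# up to the last \ or / on the same line.
pattern = re.compile(r'^(?:\\|\{|\%|\/).+(?:\\|/)', re.M)
matches = pattern.findall(txt)

# Same idea written with character classes instead of alternations.
compact = re.compile(r'^[\\{%/].+[\\/]', re.M)
compact_matches = compact.findall(txt)

print(*matches, sep='\n')
```

Both patterns return the same four directories for this input. Lines that begin with any other character, such as C:\... or http://..., are skipped because ^ anchors the match to the start of the line.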
Upvotes: 5
Reputation: 8634
I'm not sure if I've fully understood your intentions about what exactly you're looking to capture, but the code below should produce your desired output for the examples that you gave.
import re
pattern = re.compile(r'(?:^\\.+\\)|(?:^%.+%\\.+\\)|(?:^{.+}\\(?:.+\\)?)|(?:^/.+/)', re.I | re.M)
matches = re.findall(pattern, txt)
print(*matches, sep='\n')
prints as output:
\Files\System\ado\
{drive of encrypted files}\
%Program Files%\Common Files\System\ado\
/home/user/web/
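The pattern is an alternation of four anchored cases: a line starting with \ (matched up to its last \), a %...% prefix followed by \ and directories, a {...} prefix followed by \ and optional directories, and a line starting with / (matched up to its last /).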
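For readability, the same four alternatives can be laid out one per line with re.VERBOSE. This is only a sketch of an equivalent spelling, not the pattern as originally posted; re.I is dropped because none of the alternatives contains letters.

```python
import re

txt = r'''
\Files\System\ado\
C:\Dir\me\match1\poq!"&\file.txt
http://example/uploads/ssh/
{drive of encrypted files}\FreezedByWizard.README.TXT
%Program Files%\Common Files\System\ado\
/home/user/web/other.longextension
'''

# In verbose mode, whitespace in the pattern is ignored and # starts a comment.
pattern = re.compile(r'''
      ^\\ .+ \\                  # line starts with a backslash
    | ^% .+ % \\ .+ \\           # %VARIABLE% prefix followed by directories
    | ^\{ .+ \} \\ (?: .+ \\ )?  # {placeholder} prefix, optional directories
    | ^/ .+ /                    # POSIX-style absolute path
''', re.M | re.X)

matches = pattern.findall(txt)
print(*matches, sep='\n')
```

Each alternative is anchored with ^, so with re.M a match can only begin at the start of a line.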
Upvotes: 2