Reputation: 353
I'm working with a tar.gz archive that contains txt files, and I'm trying to extract the file name from each TarInfo object, whose member.name property looks like this:
aclImdb/test/neg/1026_2.txt
aclImdb/test/neg/1027_5.txt
...
aclImdb/test/neg/1030_4.txt
I've written the following code, which prints the string test/neg/1268_2:
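import re
import tarfile

# Captures everything between the first "/" and the first "."
regex = r'\/((?:[^/]*/).*?)\.'
with tarfile.open("C:\\Users\\Orestis\\Desktop\\aclImdb_v1.tar.gz") as archive:
    for member in archive.getmembers():
        if member.isreg():  # skip directories and other non-regular entries
            m = re.findall(regex, member.name)
            print(m)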
How should I modify the regex to extract only the 1268_2 part of the filenames? Effectively, I want to extract the string after the 3rd occurrence of "/" and before the 1st occurrence of ".".
Upvotes: 0
Views: 1198
Reputation: 36
You could hardcode this:
.*?\/.*?\/.*?\/(.*?)\.
More elegant is something along the lines of this:
(.*?\/){3}(.*?)\.
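You can simply change the 3 to match the number of "/" separators in your paths. Note that the part you want is now in the second capture group (match.group(2) in Python), because the repeated (.*?\/) is itself group 1. With re.findall, a pattern with two groups returns a list of tuples, so take the second element of each tuple.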
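For what it's worth, here is a minimal sketch of how the second pattern could be used from Python. The helper name extract_stem is made up for illustration, and the non-capturing variant is my own addition rather than part of the answer above; it only saves you from having to pick group 2.

```python
import re

# The pattern from the answer, unchanged. In Python, "\/" matches a plain "/".
ANSWER_PATTERN = re.compile(r"(.*?\/){3}(.*?)\.")

# Assumed variant: (?:...) makes the repeated part non-capturing,
# so the wanted text becomes the only capture group.
NONCAPTURING_PATTERN = re.compile(r"(?:.*?/){3}(.*?)\.")


def extract_stem(member_name):
    """Return the text after the 3rd "/" and before the next ".", or None."""
    m = ANSWER_PATTERN.match(member_name)
    # Group 1 holds the last repetition of (.*?\/); group 2 is the file stem.
    return m.group(2) if m else None


names = [
    "aclImdb/test/neg/1026_2.txt",
    "aclImdb/test/neg/1027_5.txt",
    "aclImdb/test/neg/1030_4.txt",
]

for name in names:
    print(extract_stem(name))

# With findall, two groups give tuples; one group gives plain strings.
print(ANSWER_PATTERN.findall(names[0]))
print(NONCAPTURING_PATTERN.findall(names[0]))
```

In the loop from the question, you would call extract_stem(member.name) in place of re.findall. Names with fewer than three "/" separators return None, so it is worth checking for that before using the result.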
Upvotes: 2