Reputation: 1088
Suppose I have the following text:
test1 = '\n\nDisclaimer ...........................\t10\n\nITOM - IT Object Model ...............\t11\n\nDB – Datenbank Model..................\t11\n\nDB - Datenbank Model - Views .........\t12'
which looks like:
Disclaimer ........................... 10
ITOM - IT Object Model ............... 11
DB – Datenbank Model.................. 11
DB - Datenbank Model - Views ......... 12
I want to make a list of the contents such that I get:
['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views' ]
so I do the following:
re.findall(r'^[a-zA-Z\%\$\#\@\!\-\_]\S*', test1, re.MULTILINE)
which returns:
['Disclaimer', 'ITOM', 'DB', 'DB']
I wonder why my regex doesn't pick up the words after the - ?
Upvotes: 2
Views: 410
Reputation: 8868
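I'd suggest an alternative approach with a different regex: remove the unwanted characters instead of matching the wanted ones, which is easier in your case.
See below:
contents = [line for line in re.sub(r"\s?\.+\s+\d+\b", "", test1).splitlines() if line]
This produces the list of contents you want. The pattern has no ^ or $ anchors, so re.MULTILINE is not needed. If you do pass flags to re.sub, pass them as flags=..., because the fourth positional argument is count. The "if line" filter drops the blank lines that splitlines() returns between entries.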
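A minimal, self-contained sketch of this substitution approach, using the sample string from the question. The intermediate name stripped is only illustrative.

```python
import re

# Sample text from the question.
test1 = (
    "\n\nDisclaimer ...........................\t10"
    "\n\nITOM - IT Object Model ...............\t11"
    "\n\nDB – Datenbank Model..................\t11"
    "\n\nDB - Datenbank Model - Views .........\t12"
)

# Remove the dot leader and the page number from every entry in one pass.
# The fourth positional argument of re.sub is count, not flags, so any
# flags would have to be passed by keyword (none are needed here).
stripped = re.sub(r"\s?\.+\s+\d+\b", "", test1)

# splitlines() also yields the empty lines between entries; drop them.
contents = [line for line in stripped.splitlines() if line]
print(contents)
```

On this input, contents holds the four entry titles without their dots or page numbers.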
Upvotes: 1
Reputation: 626845
You can use either a non-regex or a regex approach here:
[line.split('...')[0].strip() for line in test1.splitlines() if line.strip()]
[re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in test1.splitlines() if line.strip()]
re.findall(r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$', test1, re.M)
See the Python demo.
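Notes:
The first line splits each non-blank line at the first ... and keeps the stripped text before it.
The second line uses re.sub to remove the trailing dots, whitespace, and page number from each non-blank line.
Or, if you prefer the fully regex approach (see the third line of code in the above snippet), you can use re.findall with the ^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$ pattern: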
^ - start of a line
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
[^\S\n]* - zero or more horizontal whitespace chars
\.+ - one or more dots
[^\S\n]* - zero or more horizontal whitespace chars
\d+ - one or more digits
[^\S\n]* - zero or more horizontal whitespace chars
$ - end of line.
See the regex demo.
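The breakdown above can also be written directly into the pattern with re.VERBOSE. This is only a sketch on the sample string from the question. The name TOC_ENTRY is made up for illustration, and the last call reruns the pattern from the question for comparison.

```python
import re

# Sample text from the question.
test1 = (
    "\n\nDisclaimer ...........................\t10"
    "\n\nITOM - IT Object Model ...............\t11"
    "\n\nDB – Datenbank Model..................\t11"
    "\n\nDB - Datenbank Model - Views .........\t12"
)

# Same pattern as above, spelled out part by part (TOC_ENTRY is a made-up name).
TOC_ENTRY = re.compile(
    r"""
    ^            # start of a line (needs re.MULTILINE)
    (.*?)        # group 1: the entry title, as short as possible
    [^\S\n]*     # optional horizontal whitespace
    \.+          # the dot leader
    [^\S\n]*     # optional horizontal whitespace
    \d+          # the page number
    [^\S\n]*     # optional trailing horizontal whitespace
    $            # end of the line
    """,
    re.VERBOSE | re.MULTILINE,
)

# With a single capturing group, findall returns the group 1 strings.
titles = TOC_ENTRY.findall(test1)
print(titles)

# The pattern from the question, unchanged, for comparison.
first_words = re.findall(r'^[a-zA-Z\%\$\#\@\!\-\_]\S*', test1, re.MULTILINE)
print(first_words)
```

This also shows why the original pattern stops early: \S* matches only non-whitespace characters, so each match ends at the first space and never reaches the words after the -. The lazy (.*?) can cross spaces, and it only stops once the dot leader and page number can match.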
Upvotes: 2