Reputation: 1730
I have extracted some text from a Word document (.doc) and stored it in a variable my_text,
such that
my_text[2] = '2 Running Hrs - \tPort M/E RPM \t- \t'
Here \t is the delimiter from the document itself.
I'm trying to extract the text between the phrase 'Running Hrs' and the next '\t' delimiter,
so that I get the output '-'.
Here is what I tried:
import re
re.search('Running Hrs(.*)\t', my_text[2].strip()).group(1)
Output
' - \tPort M/E RPM '
print(re.findall(r'\Running Hrs([^]\t*)\]', str(my_text[2])))
Output
ERROR: error: bad escape \R
Any suggestions on this?
Upvotes: 2
Views: 281
Reputation: 163362
If you want - as a result, I would suggest using strip on the result of group(1) instead.
If \t is the delimiter from the document itself, and there are no other occurrences of \t
besides the one at the end, using strip on the whole line will remove that and the pattern will not match.
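A quick check of that point, as a minimal sketch. The line single_tab is a hypothetical example with only a trailing tab (it is not taken from the question's data):

```python
import re

# Hypothetical line whose only tab is the trailing one.
single_tab = '2 Running Hrs - \t'
pattern = r'Running Hrs(.*)\t'

# Stripping the whole line first removes the trailing tab, so nothing matches.
print(re.search(pattern, single_tab.strip()))  # None

# Matching first and stripping the captured group keeps the delimiter in place.
print(re.search(pattern, single_tab).group(1).strip())  # -
```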
Instead of using the non-greedy .*? you could use a negated character class [^\t\r\n], matching any character except a tab or a newline.
Running Hrs([^\t\r\n]+)\t
import re
my_text = '2 Running Hrs - \tPort M/E RPM \t- \t'
print(re.search('Running Hrs([^\t\r\n]+)\t', my_text).group(1).strip())
Output
-
Upvotes: 0
Reputation: 1
Your regular expression is nearly correct, but it matches as many characters as possible (greedy behaviour). To match as few characters as possible, make the quantifier non-greedy by appending '?' to it.
Also call .strip() after you have extracted the text between the start pattern and the '\t' to remove the remaining blanks.
my_text[2] = '2 Running Hrs - \tPort M/E RPM \t- \t'
import re
re.search('Running Hrs(.*?)\t', my_text[2]).group(1).strip()
see: https://docs.python.org/3/library/re.html
Upvotes: 0
Reputation: 676
You can use something like this in your code:
start_phrase = 'Running Hrs'
start = my_text[2].index(start_phrase)+len(start_phrase)
end = my_text[2].index('\t', start)  # first tab after the start phrase
my_text[2][start:end].strip()
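If the phrase or the delimiter might be missing, index raises ValueError. A possible variant, sketched here with str.find and a hypothetical helper named text_between (not part of the original answer), returns None in that case:

```python
def text_between(line, start_phrase, delimiter='\t'):
    """Return the stripped text between start_phrase and the next delimiter, or None."""
    start = line.find(start_phrase)
    if start == -1:
        return None  # start phrase not present
    start += len(start_phrase)
    # Search for the delimiter only after the start phrase.
    end = line.find(delimiter, start)
    if end == -1:
        return None  # no delimiter after the phrase
    return line[start:end].strip()

print(text_between('2 Running Hrs - \tPort M/E RPM \t- \t', 'Running Hrs'))  # -
print(text_between('no match here', 'Running Hrs'))  # None
```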
Upvotes: 1
Reputation: 9019
Your first attempt is very close to what you want; you just need to add a ? after the quantifier to make your capturing group non-greedy, like so:
r'Running Hrs(.*?)\t'
Without this ?, the .* in your capturing group is greedy and will match as much as possible, up to the last \t, whereas a non-greedy expression will only capture up to the first \t.
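To see the difference side by side, here is a small sketch using the line from the question, stored in a variable named line for illustration and left unstripped:

```python
import re

line = '2 Running Hrs - \tPort M/E RPM \t- \t'

# Greedy: .* runs to the end and backtracks to the last tab.
greedy = re.search(r'Running Hrs(.*)\t', line).group(1)
# Non-greedy: .*? stops at the first tab.
lazy = re.search(r'Running Hrs(.*?)\t', line).group(1)

print(repr(greedy))  # ' - \tPort M/E RPM \t- '
print(repr(lazy))    # ' - '
print(lazy.strip())  # -
```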
Upvotes: 4