Andre_k

Reputation: 1730

Extract words between a word and a delimiter python

I have extracted some text from a Word document (.doc) and stored it in a variable my_text, such that

my_text[2] = '2 Running Hrs                         -  \tPort M/E RPM  \t-  \t'

Here \t is the delimiter from the document itself. I'm trying to extract the word or character between the phrase 'Running Hrs' and the next '\t' delimiter, so that the output is '-'.
Here is what I tried:

  1. Trial 1

import re
re.search('Running Hrs(.*)\t', my_text[2].strip()).group(1)

Output

 '                         -  \tPort M/E RPM  '

  2. Trial 2

print(re.findall(r'\Running Hrs([^]\t*)\]', str(my_text[2])))

Output

ERROR: error: bad escape \R

Any suggestions?

Upvotes: 2

Views: 281

Answers (4)

The fourth bird

Reputation: 163362

If you want '-' as the result, I would suggest calling strip on the result of group(1) rather than on the whole line.

If \t is the delimiter from the document itself and a line contains no other \t besides the one at the end, calling strip on the whole line removes that tab and the pattern no longer matches.

Instead of the non-greedy .*?, you could use a negated character class, [^\t\r\n]+, which matches one or more characters other than a tab, carriage return, or newline.

Running Hrs([^\t\r\n]+)\t


import re

my_text = '2 Running Hrs                         -  \tPort M/E RPM  \t-  \t'
print(re.search('Running Hrs([^\t\r\n]+)\t', my_text).group(1).strip())

Output

-
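To illustrate the point about strip, here is a small sketch. The line single_tab_line is a made-up example (not from the question) containing only one tab, at the very end; it is an assumption used to show the failure case.

```python
import re

pattern = r'Running Hrs([^\t\r\n]+)\t'

# Hypothetical line with a single tab, at the end (not from the question).
single_tab_line = '2 Running Hrs   -  \t'

# Stripping the whole line first removes the trailing tab,
# so the pattern has no \t left to match against.
stripped_first = re.search(pattern, single_tab_line.strip())

# Searching the untouched line and stripping only the captured group works.
match = re.search(pattern, single_tab_line)
stripped_last = match.group(1).strip()

print(stripped_first)  # None
print(stripped_last)   # -
```

In the line from the question there are several tabs, so stripping the whole line would still leave a match, but it seems safer not to rely on that.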

Upvotes: 0

jlink1988

Reputation: 1

Your regular expression is nearly correct, but .* matches as many characters as possible (greedy behaviour). To match as few characters as possible, make the quantifier non-greedy by appending '?' to it.

Also call .strip() after you have extracted the text between the start pattern and the '\t', to remove the remaining blanks.

my_text[2] = '2 Running Hrs                         -  \tPort M/E RPM  \t-  \t'

import re
re.search('Running Hrs(.*?)\t', my_text[2]).group(1).strip()

see: https://docs.python.org/3/library/re.html

Upvotes: 0

Himanshu

Reputation: 676

You can use something like this in your code:

start_phrase = 'Running Hrs'

start = my_text[2].index(start_phrase)+len(start_phrase)
end = my_text[2].index('\t', start)  # first tab after the start phrase

my_text[2][start:end].strip()
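If you need this for several lines, the same idea can be wrapped in a small helper. This is only a sketch: the function name text_between and the choice to return None when the phrase or delimiter is missing are my own assumptions, not something from the question.

```python
def text_between(line, start_phrase, delimiter='\t'):
    """Return the stripped text between start_phrase and the next delimiter.

    Hypothetical helper: returns None if either part is not found.
    """
    start = line.find(start_phrase)
    if start == -1:
        return None
    start += len(start_phrase)

    # Search for the delimiter only after the start phrase, so an
    # earlier tab in the line is ignored.
    end = line.find(delimiter, start)
    if end == -1:
        return None
    return line[start:end].strip()


line = '2 Running Hrs                         -  \tPort M/E RPM  \t-  \t'
print(text_between(line, 'Running Hrs'))  # -
```

Using str.find instead of str.index avoids a ValueError on lines that do not contain the phrase, which may or may not be what you want.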

Upvotes: 1

rahlf23

Reputation: 9019

Your first attempt is very close to what you want: you just need to add a ? after the * quantifier so that it is non-greedy, like so:

r'Running Hrs(.*?)\t'

Without this ?, the .* quantifier is greedy and will match as much as possible, up to the last \t, whereas the non-greedy form captures only up to the first \t.
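A quick side-by-side on the string from the question (stored here in a plain variable named line, which is my own naming) shows the difference:

```python
import re

line = '2 Running Hrs                         -  \tPort M/E RPM  \t-  \t'

# Greedy: .* runs to the end and backtracks to the LAST tab.
greedy = re.search(r'Running Hrs(.*)\t', line).group(1)

# Non-greedy: .*? stops at the FIRST tab it can.
lazy = re.search(r'Running Hrs(.*?)\t', line).group(1)

print(repr(greedy))  # still contains tabs and 'Port M/E RPM'
print(repr(lazy))    # only spaces and '-'
print(lazy.strip())  # -
```

Note that this searches the line without calling strip on it first, so the greedy capture runs up to the final tab rather than the second one.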

Upvotes: 4
