Blaszard

Reputation: 31953

How can I get the text after a keyword with Python's re module, without running into the next keyword?

In the following example:

"noun 1 left and right sides 左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum. 2 [after a numeral] about; or so 八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan. 3 those in close attendance; retinue 屏退左右 Píng tuì zuǒyòu order one's attendants to clear out verb master; control; influence 左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others. adverb dialect anyway; anyhow; in any case 左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you."

I would like to split the string by part of speech (noun, adjective, adverb, etc.) and, where a part of speech has several numbered senses, by number as well.

So the final result should be:

        noun
            ["left and right sides", "左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum."]
            ["[after a numeral] about; or so", "八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan."]
            ["those in close attendance; retinue", "屏退左右 Píng tuì zuǒyòu order one's attendants to clear out"]
        verb
            ["master; control; influence", "左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others."]
        adverb
            ["dialect anyway; anyhow; in any case", "左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you"]

The keywords noun, verb, and adverb should be the keys, and each value might be a dict. Since noun has three senses here, it should have three distinct results.

So the first step is to extract the text belonging to each part of speech (noun, adjective, adverb, verb, etc.) and store it in variables. But I fail to get the text that belongs to a specific keyword. For example:

re.findall("(noun|verb|adverb|adjective)", s)

This returns ['noun', 'verb', 'adverb'], as it matches only the keywords themselves.

So I appended .+, making it re.findall("(noun|verb|adverb|adjective).+", s), to capture the words after noun. But the greedy .+ then swallowed the rest of the string, including everything after verb and adverb, so the call returns just ['noun'].

So I have hit a wall. Is it possible to get the text belonging to each keyword, without the match running on into the next keyword?

Upvotes: 1

Views: 72

Answers (3)

Wiktor Stribiżew

Reputation: 626728

You may use

(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))

See the regex demo

Details

  • (?s) - an inline re.DOTALL equivalent
  • (noun|verb|adverb|adjective) - Group 1: a word noun, verb, adverb or adjective
  • (.*?) - Group 2: any 0+ chars as few as possible, up to (but excluding) the first occurrence of:
  • (?=(?:noun|verb|adverb|adjective|$)) - either noun, verb, adverb, adjective or end of string (as it is a positive lookahead, (?=...), the texts matched do not become part of a match).

In Python, use it with re.findall:

re.findall(r'(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))', s)
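
As a minimal sketch of how the result might be consumed (the shortened sample string and the names `pattern`, `pairs` and `entries` are made up for illustration, not part of the question): because the pattern has two capturing groups, re.findall returns one (keyword, body) tuple per match, which can be turned into a dict.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's entry.
s = ("noun 1 left and right sides 2 about; or so "
     "verb master; control "
     "adverb dialect anyway; anyhow")

pattern = r'(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))'

# Two capturing groups, so findall yields (keyword, body) tuples.
pairs = re.findall(pattern, s)

# Map each keyword to its body, trimming the surrounding spaces.
# Assumes each keyword occurs at most once; a repeat would overwrite the earlier one.
entries = {keyword: body.strip() for keyword, body in pairs}

print(entries)
```

Note that the keywords are not anchored with \b, so a keyword occurring inside a longer word (for example "pronoun") would also start or end a match. Wrapping the alternations in \b...\b may be worth considering if that can happen in your data.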

Upvotes: 1

BoarGules

Reputation: 16942

This is not a job for a regular expression. What you are trying to match is too variable.

Write a proper grammar for the dictionary entry, as if it were a programming language, and then parse your data according to that grammar.

Like this:

  1. Your language keywords are noun, verb, adverb.
  2. Each introduces one unnumbered or several numbered definitions.
  3. Numbering of numbered definitions increases monotonically, so other numbers appearing inside a definition should be treated as part of the definition and not start a new one.

As a sometime lexicographer, I would also recommend treating labels like dialect (which are generally drawn from a standard vocabulary) as optional keywords rather than as part of the definition.
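
The rules above could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not a full grammar: tokens are split on whitespace, `KEYWORDS`, `LABELS` and `parse_entry` are hypothetical names, and the label vocabulary contains just dialect. A number starts a new definition only when it is the next one expected in sequence.

```python
KEYWORDS = ("noun", "verb", "adverb", "adjective")
LABELS = {"dialect"}  # assumed label vocabulary; extend as needed


def parse_entry(text):
    """Parse a flat entry into {part_of_speech: [{"labels": [...], "text": ...}]}."""
    entry = {}
    defs = None    # definitions of the part of speech being read
    expected = 1   # the only number allowed to open a new definition
    for token in text.split():
        if token in KEYWORDS:
            # A keyword opens a section with one (so far empty) definition.
            defs = entry.setdefault(token, [])
            defs.append({"labels": [], "words": []})
            expected = 1
        elif defs is None:
            continue  # ignore anything before the first keyword
        elif token == str(expected):
            # Next number in sequence: start a new definition unless the
            # current one is still empty (the "1" right after a keyword).
            if defs[-1]["labels"] or defs[-1]["words"]:
                defs.append({"labels": [], "words": []})
            expected += 1
        elif token in LABELS and not defs[-1]["words"]:
            # Labels are only recognised at the start of a definition.
            defs[-1]["labels"].append(token)
        else:
            defs[-1]["words"].append(token)
    return {
        pos: [{"labels": d["labels"], "text": " ".join(d["words"])} for d in ds]
        for pos, ds in entry.items()
    }


sample = ("noun 1 left and right sides 2 about 10 yuan 3 retinue "
          "verb master; control adverb dialect anyway; anyhow")
print(parse_entry(sample))
```

The 10 in "about 10 yuan" stays inside the second definition because only 3 is expected at that point. A bare keyword or the expected number appearing as an ordinary word inside a definition would still be misread, which is where a real grammar would need more context.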

Upvotes: 2

Karl Knechtel

Reputation: 61498

Probably the easiest approach is to re.split the string on the part-of-speech pattern first: re.split('(noun|adjective|verb|adverb)', s). For the provided input, the result includes an empty item at the start, and then the rest alternates between part-of-speech labels and the text in between, which you can then process further.
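
A small sketch of that first step, using a shortened stand-in for the question's string (the sample and the names `parts`, `labels`, `bodies` and `pairs` are illustrative only): because the pattern is wrapped in a capturing group, re.split keeps the labels in the result, so the odd indices hold labels and the even indices after the first hold the text that follows each one.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's entry.
s = "noun 1 left and right sides verb master; control adverb dialect anyway"

parts = re.split('(noun|adjective|verb|adverb)', s)

# parts[0] is the (empty) text before the first label;
# after that, labels and bodies alternate.
labels = parts[1::2]
bodies = [body.strip() for body in parts[2::2]]

pairs = list(zip(labels, bodies))
print(pairs)
```

Each body can then be split further, for example on the sense numbers.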

Upvotes: 1
