Reputation: 31953
In the following example:
"noun 1 left and right sides 左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum. 2 [after a numeral] about; or so 八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan. 3 those in close attendance; retinue 屏退左右 Píng tuì zuǒyòu order one's attendants to clear out verb master; control; influence 左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others. adverb dialect anyway; anyhow; in any case 左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you."
I would like to split the string by part of speech (noun, adjective, adverb, etc.) and, where a part of speech has several numbered senses, by sense number as well.
So the final result should be:
noun
["left and right sides", "左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum."]
["[after a numeral] about; or so", "八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan."]
["those in close attendance; retinue", "屏退左右 Píng tuì zuǒyòu order one's attendants to clear out"]
verb
["master; control; influence", "左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others."]
adverb
["dialect anyway; anyhow; in any case", "左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you"]
noun, verb, and adverb should be keys, and each value might be a dict. Since noun has three senses here, it should produce three distinct results.
So the first step is to take the sections for noun, adjective, adverb, verb, etc. and store them in variables. But I fail to get the relevant part of the string. For example:
re.findall("(noun|verb|adverb|adjective)", s)
This returns ['noun', 'verb', 'adverb'], as it only matches the keywords themselves.
So I added .+ to make it re.findall("(noun|verb|adverb|adjective).+", s) and capture the text after noun, but then it consumed everything after noun, including the text after verb and adverb (and returns ['noun']).
So I hit a wall. Is it possible to match each keyword and also get the text that follows it, up to the next keyword?
Upvotes: 1
Views: 72
Reputation: 626728
You may use
(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))
See the regex demo
Details
(?s) - an inline re.DOTALL equivalent
(noun|verb|adverb|adjective) - Group 1: one of the words noun, verb, adverb or adjective
(.*?) - Group 2: any 0+ chars, as few as possible, up to (but excluding) the first occurrence of:
(?=(?:noun|verb|adverb|adjective|$)) - either noun, verb, adverb, adjective or end of string (as it is a positive lookahead, (?=...), the text it matches does not become part of the match).

In Python, use with re.findall:
re.findall(r'(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))', s)
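As a minimal sketch of how the result could be turned into a dict keyed by part of speech: the string below is a shortened stand-in for the question's input, and the \b word boundaries are an addition (not part of the pattern above) so that a keyword is only recognised as a whole word.

```python
import re

# Shortened stand-in for the question's string (an assumption, for brevity).
s = ("noun 1 left and right sides 2 [after a numeral] about; or so "
     "verb master; control; influence "
     "adverb dialect anyway; anyhow; in any case")

POS = r"noun|verb|adverb|adjective"
# Same idea as the pattern above, with \b added around the keywords.
pattern = re.compile(rf"(?s)\b({POS})\b(.*?)(?=\b(?:{POS})\b|$)")

# findall returns (label, body) tuples because the pattern has two groups.
sections = {label: body.strip() for label, body in pattern.findall(s)}
```

A plain dict keeps only the last section if the same label occurs twice, so a list of tuples may be the safer container for longer entries.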
Upvotes: 1
Reputation: 16942
This is not a job for a regular expression. What you are trying to match is too variable.
Write a proper grammar for the dictionary entry, as if it were a programming language, and then parse your data according to that grammar.
Like this:
noun, verb, adverb.
As a sometime lexicographer, I would also recommend treating labels like dialect (which are generally drawn from a standard vocabulary) as optional keywords rather than as part of the definition.
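This is not a full grammar, only a rough sketch of the two ideas above: sense numbers and usage labels are treated as tokens of their own rather than as definition text. The LABELS set and both function names are hypothetical, and the rule that sense numbers must run 1, 2, 3, ... in order is an assumption made so that quantities inside the examples (such as "10 yuan") are not mistaken for sense numbers.

```python
import re

# Hypothetical label vocabulary; a real dictionary would publish its own list.
LABELS = {"dialect", "formal", "informal"}

def split_senses(body):
    """Split one part-of-speech section into senses numbered 1, 2, 3, ..."""
    marks, expected = [], 1
    for m in re.finditer(r"(?<!\S)(\d+)(?=\s)", body):
        # Accept a standalone number only if it continues the sequence.
        if int(m.group(1)) == expected:
            marks.append(m)
            expected += 1
    if not marks:
        return [body.strip()]  # unnumbered section: a single sense
    ends = [m.start() for m in marks[1:]] + [len(body)]
    return [body[m.end():end].strip() for m, end in zip(marks, ends)]

def parse_sense(text):
    """Peel an optional leading usage label off a sense."""
    first, _, rest = text.partition(" ")
    if first in LABELS:
        return {"label": first, "text": rest}
    return {"label": None, "text": text}

noun_body = "1 left and right sides 2 about; or so It's worth about 10 yuan. 3 retinue"
senses = [parse_sense(t) for t in split_senses(noun_body)]
adverb = [parse_sense(t) for t in split_senses("dialect anyway; anyhow; in any case")]
```

The sequence check is still a heuristic: an example sentence that happens to contain the next expected number as a standalone word would be split in the wrong place.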
Upvotes: 2
Reputation: 61498
Probably the easiest thing will be to re.split the string on the part-of-speech pattern first: re.split('(noun|adjective|verb|adverb)', s). For the provided input, this includes an empty item at the start, and then the rest alternates between part-of-speech labels and the text in between, which you can then process further.
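A short sketch of that further processing, pairing each label with the text that follows it. The input is a shortened stand-in for the question's string, and the \b word boundaries are an addition to the pattern above.

```python
import re

# Shortened stand-in for the question's string (an assumption, for brevity).
s = "noun 1 left and right sides verb master; control adverb dialect anyway"

# Because the pattern has a capturing group, re.split keeps the labels.
parts = re.split(r"\b(noun|adjective|verb|adverb)\b", s)

# parts[0] is whatever precedes the first label (empty here);
# after that, labels sit at odd indexes and their text at even indexes.
labels = parts[1::2]
bodies = [body.strip() for body in parts[2::2]]
pairs = list(zip(labels, bodies))
```

Keeping the result as a list of pairs preserves the order of the sections and allows a label to appear more than once.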
Upvotes: 1