zhuguowei

Reputation: 8487

How to find all substrings of a given format in a string

I have text in this format:

s = '[aaa]foo[bbb]bar[ccc]foobar'

The actual text is a Chinese car review, like this:

【最满意】整车都很满意,最满意就是性价比,...【空间】空间真的超乎想象,毫不夸张,...【内饰】内饰还可以吧,没有多少可以说的...

I want to split it into these parts:

[aaa]foo
[bbb]bar
[ccc]foobar

First I tried:

>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']

This only got the first half of each part.

Then I tried:

>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']

This still only got the first half.

In the end I had to get the two halves separately and then zip them:

>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']

>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']

>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
...    print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar

Is there a regex that could split the string into these parts directly?

Upvotes: 1

Views: 64

Answers (6)

Axalix

Reputation: 2871

I think that if the input string format is strict enough, you can do this without a regex. It may look like a micro-optimisation, but it is interesting as a challenge.

result = map(lambda x: '[' + x, s[1:].split("["))

I checked performance over 1 million iterations; here are my results (in seconds):

result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277
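One caveat when reproducing this on Python 3: `map` returns a lazy iterator there, so timing the `map(...)` line alone measures only the creation of the iterator, not the work. A minimal sketch of an eager version follows; the helper name `split_on_brackets` is made up for illustration, and it assumes the string starts with `[` and that `[` never appears inside a segment.

```python
import timeit

s = '[aaa]foo[bbb]bar[ccc]foobar'

def split_on_brackets(text):
    # Hypothetical helper. Assumes text starts with '[' and that '['
    # only ever marks the start of a new segment.
    # Drop the leading '[', split on the remaining ones, then put the
    # bracket back in front of every piece.
    return ['[' + part for part in text[1:].split('[')]

parts = split_on_brackets(s)
print(parts)

# Rough timing; absolute numbers depend on the machine and Python version.
elapsed = timeit.timeit(lambda: split_on_brackets(s), number=100_000)
print(f'{elapsed:.3f} s for 100,000 calls')
```

The list comprehension builds the full result on every call, so the timing covers the actual splitting.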

Upvotes: 0

Aaditya Ura

Reputation: 12679

All you need is findall with a very simple pattern:

import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))

Output:

['[aaa]foo', '[bbb]bar', '[ccc]foobar']

Detailed solution:

import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))

Explanation:

\[\w+\]\w+


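\[ matches the character [ literally (and \] matches ] literally)
\w matches any word character (letters, digits and underscore; for str patterns in Python 3 this includes Unicode letters, and it equals [a-zA-Z0-9_] only with the re.ASCII flag)
+ is a quantifier: it matches one or more times, as many times as possible, giving back as needed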
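Because `\w` only covers word characters, the text part of each match stops at the first space or punctuation mark. The sketch below illustrates this on a made-up sample string (not from the question) and compares it with a negated character class, which keeps everything up to the next bracket.

```python
import re

# Made-up sample: the text after the first tag contains a comma and a space.
sample = '[aaa]foo, bar[bbb]baz'

# \w+ stops at the comma, so ", bar" is silently dropped.
word_only = re.findall(r'\[\w+\]\w+', sample)
print(word_only)   # ['[aaa]foo', '[bbb]baz']

# A negated class keeps everything up to the next bracket.
up_to_bracket = re.findall(r'\[[^\]]+\][^\[\]]+', sample)
print(up_to_bracket)   # ['[aaa]foo, bar', '[bbb]baz']
```

Which behaviour is wanted depends on the data; for review text with punctuation, the second form is probably closer to what the question needs.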

Upvotes: 0

RomanPerekhrest

Reputation: 92854

One possible approach:

import re

s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)

print(result)

The output:

['[aaa]foo', '[bbb]bar', '[ccc]foobar']

  • \[ or \] - matches the bracket literally
  • [^]]+ - matches one or more characters other than ]
  • [^\[\]]+ - matches one or more characters other than the brackets [ and ]
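The same idea should carry over to the Chinese review text from the question, assuming the real data uses the full-width brackets 【 and 】 exactly as in the sample. This is a sketch with the question's sample copied in; the variable names are illustrative.

```python
import re

# Sample text copied from the question; the "..." are part of that sample.
review = ('【最满意】整车都很满意,最满意就是性价比,...'
          '【空间】空间真的超乎想象,毫不夸张,...'
          '【内饰】内饰还可以吧,没有多少可以说的...')

# 【 and 】 are not regex metacharacters, so they need no escaping.
# [^】]+   : the tag text, anything except the closing bracket
# [^【】]+ : the body, anything up to the next bracket
sections = re.findall(r'【[^】]+】[^【】]+', review)

for section in sections:
    print(section)
```

Each printed line is one tagged section, with its punctuation intact.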

Upvotes: 2

dani herrera

Reputation: 51685

Here it is:

>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']

Explanation:

  • The parentheses define the group to search for. The group:
  • starts with an opening bracket \[ followed by zero or more word characters \w*
  • then has the matching closing bracket \] followed by one or more word characters \w+

Note that you have to escape the brackets with \.
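A side note on the parentheses: with a single group that wraps the whole pattern, `re.findall` returns the same list as it would without the group. Groups start to matter once there are several, because `findall` then returns tuples. A small sketch, with illustrative variable names:

```python
import re

s = '[aaa]foo[bbb]bar[ccc]foobar'

# One group around everything: same result as no group at all.
with_group = re.findall(r'(\[\w*\]\w+)', s)
without_group = re.findall(r'\[\w*\]\w+', s)
print(with_group == without_group)   # True

# Two groups: findall returns (tag, text) tuples instead of strings.
pairs = re.findall(r'(\[\w*\])(\w+)', s)
print(pairs)   # [('[aaa]', 'foo'), ('[bbb]', 'bar'), ('[ccc]', 'foobar')]
```

The tuple form can be handy if the tag and the text need to be processed separately.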

Upvotes: 1

Mario R.

Reputation: 101

\[.*?\][a-zA-Z]*

This regex should capture anything that starts with [something] followed by any letters from a to z, in upper or lower case.

You can experiment on regex101 to try out different patterns; it makes it easy to build your own regex.

Upvotes: 0

eLRuLL

Reputation: 18799

I think this could work:

r'\[.+?\]\w+'
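For context on why the attempts in the question returned only the tags: a lazy `.*?` at the very end of a pattern is free to match nothing, so it does. One possible fix is to give it something to stop at, such as a lookahead for the next `[` or the end of the string. The question also asks about splitting directly; `re.split` on a zero-width lookahead does that, but only on Python 3.7 or later, where splitting on empty matches is supported. A sketch of both:

```python
import re

s = '[aaa]foo[bbb]bar[ccc]foobar'

# Lazy body, forced to extend until the next '[' or the end of the string.
by_lookahead = re.findall(r'\[.*?\].*?(?=\[|$)', s)
print(by_lookahead)

# Split just before every '[' (needs Python 3.7+). The split at position 0
# yields an empty first piece, which is filtered out here.
by_split = [piece for piece in re.split(r'(?=\[)', s) if piece]
print(by_split)
```

Both variants keep punctuation in the text part, since they do not rely on `\w`. Note that `.` does not match newlines unless `re.DOTALL` is passed.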

Upvotes: 1
