Reputation: 4301
The code below builds a regex to find keyword substrings in a DataFrame column.
How can I modify the regex so that:
if x[0] is an English letter, that is, [a-zA-Z], the leading `\b` is kept, otherwise it is removed
AND
if x[-1] is an English letter, that is, [a-zA-Z], the trailing `\b` is kept, otherwise it is removed?
for k, v in keyword.items():
    pat = '|'.join(r"\b{}\b".format(x) for x in v)
    df[str(k)] = df['string'].str.contains(pat).astype(int)
For example, with
String = 'BEAUTY Company is good, 歡迎~~YOU, SALE'
the expected results are:
BEA: no match
Com: no match
歡迎: match
SALE: match
Thank you.
Upvotes: 1
Views: 1271
Reputation: 627292
You may use
pat = r'(?!(?<=[A-Za-z])[A-Za-z])(?:{})(?<![A-Za-z](?=[A-Za-z]))'.format("|".join([re.escape(x) for x in v]))
See the online regex demo.
The main thing here is the lookarounds, (?!(?<=[A-Za-z])[A-Za-z]) and (?<![A-Za-z](?=[A-Za-z])).
The (?!(?<=[A-Za-z])[A-Za-z]) is a negative lookahead that fails the match if the char immediately to the right of the current location (i.e. the first char of the keyword) is an ASCII letter that is preceded by another ASCII letter (checked with the positive lookbehind (?<=[A-Za-z])).
The (?<![A-Za-z](?=[A-Za-z])) is a negative lookbehind that fails the match if the char immediately to the left of the current location (i.e. the last char of the keyword) is an ASCII letter that is followed by another ASCII letter (checked with the positive lookahead (?=[A-Za-z])).
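To make the effect of the two lookarounds concrete, here is a small sketch comparing them with plain `\b` on a few made-up sample strings (the names LEFT, RIGHT and samples are only for illustration). In Python 3, `\b` is Unicode-aware by default, so CJK characters and digits count as word characters, while the lookarounds only look at ASCII letters.

```python
import re

# The two lookarounds described above, kept as separate pieces.
LEFT = r'(?!(?<=[A-Za-z])[A-Za-z])'
RIGHT = r'(?<![A-Za-z](?=[A-Za-z]))'

lookaround = re.compile(LEFT + r'(?:SALE)' + RIGHT)
word_boundary = re.compile(r'\bSALE\b')

# Made-up sample strings, chosen to show where the two approaches differ.
samples = ['SALE', 'SALES', 'xSALE', '歡SALE迎', 'SALE2']

# For each sample: (matched with lookarounds, matched with \b)
results = {
    text: (bool(lookaround.search(text)), bool(word_boundary.search(text)))
    for text in samples
}

for text, (with_lookarounds, with_b) in results.items():
    print(text, with_lookarounds, with_b)
```

Both variants reject SALES and xSALE, but only the lookaround version accepts SALE when it is glued to a CJK character or a digit.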
Note that you do not have to add these lookarounds to each alternative in the regex; just use them to enclose a (?:...|...) alternation group, which you may build dynamically as shown above.
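As a quick check of that point, the sketch below builds the pattern both ways (lookarounds repeated around every alternative versus wrapped once around the group) and compares the results on the sample string from the question. The variable names pat_each and pat_group are just illustrative.

```python
import re

LEFT = r'(?!(?<=[A-Za-z])[A-Za-z])'
RIGHT = r'(?<![A-Za-z](?=[A-Za-z]))'

v = ['BEA', 'Com', '歡迎', 'SALE']
s = 'BEAUTY Company is good, 歡迎~~YOU, SALE'

escaped = [re.escape(x) for x in v]

# Variant 1: lookarounds repeated around every single alternative.
pat_each = "|".join(LEFT + x + RIGHT for x in escaped)

# Variant 2: lookarounds written once, enclosing the alternation group.
pat_group = LEFT + "(?:" + "|".join(escaped) + ")" + RIGHT

found_each = re.findall(pat_each, s)
found_group = re.findall(pat_group, s)
print(found_each == found_group, found_group)
```

The grouped form is shorter and gives the same matches here, which is why there is no need to repeat the lookarounds per keyword.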
Also, [re.escape(x) for x in v] is handy if any of the keywords can contain special regex chars that should be treated as literal chars.
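A minimal sketch of why the escaping matters, using a hypothetical keyword 'U.S' (not from the question): without re.escape the dot matches any character, so the keyword also "finds" unrelated text.

```python
import re

LEFT = r'(?!(?<=[A-Za-z])[A-Za-z])'
RIGHT = r'(?<![A-Za-z](?=[A-Za-z]))'

keyword = 'U.S'  # hypothetical keyword containing a regex metacharacter

# Unescaped: "." is a wildcard. Escaped: "." must be a literal dot.
raw_pat = LEFT + "(?:" + keyword + ")" + RIGHT
safe_pat = LEFT + "(?:" + re.escape(keyword) + ")" + RIGHT

raw_hits = re.findall(raw_pat, 'buy UXS now')
safe_hits = re.findall(safe_pat, 'buy UXS now')
literal_hits = re.findall(safe_pat, 'the U.S market')

print(raw_hits)      # unescaped pattern matches the unrelated token
print(safe_hits)     # escaped pattern does not
print(literal_hits)  # escaped pattern still finds the literal keyword
```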
import re
s = 'BEAUTY Company is good, 歡迎~~YOU, SALE'
v = ['BEA','Com','歡迎','SALE']
pat = r'(?!(?<=[A-Za-z])[A-Za-z])(?:{})(?<![A-Za-z](?=[A-Za-z]))'.format("|".join([re.escape(x) for x in v]))
print(re.findall(pat, s)) # => ['歡迎', 'SALE']
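To connect this back to the loop in the question, here is a rough sketch that computes the same kind of 0/1 flag per keyword group, using plain lists instead of a DataFrame so it runs without pandas. The keyword dict, the second sample string and the build_pattern helper are assumptions made up for the example.

```python
import re

LEFT = r'(?!(?<=[A-Za-z])[A-Za-z])'
RIGHT = r'(?<![A-Za-z](?=[A-Za-z]))'

def build_pattern(words):
    """Hypothetical helper: compile one pattern for a list of keywords."""
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(LEFT + "(?:" + alternation + ")" + RIGHT)

# Assumed keyword groups; the question does not show the real dict.
keyword = {"brand": ["BEA", "Com"], "promo": ["歡迎", "SALE"]}
strings = ["BEAUTY Company is good, 歡迎~~YOU, SALE", "BEA Com only"]

# One list of 0/1 flags per keyword group, one flag per input string.
flags = {
    str(k): [int(bool(build_pattern(v).search(s))) for s in strings]
    for k, v in keyword.items()
}
print(flags)
```

With pandas, the same pattern string should be usable in the original loop via df['string'].str.contains(pat).astype(int); the pattern has no capturing groups, so str.contains should not warn about them.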
Upvotes: 3
Reputation: 373
You can do it like this:
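import re

# x is a single keyword string
if re.search(r'[a-zA-Z]', x[0]):
    print(x[0])    # first char is an English letter
else:
    x = x[1:]      # otherwise drop the first char
if re.search(r'[a-zA-Z]', x[-1]):
    print(x[-1])   # last char is an English letter
else:
    x = x[:-1]     # otherwise drop the last char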
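The same first/last character test could also be used to decide whether to add `\b` around each keyword, which is closer to what the question literally asks. This is only a sketch: the bounded helper is a made-up name, and `\b` in Python 3 treats digits, underscores and non-ASCII letters as word characters, so it is not identical to a letters-only check.

```python
import re

def bounded(x):
    """Hypothetical helper: add \\b only next to an English letter."""
    # Slicing (x[:1], x[-1:]) avoids an IndexError on an empty keyword.
    left = r'\b' if re.match(r'[a-zA-Z]', x[:1]) else ''
    right = r'\b' if re.match(r'[a-zA-Z]', x[-1:]) else ''
    return left + re.escape(x) + right

s = 'BEAUTY Company is good, 歡迎~~YOU, SALE'
v = ['BEA', 'Com', '歡迎', 'SALE']

pat = '|'.join(bounded(x) for x in v)
matches = re.findall(pat, s)
print(matches)
```

On the sample string from the question this yields 歡迎 and SALE and rejects BEA and Com.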
Upvotes: 1