Reputation: 471
I have a data frame:
d1 ={'letters':['ABCDE[NOT]FGH', 'CCGF[NOT]HI', 'MPJ[NOT]L', 'MNA[NOT]PLJKAJSHD']}
df1 = pd.DataFrame(d1)
df1:
letters
0 ABCDE[NOT]FGH
1 CCGF[NOT]HI
2 MPJ[NOT]L
3 MNA[NOT]PLJKAJSHD
I want to insert a space between the characters, except for those inside the [ ].
Desired output:
letters
0 A B C D E [NOT] F G H
1 C C G F [NOT] H I
2 M P J [NOT] L
3 M N A [NOT] P L J K A J S H D
I have tried:
matching = re.sub(r'[^a-zA-Z []]+(?![^{]*})(\w)', r'\1', i)
df1['letters'].apply(lambda x: matching)
But this does not seem to work. Any ideas?
Upvotes: 2
Views: 118
Reputation: 627103
You can append a space to each [...] substring or any other char found in the string and then rstrip the result:
>>> df1['letters'].str.replace(r'\[[^][]*]|.', r'\g<0> ', regex=True).str.rstrip()
0 A B C D E [NOT] F G H
1 C C G F [NOT] H I
2 M P J [NOT] L
3 M N A [NOT] P L J K A J S H D
Name: letters, dtype: object
See this regex demo.
Another way is to add spaces around any char other than those matched with the \[[^][]*] pattern, collapse the doubled spaces that two adjacent padded chars produce, and then str.strip() the result:
>>> df1['letters'].str.replace(r'(\[[^][]*])|.', lambda x: x.group(1) if x.group(1) else f" {x.group()} ", regex=True).str.replace(r' {2,}', ' ', regex=True).str.strip()
0 A B C D E [NOT] F G H
1 C C G F [NOT] H I
2 M P J [NOT] L
3 M N A [NOT] P L J K A J S H D
Name: letters, dtype: object
The (\[[^][]*])|. regex matches and captures into Group 1 a [, then any zero or more chars other than [ and ], and then a ] char, or any char other than a line break char, and replaces with the Group 1 value if it was captured or with "space" + match value + "space" otherwise.
The str.strip() removes leading/trailing spaces if any arise from the replacing operation.
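To make the alternation easier to follow, here is a minimal sketch that lists what the pattern matches on one of the question's strings and which matches land in Group 1. The helper names classify and pad are made up for this illustration; the pattern and the replacement logic are the ones described above. It prints the raw result of the substitution before any cleanup.

```python
import re

# Pattern from the answer above: a bracketed group (captured as group 1)
# or, failing that, any single character other than a line break.
PATTERN = re.compile(r'(\[[^][]*])|.')

def classify(text):
    """List each match with a flag telling whether group 1 captured it."""
    return [(m.group(), m.group(1) is not None) for m in PATTERN.finditer(text)]

def pad(match):
    """Replacement callable: keep bracketed groups, pad anything else."""
    return match.group(1) if match.group(1) else f" {match.group()} "

sample = 'MPJ[NOT]L'
print(classify(sample))
# [('M', False), ('P', False), ('J', False), ('[NOT]', True), ('L', False)]
print(repr(PATTERN.sub(pad, sample)))
# ' M  P  J [NOT] L '
```

Note that two padded chars in a row leave two spaces between them, so the raw result needs more than a strip() if single spaces are required.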
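Non-Pandas code
import re
text = 'ABCDE[NOT]FGH'  # sample input from the question
# Solution 1: append a space to each match, then drop the trailing one
spaced = re.sub(r'\[[^][]*]|.', r'\g<0> ', text).rstrip()
# Solution 2: pad non-bracket chars, collapse doubled spaces, then strip
padded = re.sub(r'(\[[^][]*])|.', lambda x: x.group(1) if x.group(1) else f" {x.group()} ", text)
spaced = re.sub(r' {2,}', ' ', padded).strip()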
Upvotes: 3
Reputation: 71461
You can use re.findall:
import pandas as pd, re
d1 = {'letters':['ABCDE[NOT]FGH', 'CCGF[NOT]HI', 'MPJ[NOT]L', 'MNA[NOT]PLJKAJSHD']}
df1 = pd.DataFrame(d1)
df1['letters'] = df1['letters'].apply(lambda x: ' '.join(re.findall(r'\[\w+\]|\w', x)))
letters
0 A B C D E [NOT] F G H
1 C C G F [NOT] H I
2 M P J [NOT] L
3 M N A [NOT] P L J K A J S H D
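One thing worth knowing about this approach: re.findall only returns what the pattern matches, so any character that is neither a word character nor part of a [word] group is dropped from the result. A small sketch, assuming a hypothetical input with a hyphen that does not appear in the question (join_tokens is just a name chosen for this example):

```python
import re

def join_tokens(text):
    """The re.findall approach from this answer, with a raw-string pattern."""
    return ' '.join(re.findall(r'\[\w+\]|\w', text))

# Input from the question.
print(join_tokens('MPJ[NOT]L'))   # M P J [NOT] L

# Hypothetical input, not from the question: the hyphen matches neither
# alternative, so it is silently dropped from the output.
print(join_tokens('MP-J[NOT]L'))  # M P J [NOT] L
```

For the data in the question this makes no difference, since every string consists of letters and one [NOT] group.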
Upvotes: 0
Reputation: 20747
Albeit wildly inefficient, you could use this and avoid post-processing:
(?=(?!^)[^\[\]]*?\[|[^\[\]]+$)
(?= - start a lookahead
(?!^) - do not assert the start of the string
[^\[\]]*?\[ - assert any position leading up to an opening bracket [
| - or
[^\[\]]+$ - assert any position which is not a bracket leading up to the end of the line
) - close the lookahead
https://regex101.com/r/zoHEne/1/
Note: The regex101 example has trailing spaces only because all the lines are tested together as one multiline input. Test each line one at a time to see no trailing spaces.
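As a rough sketch of how this lookahead could be used from Python: since it matches positions rather than characters, substituting a single space at every match should give the spaced string directly. The names GAP and space_out are chosen for this example only.

```python
import re

# Lookahead from this answer; it matches zero-width positions only.
GAP = re.compile(r'(?=(?!^)[^\[\]]*?\[|[^\[\]]+$)')

def space_out(text):
    """Insert a space at every position the lookahead accepts."""
    return GAP.sub(' ', text)

for s in ['ABCDE[NOT]FGH', 'CCGF[NOT]HI', 'MPJ[NOT]L', 'MNA[NOT]PLJKAJSHD']:
    print(space_out(s))
# A B C D E [NOT] F G H
# C C G F [NOT] H I
# M P J [NOT] L
# M N A [NOT] P L J K A J S H D
```

With pandas, the same pattern could presumably be passed as df1['letters'].str.replace(GAP.pattern, ' ', regex=True), with each row processed as its own string.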
Upvotes: 0