Chip

Reputation: 471

Python using re to create spaces between characters except for those in square brackets

I have a data frame:

import pandas as pd
d1 = {'letters': ['ABCDE[NOT]FGH', 'CCGF[NOT]HI', 'MPJ[NOT]L', 'MNA[NOT]PLJKAJSHD']}
df1 = pd.DataFrame(d1)

df1:

letters
0   ABCDE[NOT]FGH
1   CCGF[NOT]HI
2   MPJ[NOT]L
3   MNA[NOT]PLJKAJSHD

I want to create a space between each character except for those between the [ ].

Desired output:

letters
0   A B C D E [NOT] F G H
1   C C G F [NOT] H I
2   M P J [NOT] L
3   M N A [NOT] P L J K A J S H D

I have tried:

matching = re.sub(r'[^a-zA-Z []]+(?![^{]*})(\w)', r'\1', i)

df1['letters'].apply(lambda x: matching)

But this does not seem to work. Any ideas?

Upvotes: 2

Views: 118

Answers (3)

Wiktor Stribiżew

Reputation: 627103

You can append a space to each [...] substring or any other char found in the string and then rstrip the result:

>>> df1['letters'].str.replace(r'\[[^][]*]|.', r'\g<0> ', regex=True).str.rstrip()
0            A B C D E [NOT] F G H
1                C C G F [NOT] H I
2                    M P J [NOT] L
3    M N A [NOT] P L J K A J S H D
Name: letters, dtype: object

See this regex demo.

Another way is to add a space on each side of any char that is not part of a \[[^][]*] match, and then str.strip() the result. Note that adjacent chars outside the brackets end up separated by two spaces:

>>> df1['letters'].str.replace(r'(\[[^][]*])|.', lambda x: x.group(1) if x.group(1) else f" {x.group()} ", regex=True).str.strip()
0                A  B  C  D  E [NOT] F  G  H
1                      C  C  G  F [NOT] H  I
2                            M  P  J [NOT] L
3    M  N  A [NOT] P  L  J  K  A  J  S  H  D
Name: letters, dtype: object

The (\[[^][]*])|. regex either matches and captures into Group 1 a [, then zero or more chars other than [ and ], and then a ] char, or matches any single char other than a line break char. The match is replaced with the Group 1 value if it was captured, or with "space" + match value + "space" otherwise.

The str.strip() removes leading/trailing spaces if any arise from the replacing operation.
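To see which alternative fires at each position, here is a minimal sketch that lists every match together with its Group 1 value. The helper names describe_matches and space_out and the sample string 'AB[NOT]C' are only illustrative and are not part of the original answer.

```python
import re

# Same pattern as above: a [...] block captured into Group 1, or any single char.
PATTERN = re.compile(r'(\[[^][]*])|.')

def describe_matches(text):
    """Return (matched text, Group 1 value) for every match, left to right."""
    return [(m.group(), m.group(1)) for m in PATTERN.finditer(text)]

def space_out(text):
    # Keep a captured [...] block as is; wrap any other char in spaces.
    return PATTERN.sub(
        lambda m: m.group(1) if m.group(1) else f" {m.group()} ", text
    ).strip()

print(describe_matches('AB[NOT]C'))
# [('A', None), ('B', None), ('[NOT]', '[NOT]'), ('C', None)]
print(space_out('AB[NOT]C'))
# A  B [NOT] C
```

Because the bracket alternative is listed first, the regex engine consumes the whole [NOT] block in one match before the . alternative gets a chance to split it.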

Non-Pandas code

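import re

text = 'ABCDE[NOT]FGH'  # sample input

# Solution 1: append a space to every match, then trim the trailing space
result1 = re.sub(r'\[[^][]*]|.', r'\g<0> ', text).rstrip()

# Solution 2: wrap every char outside [...] in spaces, then trim both ends
result2 = re.sub(r'(\[[^][]*])|.', lambda x: x.group(1) if x.group(1) else f" {x.group()} ", text).strip()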

Upvotes: 3

Ajax1234

Reputation: 71461

You can use re.findall:

import pandas as pd, re
d1 = {'letters':['ABCDE[NOT]FGH', 'CCGF[NOT]HI', 'MPJ[NOT]L', 'MNA[NOT]PLJKAJSHD']}
df1 = pd.DataFrame(d1)
df1['letters'] = df1['letters'].apply(lambda x: ' '.join(re.findall(r'\[\w+\]|\w', x)))
                         letters
0          A B C D E [NOT] F G H
1              C C G F [NOT] H I
2                  M P J [NOT] L
3  M N A [NOT] P L J K A J S H D
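One caveat with this approach, shown as a small sketch: re.findall only returns what the pattern matches, so any char that is not a word char (a hyphen, for example) is silently dropped. If that matters for your data, one option (an assumption on my part, not something the question asks for) is to use a pattern that matches any char, such as \[[^][]*]|. instead. The function names and the 'A-B[NOT]C' sample below are hypothetical.

```python
import re

def space_out_word_chars(text):
    # Pattern from this answer: only [word chars] blocks and single word chars.
    return ' '.join(re.findall(r'\[\w+\]|\w', text))

def space_out_any_chars(text):
    # Variant: a [...] block, or any single char (except a line break).
    return ' '.join(re.findall(r'\[[^][]*]|.', text))

print(space_out_word_chars('ABCDE[NOT]FGH'))  # A B C D E [NOT] F G H
print(space_out_word_chars('A-B[NOT]C'))      # A B [NOT] C   (hyphen lost)
print(space_out_any_chars('A-B[NOT]C'))       # A - B [NOT] C
```

For the sample data in the question, which contains only letters and [NOT], both functions give the same result.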

Upvotes: 0

MonkeyZeus

Reputation: 20747

Although it is wildly inefficient, you could replace every match of this zero-width lookahead with a single space and avoid post-processing:

(?=(?!^)[^\[\]]*?\[|[^\[\]]+$)
  • (?= - start a lookahead
    • (?!^) - assert that the current position is not the start of the string
    • [^\[\]]*?\[ - assert any position leading up to an opening bracket [
    • | - or
    • [^\[\]]+$ - assert any position which is not a bracket leading up to the end of the line
  • ) - close the lookahead

https://regex101.com/r/zoHEne/1/

Note: The regex101 example has trailing spaces only because of the multiline flag. Test each line one at a time to see that no trailing spaces are produced.
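In Python, the lookahead can be applied with re.sub by replacing each zero-width match with a space. This is only a sketch, assuming Python 3.7+ and one string per call (no multiline flag); the names BOUNDARY and space_out are illustrative.

```python
import re

# Zero-width positions: before each char that leads up to a "[" (but not at
# the start of the string), or before each char in the bracket-free tail.
BOUNDARY = re.compile(r'(?=(?!^)[^\[\]]*?\[|[^\[\]]+$)')

def space_out(text):
    # Insert a single space at every matched position.
    return BOUNDARY.sub(' ', text)

for s in ['ABCDE[NOT]FGH', 'CCGF[NOT]HI', 'MPJ[NOT]L', 'MNA[NOT]PLJKAJSHD']:
    print(space_out(s))
# A B C D E [NOT] F G H
# C C G F [NOT] H I
# M P J [NOT] L
# M N A [NOT] P L J K A J S H D
```

With a DataFrame, the same pattern string should work with df1['letters'].str.replace(pattern, ' ', regex=True).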

Upvotes: 0
