Darren Lanigan

Reputation: 15

How to isolate only the first space in a string using regex?

I have a foreign-language-to-English dictionary that I'm trying to import into a SQL database. The dictionary is a text file whose lines look like this:

field1 field2 [romanization] /definition 1/definition 2/definition 3/

I'm using regex in Python to identify the delimiters. So far I've been able to isolate every delimiter except for the space between field 1 and field 2.

(?<=\S)\s\[|\]\s/(?=[A-Za-z])|/
# (?<=\S)\s\[          matches the opening square bracket after field 2
# \]\s/(?=[A-Za-z])    matches the closing square bracket after the romanization
# /                    matches the forward slashes between definitions
# ?????????            should match the space between field 1 and field 2

Upvotes: 1

Views: 73

Answers (2)

user557597

Reputation:

If Python supports the \K construct, this will work.
This construct is a poor man's version of a variable-length lookbehind: everything matched before \K is dropped from the reported match.

 # (?m)(?:^[^\s\[\]/]+\K\s|(?<=\S)\s\[|\]\s/(?=[A-Za-z])|/)

 (?m)
 (?:
      ^ [^\s\[\]/]+ 
      \K 
      \s 
   |  
      (?<= \S )
      \s \[
   |  
      \] \s /
      (?= [A-Za-z] )
   |  
      /
 )

Apparently, Python's built-in re module does not have this construct, but the
third-party regex module supports variable-length lookbehinds.

http://pypi.python.org/pypi/regex

 # (?m)(?:(?<=^[^\s\[\]/]+)\s|(?<=\S)\s\[|\]\s/(?=[A-Za-z])|/)

 (?m)
 (?:
      (?<= ^ [^\s\[\]/]+ )
      \s 
   |  
      (?<= \S )
      \s \[
   |  
      \] \s /
      (?= [A-Za-z] )
   |  
      /
 )
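If installing a third-party module is not an option, one workaround with the built-in re module is to handle the first space outside the regex. The following is a minimal sketch, not part of the answer above: it swaps the \K / lookbehind technique for a plain str.split with maxsplit=1, then applies the question's original pattern to the remainder. The names REST_DELIMS and split_entry are hypothetical, and it assumes every line starts with field 1 followed by whitespace.

```python
import re

# The delimiters from the question, i.e. everything except the first space.
REST_DELIMS = re.compile(r'(?<=\S)\s\[|\]\s/(?=[A-Za-z])|/')


def split_entry(line):
    """Split one dictionary line into a flat list of fields (hypothetical helper)."""
    # maxsplit=1 cuts only at the first run of whitespace,
    # which is the delimiter between field 1 and field 2.
    field1, rest = line.strip().split(None, 1)
    # Split the remainder on the other delimiters and drop the empty
    # string left behind by the trailing slash.
    parts = [p for p in REST_DELIMS.split(rest) if p]
    return [field1] + parts


if __name__ == '__main__':
    sample = 'field1 field2 [romanization] /definition 1/definition 2/definition 3/'
    print(split_entry(sample))
```

This keeps the regex unchanged and avoids the need for a variable-length lookbehind, at the cost of treating the first delimiter as a special case in code.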

Upvotes: 2

Finwood

Reputation: 3991

You could try this regex; it isolates all fields and delimiters:

import re

preg = re.compile(r'^(?P<field1>\S+)(?P<delim1>\s+)'
                  r'(?P<field2>\S+)(?P<delim2>\s+)'
                  r'\[(?P<romanization>\S+)\](?P<delim3>\s+)'
                  r'/(?P<def1>[^/]+)/(?P<def2>[^/]+)/(?P<def3>[^/]+)')
lines = ['field1 field2 [romanization] /def 1/def 2/def 3/',
         'Foo Bar  [Foobar]\t/stuff/content/nonsense/']

for line in lines:
    m = preg.match(line)
    if m is not None:
        print(m.groupdict())

Your first delimiter, for example, would be in m.group('delim1').
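The pattern above expects exactly three definitions and a romanization without spaces. If real entries vary, a variation along these lines might help. It is only a sketch under those assumptions: ENTRY and parse_entry are hypothetical names, the definitions are captured as one block and split on "/" afterwards, and the delimiter groups are dropped for brevity.

```python
import re

# Hypothetical variant: one group for the whole definitions block,
# and a romanization that may contain spaces.
ENTRY = re.compile(r'^(?P<field1>\S+)\s+(?P<field2>\S+)\s+'
                   r'\[(?P<romanization>[^\]]*)\]\s+'
                   r'/(?P<definitions>.*)/\s*$')


def parse_entry(line):
    """Return a dict of fields, or None if the line does not match."""
    m = ENTRY.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Split the captured block into individual definitions.
    entry['definitions'] = entry['definitions'].split('/')
    return entry


if __name__ == '__main__':
    for line in ['field1 field2 [romanization] /def 1/def 2/def 3/',
                 'Foo Bar  [Foo bar]\t/stuff/content/']:
        print(parse_entry(line))
```

Lines that do not fit the expected layout return None, so they can be logged and inspected before the import into the database.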

Upvotes: 0
