myeewyee

Reputation: 767

Python truncate string at regex defined index

I have a list of strings such as

2007 ford falcon xr8 ripcurl bf mkii utility 5.4l v8 cyl 6 sp manual bionic 
2004 nissan x-trail ti 4x4 t30 4d wagon 2.5l 4 cyl 5 sp manual twilight 
2002 subaru liberty rx my03 4d sedan 2.5l 4 cyl 5 sp manual silver 

I want to truncate the string at either the engine capacity (5.4l, 2.5l) or body type (4d wagon, 4d sedan), whichever comes first. So output should be:

2007 ford falcon xr8 ripcurl bf mkii utility
2004 nissan x-trail ti 4x4 t30 
2002 subaru liberty rx my03

I figure I will create a list of words with .split(' '). However, my problem is how to stop at an x.xl or xd word, where x could be any number. What sort of regex would pick this up?

Upvotes: 2

Views: 136

Answers (2)

vks

Reputation: 67968

^.*?(?=\s*\d+d\s+(?:wagon|sedan)|\s*\d+(?:\.\d+)?l)

You can use this. See the demo:

https://regex101.com/r/aC0uK6/1

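import re

# The ur'' prefix is Python 2 only (a syntax error in Python 3); a plain raw string works in both.
p = re.compile(r'^.*?(?=\s*\d+d\s+(?:wagon|sedan)|\s*\d+(?:\.\d+)?l)', re.MULTILINE)
test_str = "2007 ford falcon xr8 ripcurl bf mkii utility 5.4l v8 cyl 6 sp manual bionic \n2004 nissan x-trail ti 4x4 t30 4d wagon 2.5l 4 cyl 5 sp manual twilight \n2002 subaru liberty rx my03 4d sedan 2.5l 4 cyl 5 sp manual silver "

# With re.MULTILINE, ^ anchors at the start of every line, so findall returns one truncated prefix per line.
print(p.findall(test_str))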
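If the strings are already in a list, as in the question, the same lookahead pattern can be applied to each item in turn. This is only a minimal sketch: the helper name truncate_with_lookahead is made up for illustration, and returning the stripped input when nothing matches is an assumption about what you would want in that case.

```python
import re

# Same lookahead as above; re.MULTILINE is not needed for single-line strings.
PREFIX = re.compile(r"^.*?(?=\s*\d+d\s+(?:wagon|sedan)|\s*\d+(?:\.\d+)?l)")


def truncate_with_lookahead(text):
    """Hypothetical helper: return the part of text before the first stop token."""
    m = PREFIX.match(text)
    # Assumption: with no engine capacity or body type, keep the whole string.
    return m.group(0) if m else text.strip()


items = [
    "2007 ford falcon xr8 ripcurl bf mkii utility 5.4l v8 cyl 6 sp manual bionic ",
    "2004 nissan x-trail ti 4x4 t30 4d wagon 2.5l 4 cyl 5 sp manual twilight ",
    "2002 subaru liberty rx my03 4d sedan 2.5l 4 cyl 5 sp manual silver ",
]
for item in items:
    print(truncate_with_lookahead(item))
```

Because the \s* sits inside the lookahead, the lazy .*? stops before the space that precedes the stop token, so the results carry no trailing whitespace.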

Upvotes: 1

alecxe

Reputation: 473863

One option would be to use re.sub() to replace everything from the first engine-capacity word (a number followed by l) or body-type phrase (a number followed by d, then wagon or sedan) to the end of the string with an empty string:

>>> import re
>>>
>>> l = ["2007 ford falcon xr8 ripcurl bf mkii utility 5.4l v8 cyl 6 sp manual bionic ", "2004 nissan x-trail ti 4x4 t30 4d wagon 2.5l 4 cyl 5 sp manual twilight ", "2002 subaru liberty rx my03 4d sedan 2.5l 4 cyl 5 sp manual silver"]
>>> for item in l:
...     print(re.sub(r"(\b[0-9.]+l\b|\d+d (?:wagon|sedan)).*$", "", item))
... 
2007 ford falcon xr8 ripcurl bf mkii utility 
2004 nissan x-trail ti 4x4 t30 
2002 subaru liberty rx my03 

where:

  • \b[0-9.]+l\b matches a word made of one or more digits or dots and ending with l
  • \d+d (?:wagon|sedan) matches one or more digits followed by the letter d, a space, and then wagon or sedan; (?:...) is a non-capturing group
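The output above keeps the space that preceded the matched word. A possible variation, sketched below, compiles the pattern once, lets a leading \s* consume that space, and builds the body-type alternation from a tuple. The names BODY_TYPES, CUT and truncate_with_sub are hypothetical, and the stricter \d+(?:\.\d+)?l in place of [0-9.]+l is my own substitution, not part of the answer's pattern.

```python
import re

# Body types taken from the question; extend the tuple for other data.
BODY_TYPES = ("wagon", "sedan")

# Cut from the first engine capacity (e.g. 5.4l) or body type (e.g. 4d wagon)
# to the end of the string, including the whitespace just before it.
CUT = re.compile(
    r"\s*(?:\b\d+(?:\.\d+)?l\b|\b\d+d (?:"
    + "|".join(re.escape(b) for b in BODY_TYPES)
    + r")\b).*$"
)


def truncate_with_sub(item):
    """Hypothetical helper: drop everything from the first stop word onward."""
    # rstrip() also tidies strings that contain no stop word at all.
    return CUT.sub("", item).rstrip()


print(truncate_with_sub("2004 nissan x-trail ti 4x4 t30 4d wagon 2.5l 4 cyl 5 sp manual twilight "))
# 2004 nissan x-trail ti 4x4 t30
```

The \b anchors keep 4x4 and xr8 from being treated as stop words, since neither is a whole word ending in l or a number followed by d.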

Upvotes: 2
