Matrix166

Reputation: 79

Two very similar regexes, but one of them cannot find a match

I'm trying to match a shortName field in a JSON-like string (it is no longer valid JSON, hence the regex). Running a regex here might not be the most efficient way. I'm open to suggestions, but I also WANT a solution to the original problem.

I'm using Python 2.7 and Scrapy, running PyCharm 2018.2

What I want: Get matches from the huge JSON-like file full of restaurants, put every match into a list, iterate over the list, and collect the data from the different fields into variables for future use. We don't go that far here, though.

I want to match the shortName-field, and pull out the value/data from it.

The code samples below start from the point where the huge file is already received (in unicode or string), and we start to match for restaurant specific data fields. In the actual pattern, I tried to escape, and not to escape, the " and : symbols.

What I have: Regex101 (below)

This is the actual regex I'm trying to fix. It fails with "'NoneType' object has no attribute 'group'".

Note that the "pattern" on the first line works and brings me the data which I then go through in the for-loop. I don't believe the problem lies there.

regex = re.compile(pattern, re.MULTILINE)
for match in regex.finditer(r.text):
  restaurant = match.group()
  restaurant = str(restaurant)
  print restaurant
  print type(restaurant)

  name = re.search(r'(?<=shortName\":\")(.*?)(?=\")', restaurant,
                   re.MULTILINE | re.DOTALL).group()

Source sample:

156,"mainGroupId":1,"menuTypeId":1,"shopExternalId":"0001","displayName":"Lorem Ipsum","shortName":"I WANT THIS TEXT HERE","streetAddress":"BlankStreet 5","zip":"1211536","city":"Wonderland",

Test regex, which works for a fixed source sample. NOTE: The source sample for this one was formatted with \ by regex101, as I first had every " and : escaped with \. I copied this straight from their code generator, and it does work in code:

testregex = r'(?<=shortName\"\:\")(.*?)(?=\")'

test_str = ("156,\"mainGroupId\":1,\"menuTypeId\":1,\"shopExternalId\":\"0001\",\"displayName\":\"Lorem Ipsum\",\"shortName\":\"I CAN GET THIS MATCHED \",\"streetAddress\":\"BlankStreet 6\",\"zip\":\"2136481\",\"city\":\"Wonderland\"")

matches = re.search(testregex, test_str, re.MULTILINE | re.DOTALL).group()
print matches
restaurantname = matches

What is the problem: The upper regex fails with the "'NoneType' object has no attribute 'group'" error. The lower regex gets me the data I want; in this example it prints out "I CAN GET THIS MATCHED".

I am well aware that there might be small syntax problems, as I've been trying to fix this for some time.

Thank you in advance. The more detailed the answer, the better. If you have a different approach to the problem, please include code so I can learn from it.

Upvotes: 1

Views: 121

Answers (1)

Wiktor Stribiżew

Reputation: 627103

Your regex does not match your string. There is no shopID in the input.
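As for the error message itself: re.search returns None when nothing matches, and calling .group() on None raises the AttributeError from the question. Below is a minimal sketch of guarding against that, assuming the shortName pattern from the question and shortened copies of the posted sample; short_name_or_none is a hypothetical helper name, not part of the original code.

```python
import re

# Pattern copied from the question. The backslashes before the quotes are
# redundant inside a raw string, but harmless.
pattern = re.compile(r'(?<=shortName\":\")(.*?)(?=\")', re.DOTALL)

# Shortened copies of the sample from the question (illustrative data only).
sample = '"displayName":"Lorem Ipsum","shortName":"I WANT THIS TEXT HERE","streetAddress":"BlankStreet 5",'
no_field = '"displayName":"Lorem Ipsum","streetAddress":"BlankStreet 5",'

def short_name_or_none(text):
    """Return the shortName value, or None when the field is absent."""
    m = pattern.search(text)
    # Only call .group() when there really is a match object.
    return m.group() if m is not None else None

found = short_name_or_none(sample)
missing = short_name_or_none(no_field)
print(found)
print(missing)
```

If the pattern matches the posted sample but not the real data, printing repr(restaurant) just before the search should show what the text around shortName actually looks like.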

You may get all your restaurant names directly with one re.findall call using the following regex:

shortName":"([^"]+)

See the regex demo. Details:

  • shortName":" - a literal substring
  • ([^"]+) - Capturing group 1 (the result of the re.findall call will be the substrings captured into this Group): 1 or more chars other than ".

See Python demo:

import re
regex = re.compile(r'shortName":"([^"]+)')
print(regex.findall('156,"mainGroupId":1,"menuTypeId":1,"shopExternalId":"0001","displayName":"Lorem Ipsum","shortName":"I WANT THIS TEXT HERE","streetAddress":"BlankStreet 5","zip":"1211536","city":"Wonderland",'))
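The same idea can be extended to the other fields mentioned in the question. The following is only a sketch under stated assumptions: field_value is a hypothetical helper, the record is the sample from the question, and only quoted string values are handled (numeric values such as mainGroupId would need a different pattern).

```python
import re

def field_value(record, field):
    """Return the quoted string value of `field` in `record`, or None."""
    # re.escape guards against field names containing regex metacharacters.
    m = re.search(r'"%s":"([^"]*)' % re.escape(field), record)
    return m.group(1) if m else None

# Sample record taken from the question.
record = ('156,"mainGroupId":1,"menuTypeId":1,"shopExternalId":"0001",'
          '"displayName":"Lorem Ipsum","shortName":"I WANT THIS TEXT HERE",'
          '"streetAddress":"BlankStreet 5","zip":"1211536","city":"Wonderland",')

# Collect a few fields of one restaurant into a dict for later use.
fields = ["displayName", "shortName", "city"]
restaurant = dict((name, field_value(record, name)) for name in fields)
print(restaurant)
```

A field that is missing from the record simply comes back as None instead of raising an error.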

Upvotes: 2
