Reputation: 79
I'm trying to match a shortName field in a JSON-ish string (it is no longer valid JSON, hence the regex). Running a regex here might not be the most efficient approach. I'm open to suggestions, but I also WANT a solution to the original problem.
I'm using Python 2.7 and Scrapy, running PyCharm 2018.2
What I want: Get the matches from a huge JSON-ish file full of restaurants, put every match into a list, iterate over the list, and collect the data from the different fields into variables for future use. We don't go that far here, though.
I want to match the shortName field and pull out its value.
The code samples below start from the point where the huge file has already been received (as unicode or a string) and we start matching restaurant-specific data fields. I tried the actual pattern both with and without escaping the " and : symbols.
What I have: Regex101 (below)
Below is the actual regex I'm trying to fix, which fails with "'NoneType' object has no attribute 'group'".
Note that the "pattern" on the first line works and returns the data that I then go through in the for loop. I don't believe the problem lies there.
regex = re.compile(pattern, re.MULTILINE)
for match in regex.finditer(r.text):
    restaurant = match.group()
    restaurant = str(restaurant)
    print restaurant
    print type(restaurant)
    name = re.search(r'(?<=shortName\":\")(.*?)(?=\")', restaurant,
                     re.MULTILINE | re.DOTALL).group()
Source sample:
156,"mainGroupId":1,"menuTypeId":1,"shopExternalId":"0001","displayName":"Lorem Ipsum","shortName":"I WANT THIS TEXT HERE","streetAddress":"BlankStreet 5","zip":"1211536","city":"Wonderland",
Test regex, which works for a fixed source sample. NOTE: The source sample for this one was formatted with \ by regex101, as I first had every " and : escaped with \. I copied this straight from their code generator, and it does work in code:
testregex = r'(?<=shortName\"\:\")(.*?)(?=\")'
test_str = (
    "156,\"mainGroupId\":1,\"menuTypeId\":1,\"shopExternalId\":\"0001\",\"displayName\":\"Lorem Ipsum\",\"shortName\":\"I CAN GET THIS MATCHED \",\"streetAddress\":\"BlankStreet 6\",\"zip\":\"2136481\",\"city\":\"Wonderland\"")
matches = re.search(testregex, test_str, re.MULTILINE | re.DOTALL).group()
print matches
restaurantname = matches
What is the problem: The upper regex raises the "'NoneType' object has no attribute 'group'" error. The lower regex gets me the data I want; in this example it prints "I CAN GET THIS MATCHED".
I am well aware that there might be small syntax problems, as I've been trying to fix this for some time.
Thank you in advance. The more detailed the answer, the better. If you have a different approach to the problem, please include code so I can learn from it.
Upvotes: 1
Views: 121
Reputation: 627103
Your regex does not match your string. There is no shopID in the input.
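The error message itself follows from that mismatch: re.search returns None when nothing matches, and calling .group() on None raises the AttributeError. A minimal sketch of checking the result first, assuming a shortened version of the sample from the question (the shopID pattern below is only a hypothetical stand-in for a pattern that looks for a missing key):

```python
import re

# Shortened sample based on the question; it contains no "shopID" key.
sample = '156,"mainGroupId":1,"shortName":"I WANT THIS TEXT HERE","city":"Wonderland",'

# Hypothetical pattern that looks for a key the sample lacks.
missing = re.search(r'shopID":"([^"]+)', sample)
print(missing)  # re.search returns None when nothing matches

# Checking the result before calling .group() avoids the AttributeError.
found = re.search(r'shortName":"([^"]+)', sample)
name = found.group(1) if found else None
print(name)
```

With this guard, a restaurant record that lacks the field yields None instead of stopping the loop.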
You may get all your restaurant names directly with one re.findall call using the following regex:
shortName":"([^"]+)
See the regex demo. Details:
- shortName":" - a literal substring
- ([^"]+) - Capturing group 1 (the result of the re.findall call will be the substrings captured into this group): 1 or more chars other than "
See Python demo:
import re
regex = re.compile(r'shortName":"([^"]+)')
print(regex.findall('156,"mainGroupId":1,"menuTypeId":1,"shopExternalId":"0001","displayName":"Lorem Ipsum","shortName":"I WANT THIS TEXT HERE","streetAddress":"BlankStreet 5","zip":"1211536","city":"Wonderland",'))
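The same idea may extend to the other fields the question wants to collect. The following is only a sketch: extract_field is a hypothetical helper name, and it assumes the values of interest are double-quoted strings with no escaped quotes inside them (numeric values such as mainGroupId are not matched).

```python
import re

def extract_field(text, key):
    """Return every quoted string value stored under `key` (hypothetical helper)."""
    # re.escape guards against keys that contain regex metacharacters.
    pattern = re.compile('"' + re.escape(key) + '":"([^"]*)"')
    return pattern.findall(text)

# Source sample copied from the question.
source = ('156,"mainGroupId":1,"menuTypeId":1,"shopExternalId":"0001",'
          '"displayName":"Lorem Ipsum","shortName":"I WANT THIS TEXT HERE",'
          '"streetAddress":"BlankStreet 5","zip":"1211536","city":"Wonderland",')

names = extract_field(source, 'shortName')
cities = extract_field(source, 'city')
print(names)
print(cities)
```

Because findall returns an empty list when nothing matches, there is no None to guard against, which sidesteps the 'NoneType' error from the question.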
Upvotes: 2