Reputation: 1777
I have a pattern like this in a txt file:
["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]
and I need a regex to extract each field in Python. Every field can contain any character (not only alphanumeric) except the 4th, which is a long number. How can I do it? Many thanks.
EDIT: the file also contains other HTML elements, which is why I can't parse it directly into a Python list.
Upvotes: 1
Views: 1506
Reputation: 20553
I'm going to combine re, try/except, ast.literal_eval and reading the whole file at once to pick up every possible list. Reading everything in one go, rather than with readline, is meant to cope with a [ ] pair that spans several lines.
Here is my solution:
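import re
import ast

# Grab every bracketed candidate in the file.
# Note: '.' does not match newlines unless re.DOTALL is passed,
# so each match stays within a single line.
with open('yourfile.txt', 'r') as f:
    found = re.findall(r'\[.*\]', f.read())

for each in found:
    try:
        # literal_eval safely parses the text as a Python literal
        for el in ast.literal_eval(each):
            print(el)
    except (SyntaxError, ValueError):
        # not a valid list literal (e.g. surrounding HTML); skip it
        pass
Output (Python 3):
kiarix moreno
116224357500406255237
z120gbkosz2oc3ckv23bc10hhwrudlcjy04
1409770337
com.youtube.www/watch?v=p1JPKLa-Ofc:https
es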
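If a list really can span several lines, one possible variation is a non-greedy pattern with re.DOTALL, so that '.' also matches newlines and each match stops at the first closing bracket. This is only a sketch: find_lists and the sample text are made up for illustration, and a field that itself contains ']' would cut the match short.

```python
import ast
import re

def find_lists(text):
    """Return every bracketed span in text that parses as a Python literal."""
    results = []
    # Non-greedy + DOTALL: shortest [ ... ] span, newlines allowed inside
    for candidate in re.findall(r'\[.*?\]', text, re.DOTALL):
        try:
            value = ast.literal_eval(candidate)
        except (SyntaxError, ValueError):
            # Bracketed text that is not a valid literal; skip it
            continue
        results.append(value)
    return results

# Hypothetical file contents: one list split over three lines, plus HTML noise
sample = '<p>["kiarix moreno",\n"es",\n1409770337]</p>\n<a href="#">[oops]</a>'
print(find_lists(sample))
```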
Upvotes: 0
Reputation: 22001
The following provides three different options for getting your data:
>>> TEXT = '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]'
>>> import json, ast, re
>>> json.loads(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> ast.literal_eval(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> re.search(r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]', TEXT).groupdict()
{'website': 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'number2': '1409770337', 'language': 'es', 'data': 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 'number1': '116224357500406255237', 'name': 'kiarix moreno'}
>>>
In particular, your regular expression would be the following: r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]'
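Since the file also contains other HTML, the same pattern can be applied to the whole text with re.finditer instead of re.search. The sketch below assumes that approach; extract_records, RECORD and the sample string are illustrative names, not part of the original answer. Note that [^"]* cannot cope with escaped quotes inside a field.

```python
import re

# Same pattern as in the answer, split over two lines for readability
RECORD = re.compile(
    r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",'
    r'(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]'
)

def extract_records(text):
    """Return one dict per matching record found anywhere in text."""
    records = []
    for match in RECORD.finditer(text):
        record = match.groupdict()
        # number2 is the only unquoted field; \d* may match nothing
        record['number2'] = int(record['number2']) if record['number2'] else None
        records.append(record)
    return records

# Hypothetical file contents: the record surrounded by HTML noise
sample = (
    '<div>noise</div>'
    '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",'
    '1409770337,"com.youtube.www/watch?v=p1JPKLa-Ofc:https","es"]'
    '<p>more noise</p>'
)
for record in extract_records(sample):
    print(record['name'], record['number2'], record['language'])
```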
Upvotes: 1
Reputation: 643
You can 1) open the file, 2) read it line by line (for example with readline), and 3) use the split() function to split each line on "," and then use the resulting list however you want.
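A rough sketch of those three steps, assuming each record sits on its own line. split_fields is a hypothetical helper, io.StringIO stands in for the real file, and the record is a shortened, made-up version of the one in the question. Every value comes back as a string, and a field that itself contains a comma would be split in two, so this only fits simple data.

```python
import io

def split_fields(line):
    """Split one '[...]' line on commas; return None for any other line."""
    line = line.strip()
    if not (line.startswith('[') and line.endswith(']')):
        return None
    # Drop the brackets, split on commas, then strip the surrounding quotes
    return [part.strip('"') for part in line[1:-1].split(',')]

# Stand-in for open('yourfile.txt'); iterating yields one line at a time
fake_file = io.StringIO(
    '<html>\n'
    '["kiarix moreno","116224357500406255237",1409770337,"es"]\n'
    '</html>\n'
)
rows = [fields for fields in map(split_fields, fake_file) if fields is not None]
print(rows)
```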
Upvotes: 0
Reputation: 67968
"([^"]*")|(\d+)
You can try this. Grab the matches. See the demo.
http://regex101.com/r/dK1xR4/5
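As written, group 1 also captures the closing quote. A possible way to use the idea from Python is sketched below with that quote moved outside the group, which is an assumption about the intent. extract_fields and FIELD are illustrative names. Be aware that the pattern matches any quoted string or digit run, so surrounding HTML (attribute values, for instance) would add extra matches.

```python
import re

# Variation on the pattern above: closing quote moved outside group 1
FIELD = re.compile(r'"([^"]*)"|(\d+)')

def extract_fields(text):
    """Collect quoted strings as str and bare digit runs as int."""
    fields = []
    for match in FIELD.finditer(text):
        quoted, number = match.groups()
        # Exactly one of the two groups is set for each match
        fields.append(quoted if quoted is not None else int(number))
    return fields

# Raw string, so \u003d stays as literal text, as it would in the file
sample = r'["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]'
print(extract_fields(sample))
```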
Upvotes: 0