phcaze

Reputation: 1777

Regex to extract multiple fields from pattern

I have a pattern like this in a txt file:

["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]

and I need a regex to extract each field in Python. Every field can contain any character (not only alphanumeric), except the 4th, which is a long number. How can I do it? Many thanks.

EDIT: the file also contains other HTML elements, which is why I can't parse it directly into a Python list.

Upvotes: 1

Views: 1506

Answers (4)

Anzel

Reputation: 20553

I would combine re, ast.literal_eval and try/except, and read the whole file in one go rather than with readline, so that every bracketed chunk in the file is tried. Anything that is not a valid Python literal is simply skipped.

Here is my solution:

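import re
import ast

# Python 3. Read the whole file, then grab every bracketed candidate.
# '.' does not match newlines, so each candidate sits on a single line.
with open('yourfile.txt', 'r') as f:
    found = re.findall(r'\[.*\]', f.read())

for each in found:
    try:
        for el in ast.literal_eval(each):
            print(el)
    except (SyntaxError, ValueError):
        # not a valid Python literal (e.g. HTML that happens to contain brackets)
        pass

Output (literal_eval decodes \u003d to =):

kiarix moreno
116224357500406255237
z120gbkosz2oc3ckv23bc10hhwrudlcjy04
1409770337
com.youtube.www/watch?v=p1JPKLa-Ofc:https
es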

Upvotes: 0

Noctis Skytower

Reputation: 22001

The following provides three different options for getting your data:

>>> TEXT = '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]'
>>> import json, ast, re
>>> json.loads(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> ast.literal_eval(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> re.search(r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]', TEXT).groupdict()
{'website': 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'number2': '1409770337', 'language': 'es', 'data': 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 'number1': '116224357500406255237', 'name': 'kiarix moreno'}
>>> 

In particular, your regular expression would be the following: r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]'
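
As a rough sketch of how that pattern might be applied to a whole file, the snippet below compiles it once and uses finditer to pull every matching record out of text that also contains other markup. The sample text and the names RECORD and extract_records are made up for illustration, and the surrounding HTML is only an assumption about what the file looks like.

```python
import re

# Same pattern as in the answer, compiled once and split over two lines.
RECORD = re.compile(
    r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",'
    r'(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]'
)

def extract_records(text):
    """Return one dict of named fields per record found in text."""
    return [m.groupdict() for m in RECORD.finditer(text)]

# Hypothetical sample: one record embedded in unrelated HTML.
sample = (
    '<div class="c">hello</div>\n'
    '<script>var d = ["kiarix moreno","116224357500406255237",'
    '"z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,'
    '"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"];</script>\n'
)

records = extract_records(sample)
print(records[0]["name"], records[0]["number2"])
```

Unlike json.loads or ast.literal_eval, the regex returns every field as a string and leaves escapes such as \u003d undecoded, so number2 still needs int() and the website field still contains the raw escape.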

Upvotes: 1

rahul tyagi

Reputation: 643

You can: 1) open the file; 2) read it line by line (iterate over the file object, or call readline()); 3) use split(",") to split each line on commas, then use the resulting list however you want.
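
Those three steps might look something like the sketch below. It reads from an in-memory io.StringIO instead of a real file so that it stays self-contained, and the helper name split_fields is made up. Bear in mind that a plain split on "," assumes no field contains a comma, which the question does not guarantee, and that the quotes and brackets have to be stripped by hand.

```python
import io

def split_fields(line):
    """Split one bracketed record on commas and strip quotes and brackets.

    Assumes no field contains a comma; such a field would be split in two.
    """
    inner = line.strip().lstrip('[').rstrip(']')
    return [part.strip('"') for part in inner.split(',')]

# Stand-in for open('yourfile.txt'): one unrelated line plus one record line.
fake_file = io.StringIO(
    '<p>some html</p>\n'
    '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",'
    '1409770337,"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]\n'
)

rows = []
for line in fake_file:                   # step 2: scan each line
    if line.startswith('["'):            # crude filter to skip non-record lines
        rows.append(split_fields(line))  # step 3: split on ","

print(rows[0])
```

Every field comes back as a string here, including the 4th, so it would still need int() if a number is wanted.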

Upvotes: 0

vks

Reputation: 67968

"([^"]*")|(\d+)

You can try this. Grab the matches. See the demo:

http://regex101.com/r/dK1xR4/5
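
For what it's worth, here is a minimal Python sketch of grabbing the matches. As written above, the first group also captures the closing quote, so this sketch uses the form "([^"]*)"|(\d+) with the quote moved outside the group. The name FIELD and the sample line are only for illustration.

```python
import re

# Group 1 is the contents of a quoted field, group 2 is a bare number.
FIELD = re.compile(r'"([^"]*)"|(\d+)')

line = ('["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",'
        '1409770337,"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]')

# Exactly one of the two groups takes part in each match.
fields = [m.group(1) if m.group(1) is not None else m.group(2)
          for m in FIELD.finditer(line)]
print(fields)
```

Run against raw HTML, this pattern would also pick up quoted attributes and stray digits, so it is probably best applied only after the bracketed list has been isolated.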

Upvotes: 0
