Reputation: 632
My log file contains the following input. I want to capture each ID in full, but my regular expression returns only part of it:
id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤
id:A2uhasan30hamwix160212145302428
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧
id:A2uhasan30hamwix160207145023750
I am using a regular expression with Python 2.7. I changed sid to id, going from:
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9._+]*))', re.U)
to
>>> RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)
>>> sid = RE_SID.search('id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤').group('sid')
>>> sid
'A2uhasan30hamwix'
and this is my result:
is: A2uhasan30hamwix
Edit: this is how I read the log file:
with open(cfg.log_file) as input_file: ...
fields = line.strip().split(' ')
and here is an example line from the log:
2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok provider:sipovvvv1.yv.vs
I would appreciate help fixing my regular expression.
Upvotes: 3
Views: 118
Reputation: 627101
Based on what we discussed in the chat, posting the solution:
import codecs
import re

RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)  # \d used to match non-ASCII digits, too
input_file = codecs.open(cfg.log_file, encoding='utf-8')  # Read the file with UTF8 encoding
for line in input_file:
    fields = line.strip().split(u' ')  # u prefix is important!
    if len(fields) >= 11:
        try:
            # ......
            sid = RE_SID.search(fields[7]).group('sid')  # Or check if there is a match first
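For comparison, here is a minimal sketch of the same idea in Python 3, which is an assumption since the question uses Python 2.7. In Python 3, str is already Unicode and \d matches non-ASCII digits by default, so neither the u prefix nor re.U is needed. The helper name extract_sid is hypothetical, and the sketch searches the whole line instead of fields[7].

```python
import re

# In Python 3, str patterns are Unicode-aware by default, so \d also
# matches Arabic-Indic digits. \b keeps "id:" from matching inside "sid:".
RE_SID = re.compile(r'\bid:(?:<<")?(?P<sid>[A-Za-z\d._+]+)')


def extract_sid(line):
    """Return the id captured from a log line, or None if there is no match."""
    match = RE_SID.search(line)
    return match.group('sid') if match else None


# Sample line taken from the question.
sample = ('2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO '
          'consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> '
          'id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok '
          'provider:sipovvvv1.yv.vs')

sid = extract_sid(sample)
print(sid)
```

When reading from a file in Python 3, opening it with open(path, encoding='utf-8') plays the role that codecs.open plays above.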
Upvotes: 1
Reputation: 588
string = '''
id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤
id:A2uhasan30hamwix160212145302428
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧
id:A2uhasan30hamwix160207145023750
'''
import re
reObj = re.compile(r'id:.*')
# A compiled pattern's findall() takes (string, pos, endpos), not flags.
# Passing re.DOTALL (16) as pos starts the search at index 16 and skips the first ID.
ans = reObj.findall(string)
print(ans)
Output (Python 3, wrapped here for readability):
['id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤',
'id:A2uhasan30hamwix160212145302428',
'id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١',
'id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢',
'id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧',
'id:A2uhasan30hamwix160207145023750']
Upvotes: 0
Reputation: 474031
Three things to fix:
1. id instead of sid
2. \d instead of 0-9, to also catch the Arabic-Indic digits
3. the sid named group
Fixed version:
id:(<<")?(?P<sid>[A-Za-z\d_.+]+)
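To illustrate the difference between 0-9 and \d, here is a small sketch that runs both character classes against one of the IDs from the question. It assumes Python 3, where str patterns treat \d as any Unicode decimal digit; in Python 2.7 the same behavior needs a unicode pattern, unicode input, and the re.U flag.

```python
import re

# Character class limited to ASCII digits, as in the original pattern.
ascii_only = re.compile(r'id:(<<")?(?P<sid>[A-Za-z0-9_.+]+)')
# Fixed version: \d also matches non-ASCII decimal digits.
unicode_digits = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d_.+]+)')

text = 'id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤'

short_sid = ascii_only.search(text).group('sid')
full_sid = unicode_digits.search(text).group('sid')

print(short_sid)  # stops before the first Arabic-Indic digit
print(full_sid)   # includes the trailing Arabic-Indic digits
```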
Upvotes: 1