Reputation: 106
I've been searching the internet for how to build a regular expression that captures text the way I need. I found some StackOverflow questions, but none of them covers what I want. If you have already seen something similar to my issue, please feel free to point me to it.
I tried to use recursion, but I couldn't get anything to work.
Some notes:
1) I can't use a parser program. The program that consumes this data captures it with regular expressions. It is a "general purpose" program that captures whatever data is needed, so all I have to do is supply a regular expression for the information it needs. I also need to keep things as compact as possible, so I can't use third-party or external programs.
2) The number of 'key': 'value' pairs can vary; it is not always the same. I believe that is what makes this difficult.
3) The program that will use this regex is written in Python 2.7.3. How it works: it reads a JSON config file where I set up the command that produces the data I need, then I specify a regex that tells the program what to capture and what to do with the captured groups. That is why I can't use a parser. The program uses fabric to run the configured collectors (with their regexes) on remote hosts and gather all the data.
4) The program gathers data and posts it to a web server to produce metrics, graphs, monitoring alarms and so on.
I have been able to capture almost all the data I was planning to capture, but I got stuck when I tried to create a collector for this one.
The following data repeats exactly like below but with different server names, of course values will change too:
Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}
Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}
How I want to capture it:
Server: Omega-X
transfer_data: 0
factor_a: 0
slow: 0
factor_b: 0
score_retry: 0
damage_factor_c: 0
voice_ud: 0
alarm_factors_bl: 0
telemetry_x: 0
endstream: 0
celery: 0
awl: 0
prs: 0
score: 0
feature_factors_xf: 0
feature_factors_dc: 0
Server: Alfa-X
transfer_data: 0
factor_a: 0
slow: 0
factor_b: 0
score_retry: 0
damage_factor_c: 0
voice_ud: 0
alarm_factors_bl: 0
telemetry_x: 0
endstream: 0
celery: 0
awl: 0
prs: 0
score: 0
feature_factors_xf: 0
feature_factors_dc: 0
If only one server is shown, it is not so difficult. With the regex below I can capture everything except the server name:
'([a-z_]+)':\s'(\d+)'
This regex gives only the second part, the list of variables and values, but not the server name. So if the same output contains several servers with the same keys, it is impossible to know which server the values come from.
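As a minimal sketch of what that pattern returns with Python's re.findall (the variable names and the shortened sample line are made up for illustration, and the values are changed so the pairs are easy to tell apart):

```python
import re

# Shortened, hypothetical sample of one server's output.
sample = "celery.queue_length: {'transfer_data': '0', 'factor_a': '4', 'slow': '12'}"

# findall returns one (key, value) tuple per pair, however many pairs there are.
pair_re = re.compile(r"'([a-z_]+)':\s'(\d+)'")
pairs = pair_re.findall(sample)

for key, value in pairs:
    print("%s: %s" % (key, value))
```

Because each pair is a separate match, the varying number of pairs is not a problem here; the missing piece is only the link back to the server name.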
To add support for the server name I tried the following regex. It works, but it captures only the server name and the first pair of parameters:
Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*
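A short sketch of why this stops at the first pair, assuming the standard re module (the sample text is shortened, and the with_separator variant is only an illustration, not something from the collector program). The repeated group never consumes the ", " between pairs, so it repeats only once. Even with the separator added, a repeated group keeps only its last iteration, and the standard re module has no recursion syntax such as (?R).

```python
import re

# Shortened, hypothetical sample with distinct values.
sample = ("Server: Omega-X\n"
          "celery.queue_length: {'transfer_data': '0', 'factor_a': '1', 'slow': '2'}")

# The regex from the question: the group cannot get past the first ", ".
original = re.compile(
    r"Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*")
first = original.search(sample).groups()

# Illustrative variant that also consumes the separator between pairs.
# The group now repeats three times, but only the last repetition is kept.
with_separator = re.compile(
    r"Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s\{"
    r"(?:'([a-z_]+)':\s'(\d+)'(?:,\s)?)*")
last = with_separator.search(sample).groups()

print(first)
print(last)
```

So a single match cannot hand back every pair together with the server name; it takes either one match per pair or a second pass over the text between the braces.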
I have tried multiple recursion features, but I've failed to achieve what I want.
Can anyone point me in the right direction here?
Thanks.
Upvotes: 3
Views: 418
Reputation: 106
Thanks to both of you for kindly responding to my question. I think you both helped me reshape the way I see this issue.
My belief is that what I want to achieve here is very difficult for a single regex.
Given how hard it is to get the information in the shape I want, I was thinking about which approach would be easiest for me. I know I'm going against my own rules here, but I don't think there is another way to do this smoothly.
If I want to get regex groups like:
Server: Group 0
Key : Group 1
Value: Group 2
then the output I need should look like:
Regex Groups:
(0) (1) (2)
Server: Omega-X transfer_data: 0
Server: Omega-X factor_a: 0
Server: Omega-X slow: 0
Server: Omega-X factor_b: 0
Server: Omega-X score_retry: 0
Server: Omega-X damage_factor_c: 0
Server: Omega-X voice_ud: 0
Server: Omega-X alarm_factors_bl: 0
Server: Omega-X telemetry_x: 0
Server: Omega-X endstream: 0
Server: Omega-X celery: 0
Server: Omega-X awl: 0
Server: Omega-X prs: 0
Server: Omega-X score: 0
Server: Omega-X feature_factors_xf: 0
Server: Omega-X feature_factors_dc: 0
This way I can process any number of servers in the same output without any difficulty, using a very simple regex:
"Server:\s([a-zA-Z_.-]+)\s'([a-zA-Z_]+)':\s'(\d+)'"
So I think the best way to go is to add a pre-parser that prepares the data like this, and then process it.
Both of you helped me with this, much appreciated.
I guess I will close this question unless somebody else has a better idea :)
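A rough sketch of what such a pre-parser could look like, assuming the raw output always has the "Server:" line followed by the "celery.queue_length:" line. The function name flatten and the BLOCK_RE pattern are hypothetical. The flattened lines keep the single quotes around keys and values, because the simple regex above expects them.

```python
import re

# Hypothetical patterns: one for a whole server block, one for a single pair.
BLOCK_RE = re.compile(r"Server:\s*(\S+)\s+celery\.queue_length:\s*\{([^}]*)\}")
PAIR_RE = re.compile(r"'([a-z_]+)':\s'(\d+)'")
# The simple regex from this post, applied to the flattened lines.
FLAT_RE = re.compile(r"Server:\s([a-zA-Z_.-]+)\s'([a-zA-Z_]+)':\s'(\d+)'")


def flatten(raw):
    """Repeat the server name in front of every 'key': 'value' pair."""
    lines = []
    for server, body in BLOCK_RE.findall(raw):
        for key, value in PAIR_RE.findall(body):
            lines.append("Server: %s '%s': '%s'" % (server, key, value))
    return "\n".join(lines)


# Shortened, made-up sample with two servers.
raw = ("Server: Omega-X\n"
       "celery.queue_length: {'transfer_data': '0', 'slow': '3'}\n"
       "Server: Alfa-X\n"
       "celery.queue_length: {'transfer_data': '7', 'slow': '0'}\n")

flat = flatten(raw)
groups = FLAT_RE.findall(flat)
print(flat)
```

Each tuple in groups then carries the server, key and value together, which is the three-group shape described above.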
Upvotes: 0
Reputation: 15559
You can use ANTLR to define your grammar, which would be a better option than a regular expression: https://dzone.com/articles/antlr-4-with-python-2-detailed-example
If you want to use regular expressions, you can use the following. Please note that my code is in C#. These patterns rely on variable-length lookbehind, which .NET supports but Python's built-in re module does not, so in Python they would need to be rewritten, for example with capture groups.
string serverNamePattern = @"(?<=Server(\s)*:(\s))\s*[\w-]+";
string dataPattern = @"(?<=celery.queue_length[\s:]*{)[a-zA-Z0-9\s:\'_,]+";
string input =
// Newlines between the pieces keep "Omega-X" from running into "celery".
"Server: Omega-X\n" +
"celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}\n" +
"Server: Alfa-X\n" +
"celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}";
var serverNames = Regex.Matches(input, serverNamePattern);
var dataMatches = Regex.Matches(input, dataPattern);
Explanation:
+: one or more occurrence
\w: word character (letter, digit or underscore)
\s: white space
[]: define a character class
(?<=a)b: positive lookbehind, match b that comes after a
(?<=Server(\s)*:(\s))\s*[\w-]+: match word characters and - that come after Server: (any further leading white space is included in the match)
(?<=celery.queue_length[\s:]*{)[a-zA-Z0-9\s:\'_,]+: match a run of characters from [a-zA-Z0-9\s:'_,] that comes after celery.queue_length: {
Note that the input must have "Server: " before each server name. Also, this does not remove the single quotes from the data.
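For Python, a rough equivalent might look like the sketch below, assuming the standard re module is what the collector uses. Python's re rejects variable-width lookbehind, so the sketch uses capture groups instead. The variable names and the shortened sample are illustrative only.

```python
import re

# Shortened, made-up sample with two servers.
text = ("Server: Omega-X\n"
        "celery.queue_length: {'transfer_data': '0', 'factor_a': '0'}\n"
        "Server: Alfa-X\n"
        "celery.queue_length: {'transfer_data': '5', 'factor_a': '0'}")

# The .NET lookbehind pattern is variable-width, so re refuses to compile it.
try:
    re.compile(r"(?<=Server(\s)*:(\s))\s*[\w-]+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Capture groups replace the lookbehinds; findall returns the group contents.
server_names = re.findall(r"Server\s*:\s*([\w-]+)", text)
data_blocks = re.findall(r"celery\.queue_length[\s:]*\{([a-zA-Z0-9\s:'_,]+)", text)

print(lookbehind_ok)
print(server_names)
print(data_blocks)
```

As with the C# version, the two result lists are separate and have to be paired up by position, and the single quotes are still present in the data strings.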
Upvotes: 0
Reputation: 1158
Do you want key-value pairs? With Python I would use a dictionary.
Get the server name and the string containing the data:
Server: ([^\n]*)(?:[^{]*)\{(.*)\}
Build a dict from the string containing the data for each server.
With Python (you only need the import re statement):
import re
input = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}
Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}"""
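# Each match is a (server name, text between the braces) tuple.
for match in re.findall(r'Server: ([^\n]*)(?:[^{]*)\{(.*)\}', input):
    server = match[0]
    data = match[1]
    # Split the "'key': 'value'" items on commas and colons, then strip spaces and quotes.
    datadict = dict((k.strip().replace("'", ""), v.strip().replace("'", "")) for k, v in (item.split(':') for item in data.split(',')))
    datadict['serveur'] = server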
Then you can store each datadict (e.g. in a list) and use them as you want. You can cast the values from string to integer to manipulate them easily.
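A small sketch of that last step, storing one entry per server in a list and casting the values to int. Note that ast.literal_eval is swapped in here for the split-based parsing above; that only works because the text between the braces happens to be a valid Python dict literal. The function name collect and the 'server' and 'queues' keys are hypothetical.

```python
import ast
import re

# Shortened, made-up sample with two servers.
sample = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'slow': '3'}
Server: Alfa-X
celery.queue_length: {'transfer_data': '7', 'slow': '0'}"""


def collect(text):
    """Return a list with one dict per server, values cast to int."""
    results = []
    # Group 1 is the server name, group 2 the dict literal including braces.
    for server, literal in re.findall(r"Server: ([^\n]*)[^{]*(\{.*\})", text):
        raw = ast.literal_eval(literal)
        queues = dict((key, int(value)) for key, value in raw.items())
        results.append({'server': server, 'queues': queues})
    return results


servers = collect(sample)
for entry in servers:
    print("%s %d" % (entry['server'], sum(entry['queues'].values())))
```

Keeping the server name outside the queue dict avoids mixing it with the integer values, which makes sums and comparisons simpler.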
Upvotes: 1