Reputation: 213
I am trying to write a Python script that can parse the following type of log entry, which consists of keys and values. Each key may or may not have another nested set of keys and values. An example is below. The depth of the nesting can vary depending on the log I get, so the parsing has to be dynamic. Each level of nesting is, however, enclosed in braces.
The string of keys and values looks something like this:
Countries = {
    "USA" = 0;
    "Spain" = 0;
    Connections = 1;
    Flights = {
        "KLM" = 11;
        "Air America" = 15;
        "Emirates" = 2;
        "Delta" = 3;
    };
    "Belgium" = 1;
    "Czech Republic" = 0;
    "Netherlands" = 1;
    "Hungary" = 0;
    "Luxembourg" = 0;
    "Italy" = 0;
};
The data above can have multiple levels of nesting as well. I would like to write a function that parses this and puts it into an array of data (or similar), so that I can get the value of a specific key like:
print countries.belgium
value should be printed as 1
likewise,
print countries.flights.delta
value should be printed as 3.
Note that the input doesn't need to have quotes around all the keys (like Connections or Flights).
Any pointers on where to start? Are there any Python libraries that already do parsing like this?
Upvotes: 0
Views: 1213
Reputation: 671
I have created a sample Python script that will do the job; just tweak it as you like. It converts your format into a nested dict, and it is as dynamic as you like.
Take a look at the code here (also on Pastebin):
import ast

data = """ { Countries = { USA = 1; "Connections" = { "1 Flights" = 0; "10 Flights" = 0; "11 Flights" = 0; "12 Flights" = 0; "13 Flights" = 0; "14 Flights" = 0; "15 Flights" = 0; "16 Flights" = 0; "17 Flights" = 0; "18 Flights" = 0; "More than 25 Flights" = 0; }; "Single Connections" = 0; "No Connections" = 0; "Delayed" = 0; "Technical Fault" = 0; "Others" = 0; }; }"""


def arrify(string):
    # Turn "key = value;" into "key : value ," so the text resembles a dict literal.
    string = string.replace("=", " : ")
    string = string.replace(";", " , ")
    string = string.replace("\"", "")
    stringDict = string.split()
    newArr = []
    quoteClosed = True
    # The first token is assumed to be the opening brace and is copied as-is.
    for i, splitStr in enumerate(stringDict):
        if i > 0:
            if not isDelim(splitStr):
                # Open a quote on the first word after a delimiter ...
                if isDelim(newArr[i-1]) and quoteClosed:
                    splitStr = "\"" + splitStr
                    quoteClosed = False
                # ... and close it on the last word before the next delimiter.
                if isDelim(stringDict[i+1]) and not quoteClosed:
                    splitStr += "\""
                    quoteClosed = True
        newArr.append(splitStr)
    newString = " ".join(newArr)
    newDict = ast.literal_eval(newString)
    return normalizeDict(newDict)


def isDelim(string):
    return string in ("{", ":", ",", "}")


def normalizeDict(dic):
    # Every key and value was quoted above, so convert numeric values back to int.
    for key, value in dic.items():
        if type(value) is dict:
            dic[key] = normalizeDict(value)
            continue
        dic[key] = normalize(value)
    return dic


def normalize(string):
    try:
        return int(string)
    except (TypeError, ValueError):
        return string


print(arrify(data))
The result from your sample data:
{'Countries': {'USA': 1, 'Technical Fault': 0, 'No Connections': 0, 'Delayed': 0, 'Connections': {'17 Flights': 0, '10 Flights': 0, '11 Flights': 0, 'More than 25 Flights': 0, '14 Flights': 0, '15 Flights': 0, '12 Flights': 0, '18 Flights': 0, '16 Flights': 0, '1 Flights': 0, '13 Flights': 0}, 'Single Connections': 0, 'Others': 0}}
You can then get values from it like you would with a normal dict :) Hope it helps.
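Since the question asks for access like countries.flights.delta, here is a minimal sketch of a wrapper around an already parsed nested dict. AttrView is a made-up name for this example, the code assumes Python 3, and the lower-case key matching is my assumption based on the lower-case names used in the question.

```python
class AttrView:
    """Hypothetical read-only wrapper: attribute-style access to a nested dict."""

    def __init__(self, mapping):
        # Index keys by lower-case name so that view.delta finds "Delta".
        self._items = {str(key).lower(): value for key, value in mapping.items()}

    def __getattr__(self, name):
        # Only called when normal lookup fails; never resolve private names
        # here, otherwise a missing _items would recurse forever.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._items[name.lower()]
        except KeyError:
            raise AttributeError(name)
        # Wrap nested dicts so that chained access keeps working.
        return AttrView(value) if isinstance(value, dict) else value

    def __getitem__(self, key):
        # Keys containing spaces ("Czech Republic") cannot be attribute names.
        return getattr(self, key)


parsed = {"Countries": {"Belgium": 1, "Czech Republic": 0, "Flights": {"Delta": 3}}}
countries = AttrView(parsed).countries
print(countries.belgium)             # 1
print(countries.flights.delta)       # 3
print(countries["Czech Republic"])   # 0
```

Two keys that differ only in case would collide in this sketch, so plain dict indexing is the safer choice if that can happen in your logs.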
Upvotes: 1
Reputation: 910
Defining a class to process and store the information could give you something like this:
import re


class datastruct():
    def __init__(self, data_in):
        # Grab the body of the "Flights = { ... };" block.
        flights = re.findall(r'(?:Flights\s=\s*\{)([\s"A-Z=0-9;a-z]*)};', data_in)
        flight_dict = {}
        for flight in flights[0].split(';')[0:-1]:
            key, val = self.split_data(flight)
            flight_dict[key] = val
        # Every quoted key with a one- or two-digit value, flights included.
        countries = re.findall(r'("[A-Za-z]+\s?[A-Za-z]*"\s=\s[0-9]{1,2})', data_in)
        countries_dict = {}
        for country in countries:
            key, val = self.split_data(country)
            # Skip the entries that belong to the Flights block.
            if key not in flight_dict:
                countries_dict[key] = val
        connections = re.findall(r'(?:Connections\s=\s)([0-9]*);', data_in)
        self.country = countries_dict
        self.flight = flight_dict
        self.connections = int(connections[0])

    def split_data(self, data2):
        item = data2.split('=')
        key = item[0].strip().strip('"')
        val = int(item[1].strip())
        return key, val
Please note that the regexes may need tweaking if the data is not exactly as I've assumed below. The data could be set up and referenced as follows:
raw_data = 'Countries = { "USA" = 0; "Spain" = 0; Connections = 1; Flights = { "KLM" = 11; "Air America" = 15; "Emirates" = 2; "Delta" = 3; }; "Belgium" = 1; "Czech Republic" = 0; "Netherlands" = 1; "Hungary" = 0; "Luxembourg" = 0; "Italy" = 0;};'
flight_data = datastruct(raw_data)
print("No. Connections:",flight_data.connections)
print("Country 'USA':",flight_data.country['USA'],'\n')
print("Flight 'KLM':",flight_data.flight['KLM'],'\n')
for country in flight_data.country.keys():
    print("Country: {0} -> {1}".format(country,flight_data.country[country]))
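The regexes above are tied to a single Flights block at a fixed depth. If the nesting really is arbitrary, a different technique may fit better: split the text into tokens and parse them with a recursive function. The following is only a sketch of that idea, not part of the class above. The names tokenize, parse_block and parse_log are made up for this example, and it assumes Python 3, input shaped like the question's sample (top-level "key = value;" entries with no outer brace), bare keys without spaces, no escaped quotes inside quoted strings, and only non-negative integers as numeric values.

```python
import re

# One token per match: a quoted string, a brace, '=' or ';', or a bare word/number.
TOKEN = re.compile(r'"([^"]*)"|([{}=;])|([^\s{}=;"]+)')


def tokenize(text):
    for quoted, punct, bare in TOKEN.findall(text):
        if punct:
            yield ("punct", punct)
        elif bare:
            yield ("atom", bare)
        else:
            yield ("atom", quoted)  # quoted string, possibly empty


def parse_block(tokens):
    """Parse 'key = value;' entries until a closing brace or the end of input."""
    result = {}
    for kind, key in tokens:
        if (kind, key) == ("punct", "}"):
            break  # end of this nested block
        if (kind, key) == ("punct", ";"):
            continue  # separator after a value or a nested block
        if kind != "atom":
            raise ValueError("expected a key, got %r" % key)
        if next(tokens, None) != ("punct", "="):
            raise ValueError("expected '=' after key %r" % key)
        kind, value = next(tokens, (None, None))
        if (kind, value) == ("punct", "{"):
            # The same token iterator is shared, so the recursive call
            # consumes everything up to the matching closing brace.
            result[key] = parse_block(tokens)
        elif kind == "atom":
            result[key] = int(value) if value.isdigit() else value
        else:
            raise ValueError("expected a value for key %r" % key)
    return result


def parse_log(text):
    return parse_block(tokenize(text))


sample = 'Countries = { "USA" = 0; Connections = 1; Flights = { "KLM" = 11; "Delta" = 3; }; "Belgium" = 1; };'
parsed = parse_log(sample)
print(parsed["Countries"]["Flights"]["Delta"])  # 3
print(parsed["Countries"]["Belgium"])           # 1
```

Because the nested call reads from the same generator as its caller, the depth of the braces never has to be known in advance.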
Upvotes: 1
Reputation: 1002
Iterate over the data and check whether each element is another set of key-value pairs. If it is, call the function recursively. Something like this:
def parseNestedData(data):
    if isinstance(data, dict):
        # A dict is another level of nesting: recurse into each value.
        for k in data.keys():
            parseNestedData(data.get(k))
    else:
        # A leaf value: print it.
        print(data)
Output:
>>> Countries = {
"USA" : 0,
"Spain" : 0,
"Connections" : 1,
"Flights" : {
"KLM" : 11,
"Air America" : 15,
"Emirates" : 2,
"Delta" : 3,
},
"Belgium" : 1,
"Czech Republic" : 0,
"Netherlands" : 1,
"Hungary" : 0,
"Luxembourg" : 0,
"Italy" :0
};
>>> Countries
{'Connections': 1,
'Flights': {'KLM': 11, 'Air America': 15, 'Emirates': 2, 'Delta': 3},
'Netherlands': 1,
'Italy': 0,
'Czech Republic': 0,
'USA': 0,
'Belgium': 1,
'Hungary': 0,
'Luxembourg': 0, 'Spain': 0}
>>> parseNestedData(Countries)
1
11
15
2
3
1
0
0
0
1
0
0
0
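The printed values alone do not show which key they belong to. As a variation on the same recursion, here is a small sketch that yields each value together with its dotted key path. iter_paths is a made-up name for this example, and the code assumes Python 3 and string keys.

```python
def iter_paths(data, prefix=()):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in data.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Recurse into the nested block, carrying the path so far.
            yield from iter_paths(value, path)
        else:
            yield ".".join(path), value


countries = {"Connections": 1, "Flights": {"KLM": 11, "Delta": 3}, "Belgium": 1}
for path, value in sorted(iter_paths(countries)):
    print("%s = %s" % (path, value))
# Belgium = 1
# Connections = 1
# Flights.Delta = 3
# Flights.KLM = 11
```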
Upvotes: 1