user2772983

Reputation: 43

Quick parsing of large files in Python

I have a requirement to parse a large file (> 1 GB). The lines are in the following format.

2014-03-11- 00.02.46.149069 TBegin(EventId="XXXX",RequestId="Request",SrvName="service",TxTime="TransactionTime") ... ... End_TxTime New state for EntityId = 'XXXX' new state set to 'DONE' EventId = 'XXXX' RequestId = Request

I have to perform two operations: 1) parse the file for a specific service and record the RequestId and the beginning TransactionTime, and 2) parse the file again based on RequestId and record the ending TransactionTime.

My code is provided below.

    import re
    import sys

    # `service` is assumed to be defined elsewhere (e.g. taken from the command line).
    requestId = {}
    request_arry = []
    start_time = {}
    end_time = {}
    i = 0
    f = open(sys.argv[2], "r")
    for line in f:
        # Record the RequestId and the beginning TxTime the first time a
        # request for this service is seen.
        searchObj1 = re.search(r'.*RequestId="(.*)",SrvName="%s.*TxTime="(.*)"\)' % service, line, re.M)
        if searchObj1:
            if searchObj1.group(1) not in requestId:
                requestId[searchObj1.group(1)] = i
                request_arry.append(searchObj1.group(1))
                start_time[searchObj1.group(1)] = searchObj1.group(2)
                i = i + 1
        # Record the ending timestamp (the first 26 characters of the line)
        # for requests that were already seen.
        searchObj2 = re.search(r'.*new state set to(.*).*RequestId = \'(.{16}).*', line, re.M)
        if searchObj2:
            if searchObj2.group(2) in requestId:
                end_time[searchObj2.group(2)] = line[:26]
    f.close()

The above code works fine, but it takes 20 minutes to parse 1 GB of data. Is there any method to make this faster? If I can get this result in half the time it would be really helpful. Kindly advise.

Upvotes: 0

Views: 121

Answers (1)

Ramchandra Apte

Reputation: 4079

    re.search(r'.*RequestId="(.*)",SrvName="%s.*TxTime="(.*)"\)' % service,line,re.M)

Here, if service keeps changing, it might be better to capture that part of the line with a group and then, after matching, check whether the group is equal to service, so that Python doesn't have to compile a new regex every time.
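A minimal sketch of that idea, assuming service and f are set up as in the question. The pattern is compiled once outside the loop, SrvName is captured with a group, and the captured value is checked afterwards; since the original pattern used service as a prefix of SrvName, startswith preserves that behaviour:

    import re

    # Compiled once: the pattern text never changes, even if `service` does,
    # so re does not have to compile (or look up) a new regex per call.
    pattern = re.compile(r'RequestId="(.*)",SrvName="([^"]*)".*TxTime="(.*)"\)')

    for line in f:
        m = pattern.search(line)
        # A plain string comparison on the captured SrvName replaces the
        # interpolated %s in the original pattern.
        if m and m.group(2).startswith(service):
            request_id, tx_time = m.group(1), m.group(3)

Checking a captured group with an ordinary string comparison is cheap, whereas rebuilding the pattern with % service forces a fresh compile (or at least a cache lookup) on every call.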

Use i += 1 rather than i = i + 1 (this might be a micro-optimization, but it's cleaner code anyway).

Upvotes: 2
