stanleyerror

Reputation: 788

Optimize regex and file-reading operations on big files in Python

Right now I have two large files, a pattern file and a log file, each with over 300,000 lines. The pattern file has this format:

Line 1 : <ID>   <Dialog1>    <ReplyStr>    <Dialog2>    
// the ReplyStr is needed as a pattern

The log file has this format:

Line 1 : <LogData>    <ReplyStr>    <CommentOfReply>   
// get all CommentOfReply, whose ReplyStr is from the pattern file

My task is to collect all comments on specific replies, in order to analyze users' emotional reactions to those replies. This is what I do, step by step:

  1. extract all patterns and all log entries, both using regex,
  2. then match them against each other with string comparison.

I need to optimize the code; at the moment it takes 8 hours to finish.

The profile follows (using cProfile on the first 10 loops):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   19.345   19.345 <string>:1(<module>)
        1    7.275    7.275   19.345   19.345 get_candidate2.py:12(foo)
  3331494    2.239    0.000   10.772    0.000 re.py:139(search)
  3331496    4.314    0.000    5.293    0.000 re.py:226(_compile)
      7/2    0.000    0.000    0.000    0.000 sre_compile.py:32(_compile)
                            ......
  3331507    0.632    0.000    0.632    0.000 {method 'get' of 'dict' objects}
  3331260    0.560    0.000    0.560    0.000 {method 'group' of '_sre.SRE_Match' objects}
        2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'remove' of 'list' objects}
  3331494    3.241    0.000    3.241    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
        9    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
  6662529    0.737    0.000    0.737    0.000 {method 'strip' of 'str' objects}

From the profile, it seems most of the time is spent in re.search(). I have no idea how to reduce it.
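One thing the profile above suggests: every call to re.search() passes through re.py:226(_compile), which accounts for about 5.3 of the 19.3 seconds. A minimal sketch of compiling the pattern once and reusing it follows. The regex and the name extract_reply are hypothetical, and the sketch assumes tab-separated fields with ReplyStr as the second field of a log line, which the question does not state.

```python
import re

# Hypothetical pattern: assumes tab-separated fields, ReplyStr in second position.
# Compiled once at import time instead of being looked up on every call.
LOG_LINE = re.compile(r'^[^\t]*\t([^\t]*)\t')

def extract_reply(line):
    """Return the second tab-separated field, or None if the line does not match."""
    match = LOG_LINE.search(line)
    return match.group(1).strip() if match else None
```

This only removes the per-call cache lookup; it does not reduce the number of searches, so on its own it is unlikely to be the whole fix.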

Upvotes: 2

Views: 578

Answers (1)

stanleyerror

Reputation: 788

Thanks to the help from @MikeSatteson and @tobias_k, I figured it out.

To pick out all the comment strings (from the log file) that correspond to the given reply strings (from the pattern file), the solution is:

  1. Create a dict whose keys are reply strings and whose values are lists of comment strings.
  2. Extract every reply string from the pattern file and add it as a key of the dict.
  3. Extract every reply-comment pair from the log file; if the reply is a key of the dict, append the comment to that reply's list.

This replaces the repeated regex search with one hash lookup per log line, so each file is read only once.

Here is the code:

# Map each reply string to the list of comments made on it.
my_dict = {}
with open('pattern file', 'r') as pattern_file:
    for line in pattern_file:
        reply = get_reply(line)
        my_dict[reply] = []

# Scan the log once; testing membership in a dict is a hash lookup, not a search.
with open('log file', 'r') as log_file:
    for line in log_file:
        pair = get_comment_reply_pair(line)
        if pair.reply in my_dict:
            my_dict[pair.reply].append(pair.comment)
Upvotes: 2
