Matan D

Reputation: 141

Censoring strings in file using regex in python

I use Python to open files and replace text that matches specific regular expression patterns. I have a list of files and a list of patterns/strings that need to be censored.

Currently, I iterate over every file and every line in it, check for a match against the patterns, and replace any match:

import fileinput
import re

# Removing all IPs and patterns
for logPath in createdLogs:
    file = fileinput.FileInput(logPath, inplace=True)
    for line in file:
        line = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",
                      "XXX.XXX.XXX.XXX",
                      line.rstrip())  # Censoring IPs
        for pattern in patterns:
            line = re.sub(pattern, "HIDDEN-TEXT", line)  # Censoring other patterns
        print(line)  # with inplace=True, stdout is redirected into the file
    file.close()

The problem is efficiency. This code takes a lot of time to run when iterating more than 5 files (around 15-20).

Any recommendation for a more efficient way to do the same process?

Upvotes: 0

Views: 318

Answers (2)

Regexr

Reputation: 56

You could try setting up a thread pool and using multithreading to improve performance. The link below does a good job of introducing the basics for Python:

http://chriskiehl.com/article/parallelism-in-one-line/

You could put the parallelism in multiple spots. You'll have to play with what works best for you. It might be more efficient to treat each file as a unique process, or it might be better to put it around the process of checking each line individually.

However, one final way you could speed this up is to avoid going line by line. Instead, apply the regular expression to the entire file contents and replace all matches at once. @Idlehands' answer gives more detail on this.
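As a rough sketch of the "one task per file" option, the snippet below maps a per-file function over a thread pool, in the style of the linked article. The names censor_file, censor_files, PATTERNS and workers are made up for illustration, and the pattern table is only a placeholder for your own list.

```python
import re
from multiprocessing.dummy import Pool  # thread-backed pool with the multiprocessing.Pool API

# Hypothetical pattern table: (compiled regex, replacement) pairs.
PATTERNS = [
    (re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"), "XXX.XXX.XXX.XXX"),
]


def censor_file(path):
    """Read one file, censor its whole contents in memory, and write it back."""
    with open(path) as handle:
        text = handle.read()
    for regex, replacement in PATTERNS:
        text = regex.sub(replacement, text)
    with open(path, "w") as handle:
        handle.write(text)
    return path


def censor_files(paths, workers=4):
    """Censor several files concurrently, one pool task per file."""
    pool = Pool(workers)
    try:
        # map() blocks until every file is done and keeps the input order.
        return pool.map(censor_file, paths)
    finally:
        pool.close()
        pool.join()
```

Regex substitution is CPU-bound, so in CPython threads may only help with the file I/O part because of the GIL. Swapping the import for `from multiprocessing import Pool` runs the same code in separate processes instead. Whether either variant is faster for your logs is something to measure.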

Upvotes: 2

r.ook

Reputation: 13888

Might be just a marginal improvement:

import re

patterns = {r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}": "XXX.XXX.XXX.XXX",
            r"http://[\w.:]+": "HIDDEN-TEXT"}

for logPath in createdLogs:
    # Read the whole file into one string instead of going line by line.
    with open(logPath) as log:
        text = log.read()
    for pattern, new_text in patterns.items():
        text = re.sub(pattern, new_text, text)
    # Write the censored text back over the original file.
    with open(logPath, "w") as log:
        log.write(text)

Input:

foo 127.0.0.1 http://localhost:8080 bar
foo 192.168.1.1 http://my.router.com bar
foo 192.168.1.100 http://my.computer.net bar
foo 192.168.100.1 foo.bar bar
foo 255.255.255.0 default.gateway.dns bar
foo 172.217.0.228 www.google.com bar
foo 151.101.65.69 www.stackoverflow.com bar

Output:

foo XXX.XXX.XXX.XXX HIDDEN-TEXT bar
foo XXX.XXX.XXX.XXX HIDDEN-TEXT bar
foo XXX.XXX.XXX.XXX HIDDEN-TEXT bar
foo XXX.XXX.XXX.XXX foo.bar bar
foo XXX.XXX.XXX.XXX default.gateway.dns bar
foo XXX.XXX.XXX.XXX www.google.com bar
foo XXX.XXX.XXX.XXX www.stackoverflow.com bar

Changes:
1.) Pre-define all the patterns and replaced text in a dictionary.
2.) Iterate through all the patterns instead of censoring the IP separately.
3.) Instead of checking per line, do only one re.sub() per pattern, per file.

If anything, I think #3 is the key to reducing processing time.
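Taking #3 one step further, the patterns could be merged into a single compiled regex so that the text is scanned once rather than once per pattern. This is a variation on the approach above, not the same code: RULES, COMBINED, REPLACEMENTS and censor_text are illustrative names, and it assumes the individual patterns contain no named groups of their own. Results can differ from sequential substitution when patterns overlap, so treat it as a sketch to benchmark.

```python
import re

# (group name, pattern, replacement); the two patterns are the ones used above.
RULES = [
    ("ip", r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", "XXX.XXX.XXX.XXX"),
    ("url", r"http://[\w.:]+", "HIDDEN-TEXT"),
]

# One alternation of named groups, compiled once up front.
COMBINED = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern, _ in RULES)
)
REPLACEMENTS = {name: replacement for name, _, replacement in RULES}


def censor_text(text):
    """Replace every match with the replacement of the rule that matched."""
    # m.lastgroup is the name of the alternative that produced the match.
    return COMBINED.sub(lambda m: REPLACEMENTS[m.lastgroup], text)


print(censor_text("foo 127.0.0.1 http://localhost:8080 bar"))
# foo XXX.XXX.XXX.XXX HIDDEN-TEXT bar
```

Read each file with open(...).read(), pass the string through censor_text, and write the result back.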

Upvotes: 1
