MHibbin

Reputation: 1185

Python: Check one element in csv, use another to remove from second file

I am trying to get a script working that checks whether an IP exists in a lookup CSV file and, if it does, takes the third element of that row and removes it from another (second) file. Here is an extract of what I have:

for line in fileinput.input(hostsURLFileLoc,inplace =1):
        elements = open(hostsLookFileLoc, 'r').read().split(".").split("\n")
        first = elements[0].strip()
        third = elements[2].strip()
        if first == hostIP:
                if line != third:
                        print line.strip()

This obviously doesn't work. I have tried playing with a few options, and the code above is my latest (crazy) attempt.

I think the problem is that there are two input files open at once.

Any thoughts welcome,

Cheers

Upvotes: 0

Views: 1003

Answers (3)

Blckknght

Reputation: 104762

All right, even though I haven't got any response to my comment on the question, here's my shot at a general answer. If I've got something wrong, just say so and I'll edit to try to address the errors.

First, here are my assumptions. You have two files, whose names are stored in the HostsLookFileLoc and HostsURLFileLoc variables.

The file at HostsLookFileLoc is a CSV file, with an IP address in the third column of each row. Something like this:

HostsLookFile.csv:

blah,blah,192.168.1.1,whatever,stuff
spam,spam,82.94.164.162,eggs,spam
me,myself,127.0.0.1,and,I
...

The file at HostsURLFileLoc is a flat text file with one IP address per line, like so:

HostsURLFile.txt:

10.1.1.2
10.1.1.3
10.1.2.253
127.0.0.1
8.8.8.8
192.168.1.22
82.94.164.162
64.34.119.12
...

Your goal is to read and then rewrite the HostsURLFile.txt file, excluding all of the IP addresses that are found in the third column of a row in the CSV file. In the example lists above, localhost (127.0.0.1) and python.org (82.94.164.162) would be excluded, but the rest of the IPs in the list would remain.

Here's how I'd do it, in three steps:

  1. Read in the CSV file and parse it using the csv module to find the IP addresses. Stick them into a set.
  2. Open the flat file and read the IP addresses into a list, closing the file afterwards.
  3. Reopen the flat file and overwrite it with the loaded list of addresses, skipping any that are contained in the set from the first step.

Code:

import csv

def cleanURLFile(HostsLookFileLoc, HostsURLFileLoc):
    """
    Remove IP addresses from file at HostsURLFileLoc if they are in
    the third column of the file at HostsLookFileLoc.
    """
    with open(HostsLookFileLoc, "r") as hostsLookFile:
        reader = csv.reader(hostsLookFile)
        ipsToExclude = set(line[2].strip() for line in reader)

    with open(HostsURLFileLoc, "r") as hostsURLFile:
        ipList = [line.strip() for line in hostsURLFile]

    with open(HostsURLFileLoc, "w") as hostsURLFile: # truncates the file!
        hostsURLFile.write("\n".join(ip for ip in ipList
                                     if ip not in ipsToExclude))

This code is deliberately simple. There are a few things that could be improved, if they are important to your use case:

  • If something crashes the program during the rewriting step, HostsURLFile.txt may be clobbered. A safer way of rewriting (at least, on Unix-style systems) is to write to a temp file, then after the writing has finished (and the file has been closed), rename the temp file over the top of the old file. That way if the program crashes, you'll still have the original version or a completely written replacement, but never anything in between.
  • If the checking you needed to do was more complicated than set membership, I'd add an extra step between 2 and 3 to do the actual processing, then write the results out without further manipulation (other than adding newlines).
  • Speaking of newlines, if the flat file contains blank lines, each one will be passed through as an empty string in the list of IP addresses, which should be OK for this scenario (it won't be in the set of IPs to exclude, unless your CSV file has a messed up line), but might cause trouble if you were doing something more complicated. Note also that the rewritten file will not end with a newline, because join only puts newlines between the entries.
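The temp-file approach from the first bullet could look roughly like the sketch below. This is only an illustration under some assumptions: the function name rewrite_without and its parameters are made up for this example, it targets Python 3.3 or later (where os.replace is available), and it writes a trailing newline after the last entry.

```python
import os
import tempfile


def rewrite_without(path, ips_to_exclude):
    """Rewrite the file at `path`, dropping lines found in `ips_to_exclude`.

    Hypothetical helper for illustration; not part of the code above.
    """
    with open(path, "r") as src:
        stripped = [line.strip() for line in src]
    kept = [ip for ip in stripped if ip not in ips_to_exclude]

    # Create the temp file next to the target so that the final rename
    # stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as tmp:
            for ip in kept:
                tmp.write(ip + "\n")
        # Swap the finished file into place (os.replace needs Python 3.3+).
        os.replace(tmp_path, path)
    except BaseException:
        # Writing or renaming failed: the original file is untouched,
        # so just discard the partial temp file.
        os.remove(tmp_path)
        raise
```

Calling rewrite_without(HostsURLFileLoc, ipsToExclude) would then take the place of the final with block. One caveat: tempfile.mkstemp creates the file readable and writable only by the current user, so the replacement will not keep the original file's permissions unless you copy them over yourself.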

Upvotes: 5

algorowara

Reputation: 1720

Once you have found the values in the first file that need to be removed from the second file, I suggest something like this pseudocode:

Load first file into memory
Search string representing first file for matches using a regular expression
    (in Python, use regex.findall(string), where regex = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"); the r prefix makes this a raw string, so a single backslash before each dot is enough, whereas a plain string literal should use the double backslash)
Build up a list of all matches
Exit first file

Load second file into memory
Search string representing second file for the start index and end index of each match
For each match, use the expression string = string[:start_of_match] + string[end_of_match:]
Re-write the string representing the second (now trimmed) file to the second file

Essentially whenever you find a match, redefine the string to be the slices on either side of it, excluding it from the new string assignment. Then rewrite your string to a file.
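One possible reading of the pseudocode above, as a runnable Python 3 sketch. The names IP_PATTERN, find_ips and remove_ips are invented for this example, and the pattern is a loose dotted-quad match that does not check that each number is at most 255.

```python
import re

# Loose dotted-quad pattern (assumption: good enough for this data).
# The lookarounds stop it from matching inside a longer run of digits.
IP_PATTERN = re.compile(r"(?<![0-9])[0-9]{1,3}(?:\.[0-9]{1,3}){3}(?![0-9])")


def find_ips(text):
    """Return the set of dotted-quad substrings found in `text`."""
    return set(IP_PATTERN.findall(text))


def remove_ips(text, ips):
    """Cut every occurrence of the given IPs out of `text` by slicing."""
    # Walk the matches from last to first so that the start and end
    # indices of earlier matches stay valid after each cut.
    for match in reversed(list(IP_PATTERN.finditer(text))):
        if match.group(0) in ips:
            text = text[:match.start()] + text[match.end():]
    return text


first = "blah,blah,192.168.1.1,whatever\nme,myself,127.0.0.1,and,I\n"
second = "10.1.1.2\n127.0.0.1\n8.8.8.8\n"
trimmed = remove_ips(second, find_ips(first))
```

Note that slicing removes only the matched text, not the newline after it, so each removed address leaves an empty line behind in the second file. Also, this collects every IP-like string in the first file, not just those in one particular column.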

Upvotes: 0

Caleb Hattingh

Reputation: 9235

In test file test.csv (note there is an IP address in there):

'aajkwehfawe;fh192.168.0.1awefawrgaer'

(I am pretty much ignoring that it is CSV for now. I am just going to use regex matches.)

import re

# Get the file data
with open('test.csv', 'r') as f:
    data = f.read()

# Look for the IP. re.escape makes the dots literal, and the lookarounds
# reject neighbouring digits while still matching at the start or end of the data.
find_ip = '192.168.0.1'
m = re.search('(?<![0-9])({})(?![0-9])'.format(re.escape(find_ip)), data)
if m: # found!
    # this is weird, because you already know the value in find_ip, but anyway...
    ip = m.group(1).split('.')
    print('Third ip = ' + ip[2])
else:
    print('Did not find a match for {}'.format(find_ip))

I do not understand the second part of your question, i.e. removing the third value from a second file. Does the second file list numbers one per line, and do you want to delete the line that matches the number found above? If yes:

# Make a new list of lines that omits the matched one
new_lines = []
with open('iplist.txt', 'r') as f:
    for line in f:
        if line.strip() != ip[2]: # skip the matched line
            new_lines.append(line)

# Replace the file with the new list of lines. Each line still ends with
# its own newline, so write them back unchanged instead of joining with '\n'.
with open('iplist.txt', 'w') as f:
    f.writelines(new_lines)

Upvotes: 0
