SignalProcessed

Reputation: 371

Removing duplicates without set()

I have a .txt file of IPs, times, search queries, and websites accessed. I used a for loop to split each line into a list of its fields, then put all of those lists into a larger list.

When printed it may look like this...

['4.16.159.114', '08:13:37', 'french-english dictionary', 'humanities.uchicago.edu/forms_unrest/FR-ENG.html\n']
['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/\n']
['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload', 'www.ewebtribe.com/htmlhelp/uploading.htm\n']
['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/\n']
['4.16.189.59', '07:16:28', 'cgi perl tutorial', 'www.free-ed.net/fr03/lfc/course%20030207_01/\n']

Here is the code that gets me this far. It reads the data from the text file, puts it into a list, and then writes that list to another text file.

import io

f = io.open(r'C:\Users\Ryan Asher\Desktop\%23AlltheWeb_2001.txt', encoding="Latin-1")
p = io.open(r'C:\Users\Ryan Asher\Desktop\workfile.txt', 'w')

sweet = [] 

for line in f:
    x = line.split("     ")
    lbreak = x[0].split("\t")
    sweet.append(lbreak)

for item in sweet:
    p.write("%s\n" % item)

My issue is with the third element of each inner list in the sweet list, index [2], which is the search query (french-english dictionary, s.e.t.i., etc.). I do not want the same query to appear more than once in the 'sweet' list.

So where 'cgi perl tutorial' appears twice, I need to drop the second entry and keep only the first one in the sweet list.

I don't think I can use set for this, because I only want it to apply to the search queries at index [2]. I don't want to remove entries that merely share an IP or one of the other fields.

Upvotes: 2

Views: 130

Answers (2)

citaret

Reputation: 446

Keep a set of the queries you have already seen (lbreak[2]), and append a line only if its query is not yet in that set. Something like:

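sweet = []
seen = set()  # search queries that have already been kept

for line in f:
    x = line.split("     ")
    lbreak = x[0].split("\t")
    # lbreak[2] is the search query; keep only its first occurrence
    if lbreak[2] not in seen:
        sweet.append(lbreak)
        seen.add(lbreak[2])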
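
For anyone who wants to try the idea without the original data file, here is a minimal, self-contained sketch of the same seen-set technique. The helper name dedupe_by_field and the choice to skip rows that are too short to have the field are my own assumptions, not part of the question; the sample rows are the ones posted above.

```python
def dedupe_by_field(rows, index):
    """Return rows in order, keeping only the first row for each value at `index`.

    Hypothetical helper for illustration; the name is not from the question.
    """
    seen = set()
    unique = []
    for row in rows:
        if len(row) <= index:
            # Assumption: silently skip malformed rows that lack the field.
            continue
        key = row[index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Sample rows copied from the question.
rows = [
    ['4.16.159.114', '08:13:37', 'french-english dictionary',
     'humanities.uchicago.edu/forms_unrest/FR-ENG.html\n'],
    ['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/\n'],
    ['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload',
     'www.ewebtribe.com/htmlhelp/uploading.htm\n'],
    ['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/\n'],
    ['4.16.189.59', '07:16:28', 'cgi perl tutorial',
     'www.free-ed.net/fr03/lfc/course%20030207_01/\n'],
]

result = dedupe_by_field(rows, 2)
for row in result:
    print(row)
```

The set is only used to remember which queries have been seen; the rows themselves stay in a list, so the original order and the repeated IPs are left alone.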

Upvotes: 3

Eric Siegerman

Reputation: 11

Use a dict, with the query as the key and the entire list as the value. Something like this (untested):

sweet = {}

for line in f:
    ...
    query = lbreak[2]
    if query not in sweet:
        sweet[query] = lbreak

If you wanted the last instance of each query instead of the first, you could just drop the if and do the assignment unconditionally.
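
To make the first-versus-last difference concrete, here is a small runnable sketch of both variants. The function names first_per_query and last_per_query are hypothetical, and the rows are a shortened version of the sample in the question. It assumes Python 3.7 or later, where dicts preserve insertion order.

```python
def first_per_query(rows, index=2):
    """Keep the first row seen for each query (hypothetical helper name)."""
    kept = {}
    for row in rows:
        query = row[index]
        if query not in kept:
            kept[query] = row
    return list(kept.values())


def last_per_query(rows, index=2):
    """Keep the last row seen for each query by assigning unconditionally."""
    kept = {}
    for row in rows:
        kept[row[index]] = row  # later rows overwrite earlier ones
    return list(kept.values())


# Shortened sample based on the rows in the question.
rows = [
    ['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/\n'],
    ['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/\n'],
    ['4.16.189.59', '07:16:28', 'cgi perl tutorial',
     'www.free-ed.net/fr03/lfc/course%20030207_01/\n'],
]

first = first_per_query(rows)
last = last_per_query(rows)
print(first[1][1])
print(last[1][1])
```

One thing to be aware of with the unconditional version: overwriting a key does not move it, so each query keeps the position of its first occurrence even though the stored row is the last one.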

Upvotes: 1
