Reputation: 371
I have a .txt file of IPs, times, search queries, and websites accessed. I used a for loop to split each line into a list of fields, then appended each of those lists to a larger list.
When printed it may look like this...
['4.16.159.114', '08:13:37', 'french-english dictionary', 'humanities.uchicago.edu/forms_unrest/FR-ENG.html\n']
['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/\n']
['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload', 'www.ewebtribe.com/htmlhelp/uploading.htm\n']
['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/\n']
['4.16.189.59', '07:16:28', 'cgi perl tutorial', 'www.free-ed.net/fr03/lfc/course%20030207_01/\n']
My code for getting this far is below. It reads the data from a text file, puts it into a list, and then writes that list to another text file.
    import io
    f = io.open(r'C:\Users\Ryan Asher\Desktop\%23AlltheWeb_2001.txt', encoding="Latin-1")
    p = io.open(r'C:\Users\Ryan Asher\Desktop\workfile.txt', 'w')
    sweet = []
    for line in f:
        x = line.split(" ")
        lbreak = x[0].split("\t")
        sweet.append(lbreak)
    for item in sweet:
        p.write("%s\n" % item)
My issue is with the third element of each inner list in sweet, i.e. index [2], which is the search query (french-english dictionary, s.e.t.i., etc.). I do not want duplicate queries in the sweet list.
So where 'cgi perl tutorial' appears twice, I need to drop the later entry and keep only the first one in sweet.
I don't think I can use set directly for this, because I only want the uniqueness check to apply to the search query at index [2]. Rows that repeat the same IP, or any other field, should be kept.
Upvotes: 2
Views: 130
Reputation: 446
Add lbreak[2] to a set, and only append a line if its lbreak[2] is not already in the set. Something like:
    sweet = []
    seen = set()
    for line in f:
        x = line.split(" ")
        lbreak = x[0].split("\t")
        if lbreak[2] not in seen:
            sweet.append(lbreak)
            seen.add(lbreak[2])
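For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same idea. It makes a few assumptions that are not in the question: the sample rows are hard-coded in place of the file, the helper name dedupe_by_query is made up for illustration, each line is split on tabs only, and the trailing newline is stripped from the last field.

```python
# Sample rows copied from the question; they stand in for the real file.
lines = [
    "4.16.159.114\t08:13:37\tfrench-english dictionary\thumanities.uchicago.edu/forms_unrest/FR-ENG.html\n",
    "4.16.186.203\t00:13:54\ts.e.t.i.\twww.seti.net/\n",
    "4.16.189.59\t05:48:58\twhich is better http upload or ftp upload\twww.ewebtribe.com/htmlhelp/uploading.htm\n",
    "4.16.189.59\t06:50:49\tcgi perl tutorial\twww.cgi101.com/class/\n",
    "4.16.189.59\t07:16:28\tcgi perl tutorial\twww.free-ed.net/fr03/lfc/course%20030207_01/\n",
]


def dedupe_by_query(lines):
    """Keep only the first row seen for each search query (field index 2)."""
    sweet = []
    seen = set()  # queries already kept
    for line in lines:
        # Assumption: fields are tab-separated; drop the trailing newline.
        lbreak = line.rstrip("\n").split("\t")
        if len(lbreak) < 3:
            continue  # skip malformed rows that have no query field
        query = lbreak[2]
        if query not in seen:
            seen.add(query)
            sweet.append(lbreak)
    return sweet


result = dedupe_by_query(lines)
for row in result:
    print(row)
```

Because the set only tracks the query, rows that share an IP are still kept, and the original row order is preserved.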
Upvotes: 3
Reputation: 11
Use a dict, with the query as the key and the entire list as the value. Something like this (untested):
    sweet = {}
    for line in f:
        ...
        query = lbreak[2]
        if query not in sweet:
            sweet[query] = lbreak
If you wanted the last instance of each query instead of the first, you could just drop the if and do the assignment unconditionally.
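To make the difference between the two variants concrete, here is a small sketch. The rows are hard-coded and already split (an assumption, to avoid depending on the file), and the names first_seen and last_seen are just for illustration.

```python
# Rows already split into fields; values taken from the question's sample.
rows = [
    ["4.16.189.59", "06:50:49", "cgi perl tutorial", "www.cgi101.com/class/"],
    ["4.16.189.59", "07:16:28", "cgi perl tutorial", "www.free-ed.net/fr03/lfc/course%20030207_01/"],
    ["4.16.186.203", "00:13:54", "s.e.t.i.", "www.seti.net/"],
]

first_seen = {}  # query -> first row with that query
last_seen = {}   # query -> last row with that query

for lbreak in rows:
    query = lbreak[2]
    if query not in first_seen:
        first_seen[query] = lbreak  # keep the earliest row only
    last_seen[query] = lbreak       # unconditional: later rows overwrite

# Turn the dict values back into a list of rows.
first_rows = list(first_seen.values())
last_rows = list(last_seen.values())
```

On Python 3.7 and later, dicts keep insertion order, so first_rows comes out in the order the queries first appeared. In last_seen, each key also stays at the position where it was first inserted, even though its value is the later row.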
Upvotes: 1