Reputation: 8554
I have a file like this:
93.93.203.11|["['vmit.it', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'maurominnella.com']"]
168.144.9.16|["['iipmalumni.com','webdesignhostingindia.com', 'iipmstudents.in', 'iipmclubs.in']"]
195.211.72.88|["['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']"]
129.35.210.118|["['israelinnovation.co.il', 'watec-peru.com', 'bsacimeeting.org', 'wsava2015.com', 'picsmeeting.com']"]
I want to extract the domains in all the lists and add them to one set. Ultimately, I would like to have a file with each unique domain on one line. Here is the code I have written:
set_d = set()
f = open(file, 'r')
for line in f:
    line = line.strip('\n')
    ip, list = line.split('|')
    l = json.loads(list)
    for e in l:
        domain = e.split(',')
        set_d.add(domain)
print set_d
but it gives the below error:
set_d.add(domain)
TypeError: unhashable type: 'list'
Can anybody help me out?
Upvotes: 1
Views: 115
Reputation: 180411
Use str.translate to clean the text and add to the set using update:
set_d = set()
with open(file, 'r') as f:
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(None, "\"'[]").split(","))
        set_d.update(lst)
outputs a unique set of individual domains:
set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'watec-peru.com', 'bsacimeeting.org', 'webdesignhostingindia.com', 'wsava2015.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'iipmalumni.com', 'iipmclubs.in', 'israelinnovation.co.il'])
which you can write to a new file:
set_d = set()
with open(file,'r') as f,open("out.txt","w") as out:
for line in f:
lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
set_d.update(lst)
for line in set_d:
out.write("{}\n".format(line))
The output:
$ cat out.txt
vmit.it
tcmpraktijk-jingshen.nl
umbertominnella.it
studioguizzardi.it
telestreet.it
watec-peru.com
bsacimeeting.org
webdesignhostingindia.com
wsava2015.com
iipmstudents.in
maurominnella.com
ellen-siemer.nl
picsmeeting.com
iipmalumni.com
iipmclubs.in
israelinnovation.co.il
Your code will not separate the strings into individual domains, and the json call does not really help. Changing your code to use update would output something like the following:
{" 'maurominnella.com']", " 'wsava2015.com'", "'webdesignhostingindia.com'", " 'iipmclubs.in']", " 'ellen-siemer.nl'']", " 'umbertominnella.it'", " 'picsmeeting.com']", "['israelinnovation.co.il'", "['vmit.it'", " 'iipmstudents.in'", "['tcmpraktijk-jingshen.nl'", " 'studioguizzardi.it'", "['iipmalumni.com'", " 'watec-peru.com'", " 'bsacimeeting.org'", " 'telestreet.it'"}
Also, don't use list as a variable name; it shadows the built-in python list.
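Side note, in case you are on Python 3: str.translate no longer accepts None plus a string of characters to delete there, so a minimal sketch of the same idea would build a deletion table with str.maketrans:
import string  # not required, just shown for clarity that only str.maketrans is used

set_d = set()
with open(file, 'r') as f, open("out.txt", "w") as out:
    # str.maketrans("", "", chars) maps each listed character to None, i.e. deletes it
    table = str.maketrans("", "", "\"'[]")
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(table).split(","))
        set_d.update(lst)
    for domain in set_d:
        out.write("{}\n".format(domain))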
Upvotes: 1
Reputation: 107287
Since the result of the split function is a list (domain = e.split(',')) and lists are unhashable, you can't add them to a set. Instead, you can add those elements to your set with set.update(). But you don't need json here, as it doesn't separate your domains and doesn't give you the desired result; you can use ast.literal_eval to parse your list instead:
import ast

set_d = set()
f = open(file, 'r')
for line in f:
    line = line.strip('\n')
    ip, li = line.split('|')
    l = ast.literal_eval(ast.literal_eval(li)[0])
    for e in l:
        domain = e.split(',')
        set_d.update(domain)
print set_d
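For illustration, here is roughly what the two literal_eval calls do on a shortened, made-up sample string (the nesting mirrors the file format above):
>>> import ast
>>> li = '["[\'vmit.it\', \'umbertominnella.it\']"]'
>>> ast.literal_eval(li)                        # outer list holding one string
["['vmit.it', 'umbertominnella.it']"]
>>> ast.literal_eval(ast.literal_eval(li)[0])   # that string parsed into a real list
['vmit.it', 'umbertominnella.it']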
Note: don't use Python built-in functions or types as variable names!
And as a more efficient way, you can just use a regex to grab your domains:
import re

f = open(file, 'r').read()
print set(re.findall(r'[a-zA-Z\-]+\.[a-zA-Z]+', f))
result:
set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'israelinnovation.co', 'bsacimeeting.org', 'webdesignhostingindia.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'watec-peru.com', 'iipmalumni.com', 'iipmclubs.in'])
Upvotes: 0
Reputation: 52153
You should call update instead of add:
set_d.update(domain)
Example;
>>> set_d = {'a', 'b', 'c'}
>>> set_d.update(['c', 'd', 'e'])
>>> print set_d
set(['a', 'b', 'c', 'd', 'e'])
Upvotes: 1