UserYmY

Reputation: 8554

python: extract items of different lists and put them in one set

I have a file like this:

93.93.203.11|["['vmit.it', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'maurominnella.com']"]
168.144.9.16|["['iipmalumni.com','webdesignhostingindia.com', 'iipmstudents.in', 'iipmclubs.in']"]
195.211.72.88|["['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']"]
129.35.210.118|["['israelinnovation.co.il', 'watec-peru.com', 'bsacimeeting.org', 'wsava2015.com', 'picsmeeting.com']"]

I want to extract the domains in all the lists and add them to one set. Ultimately, I would like to have a file with each unique domain on one line. Here is the code I have written:

set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,list = line.split('|')
    l = json.loads(list)
    for e in l:
        domain = e.split(',')
        set_d.add(domain)
        print set_d

but it gives the below error:

    set_d.add(domain)
TypeError: unhashable type: 'list'

Can anybody help me out?

Upvotes: 1

Views: 115

Answers (3)

Padraic Cunningham

Reputation: 180411

Use str.translate to clean the text and add to the set using update:

set_d = set()
with open(file,'r') as f:
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
        set_d.update(lst)

outputs a unique set of individual domains:

set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'watec-peru.com', 'bsacimeeting.org', 'webdesignhostingindia.com', 'wsava2015.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'iipmalumni.com', 'iipmclubs.in', 'israelinnovation.co.il'])

which you can write to a new file:

set_d = set()
with open(file,'r') as f,open("out.txt","w") as out:
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
        set_d.update(lst)
    for line in set_d:
        out.write("{}\n".format(line))

The output:

$ cat out.txt 
vmit.it
tcmpraktijk-jingshen.nl
umbertominnella.it
studioguizzardi.it
telestreet.it
watec-peru.com
bsacimeeting.org
webdesignhostingindia.com
wsava2015.com
iipmstudents.in
maurominnella.com
ellen-siemer.nl
picsmeeting.com
iipmalumni.com
iipmclubs.in
israelinnovation.co.il

Your code will not separate the text into individual domains, and your json call does not really help. Changing your code to use update will output something like the following:

{" 'maurominnella.com']", " 'wsava2015.com'", "'webdesignhostingindia.com'", " 'iipmclubs.in']", " 'ellen-siemer.nl'']", " 'umbertominnella.it'", " 'picsmeeting.com']", "['israelinnovation.co.il'", "['vmit.it'", " 'iipmstudents.in'", "['tcmpraktijk-jingshen.nl'", " 'studioguizzardi.it'", "['iipmalumni.com'", " 'watec-peru.com'", " 'bsacimeeting.org'", " 'telestreet.it'"}

Also, don't use list as a variable name, as it shadows the built-in Python list type.
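The translate(None, ...) call above is Python 2 only. As a rough sketch of the same cleaning step on Python 3 (an assumption, since the thread itself uses Python 2), str.maketrans can build the deletion table. The names STRIP_CHARS and domains_from_lines are made up for illustration:

```python
# Python 3 sketch: str.maketrans builds the deletion table that
# Python 2's translate(None, chars) expressed directly.
STRIP_CHARS = str.maketrans("", "", "\"'[]")

def domains_from_lines(lines):
    """Collect unique domains from 'ip|[...]' lines (hypothetical helper)."""
    found = set()
    for line in lines:
        # Keep only the part after the first '|', then drop quotes and brackets.
        _, _, payload = line.partition("|")
        cleaned = payload.translate(STRIP_CHARS)
        # strip() removes spaces and the trailing newline; skip empty pieces.
        found.update(d.strip() for d in cleaned.split(",") if d.strip())
    return found

sample = [
    "195.211.72.88|[\"['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']\"]\n",
    "168.144.9.16|[\"['iipmalumni.com','webdesignhostingindia.com']\"]\n",
]
print(sorted(domains_from_lines(sample)))
```

Because this only deletes characters, it also copes with the stray extra quote after ellen-siemer.nl in the sample data.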

Upvotes: 1

Kasravnd

Reputation: 107287

The result of split (domain = e.split(',')) is a list, and lists are unhashable, so you can't add one to a set. Instead, add its elements to the set with set.update(). You also don't need json here: it doesn't separate the domains, so it doesn't give you the desired result. You can use ast.literal_eval to parse the nested list instead:

import ast
set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,li = line.split('|')
    l = ast.literal_eval(ast.literal_eval(li)[0])
    for e in l:
        domain = e.split(',')
        set_d.update(domain)
    print set_d

Note: don't use the names of Python built-in functions or types for your variables!
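One caveat with the double literal_eval: the third sample line ends with 'ellen-siemer.nl'', which is not a valid Python literal, so the inner call raises SyntaxError there. A minimal sketch of a tolerant variant, written for Python 3 as an assumption and using a hypothetical helper name parse_line, could look like this:

```python
import ast

def parse_line(line):
    """Return the domains on one 'ip|payload' line, or [] if it cannot be parsed."""
    _, _, payload = line.strip().partition("|")
    try:
        outer = ast.literal_eval(payload)        # a list holding one string
        return list(ast.literal_eval(outer[0]))  # that string is itself a list literal
    except (SyntaxError, ValueError, IndexError, TypeError):
        # Malformed payloads (such as the stray quote) are skipped here.
        return []

good = "93.93.203.11|[\"['vmit.it', 'umbertominnella.it']\"]\n"
bad = "195.211.72.88|[\"['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']\"]\n"

domains = set()
for line in (good, bad):
    domains.update(parse_line(line))
print(sorted(domains))
```

Skipping a malformed line loses its domains, so whether to skip, log, or fall back to a character-stripping approach is a design choice left open here.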

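As a more compact alternative, you can grab the domains with a regex. Note that this pattern ignores digits and stops after the first dot-separated suffix, so in the result below wsava2015.com is missing and israelinnovation.co.il is truncated to israelinnovation.co: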

import re
with open(file, 'r') as f:
    text = f.read()
print set(re.findall(r'[a-zA-Z\-]+\.[a-zA-Z]+', text))

result:

set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'israelinnovation.co', 'bsacimeeting.org', 'webdesignhostingindia.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'watec-peru.com', 'iipmalumni.com', 'iipmclubs.in'])
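The pattern above has no digits in its character class and matches only one dot-separated suffix. A possible variant, sketched for Python 3 under the assumption that every domain ends in an alphabetic label of at least two letters (which also keeps the IP addresses from matching), might be:

```python
import re

# Labels may contain letters, digits and hyphens; the final label must be
# alphabetic, so dotted IPv4 addresses such as 129.35.210.118 do not match.
DOMAIN_RE = re.compile(r"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}")

text = (
    "129.35.210.118|[\"['israelinnovation.co.il', 'wsava2015.com']\"]\n"
    "195.211.72.88|[\"['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']\"]\n"
)

domains = set(DOMAIN_RE.findall(text))
print(sorted(domains))
```

DOMAIN_RE is an illustrative name, and the pattern is a heuristic rather than a full domain-name validator.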

Upvotes: 0

Ozgur Vatansever

Reputation: 52153

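You should call update instead of add, because add expects a single hashable item while update takes an iterable and adds each of its elements: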

set_d.update(domain)

Example:

>>> set_d = {'a', 'b', 'c'}
>>> set_d.update(['c', 'd', 'e'])
>>> print(set_d)
{'a', 'b', 'c', 'd', 'e'}
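To see both behaviours side by side, here is a small Python 3 sketch (the variable names and the two sample domains are made up for illustration) that triggers the error from the question and then applies the fix:

```python
domains = set()
parts = "a.com,b.org".split(",")   # split always returns a list

error = None
try:
    domains.add(parts)             # a list is mutable, hence unhashable
except TypeError as exc:
    error = str(exc)

domains.update(parts)              # adds each element of the list instead

print(error)
print(sorted(domains))
```

The first print shows the TypeError message, and the second shows that update stored the two strings as separate set members.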

Upvotes: 1
