Reputation: 4850
I'm trying to delete duplicate entries from data that look like this:
name phone email website
Diane Grant Albrecht M.S.
Lannister G. Cersei M.A.T., CEP 111-222-3333 [email protected] www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111 [email protected] www.daManWithThePlan.com
Sam D. Man Ed.M.
Sam D. Man Ed.M. 111-222-333 [email protected] www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
So that it looks like this:
name phone email website
Diane Grant Albrecht M.S.
Lannister G. Cersei M.A.T., CEP 111-222-3333 [email protected] www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111, 111-222-333 [email protected] www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
Here's my code:
from collections import defaultdict
import csv
import re
input = open('ieca_first_col_fake_text.txt', 'rU')
# default to empty set for phone, email, website, area, degrees
extracted_data = defaultdict(lambda: [set(), set(), set()])
for row in input:
    for index, value in enumerate(row):
        name = row[0]
        data = extracted_data[name].add(row)
for row in data: print row
I get this error:
AttributeError: 'list' object has no attribute 'add'
UPDATE:
from collections import defaultdict
import csv
import re
input = open('ieca_first_col_fake_text.txt', 'rU')
input_r = csv.reader(input, delimiter = '\t')
# default to empty set for phone, email, website, area, degrees
extracted_data = defaultdict(lambda: [set(), set(), set()])
data = []
# Index on the name and then for that name add the rest of the information.
for row in input_r:
    data_set = extracted_data[row[0]]
    for index, value in enumerate(row[1:]):
        data_set[index].add(value)
print data_set
output:
[set(['']), set(['']), set([''])]
Upvotes: 0
Views: 2256
Reputation: 1121854
The extracted_data values are lists of 3 sets each:
extracted_data = defaultdict(lambda: [set(), set(), set()])
You need to read the previous answer more closely and pick the right set to call .add() on.
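As a minimal sketch of the difference (written for Python 3, whereas the code in the question is Python 2; the name and phone numbers are taken from the sample data, and the index-to-column mapping is an assumption):

```python
from collections import defaultdict

# Each value is a list holding three sets (assumed order: phone, email, website).
extracted_data = defaultdict(lambda: [set(), set(), set()])

entry = extracted_data["Sam D. Man Ed.M."]

# The list itself has no .add(); calling it raises AttributeError.
try:
    entry.add("000-000-1111")
except AttributeError as exc:
    error_message = str(exc)

# Pick one of the sets inside the list instead (index 0 is the phone set here).
entry[0].add("000-000-1111")
entry[0].add("111-222-333")
```

The list is only a container; the .add() method lives on each of the three sets it holds.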
The previous answer loops over 4 elements in your input line, uses the first element to find the list of sets, and adds each of the other 3 elements to those sets:
for index, value in enumerate(split(entry)):
    if index == 0:
        data_set = extracted_data[name]
    elif value:
        data_set[index - 1].add(value)
Personally, I'd use:
entry = entry.split()  # split on whitespace
for value, dset in zip(entry[1:], extracted_data[entry[0]]):
    dset.add(value)
to achieve the same thing.
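Putting the pieces together, here is a rough end-to-end sketch of the zip approach, written for Python 3 and assuming the input is tab-delimited as in the question's update. The in-memory sample and the merge_rows helper are hypothetical stand-ins, not part of the original code; the email column is left empty because the addresses are obscured in the question. Empty fields are skipped so that '' never ends up in a set.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the tab-delimited input file.
sample = (
    "name\tphone\temail\twebsite\n"
    "Diane Grant Albrecht M.S.\t\t\t\n"
    "Sam D. Man Ed.M.\t000-000-1111\t\twww.daManWithThePlan.com\n"
    "Sam D. Man Ed.M.\t\t\t\n"
    "Sam D. Man Ed.M.\t111-222-333\t\twww.daManWithThePlan.com\n"
)


def merge_rows(lines):
    """Merge rows that share a name; return (header, merged rows)."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    # One set per non-name column; dicts keep first-seen name order.
    merged = defaultdict(lambda: [set() for _ in header[1:]])
    for row in reader:
        if not row:
            continue  # ignore blank lines
        for value, dset in zip(row[1:], merged[row[0]]):
            if value:  # skip empty fields
                dset.add(value)
    # Join each set into one comma-separated cell, sorted for stable output.
    rows = [[name] + [", ".join(sorted(s)) for s in sets]
            for name, sets in merged.items()]
    return header, rows


header, rows = merge_rows(io.StringIO(sample))
for row in [header] + rows:
    print("\t".join(row))
```

Sorting the joined values is a design choice to make the output deterministic, since sets do not preserve insertion order.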
Upvotes: 3