How to find and order duplicates in data sets

Question

I have two data sets which I read in as pandas DataFrames. Let's call them set 1 and set 2. Both contain text documents. Some of the text documents in set 1 also occur in set 2, and I am looking for a way to find those duplicates. I first thought about using Python sets, whose intersection would give me all the elements the two data sets have in common.

set_1 = set(set_1)
set_2 = set(set_2)
duplicates = set_1.intersection(set_2)

However, there is one extra thing I need to do. The duplicates should be in the order of set 2. Why? Well, set 1 has a bunch of data examples and labels which I use as a training set, and set 2 is my test set. If a given test example is a duplicate, I want to assign the labels from the same example in set 1 instead of predicting the label.

Do you guys have an idea on how I could do this? In pseudocode:

duplicates = set_1.intersection(set_2)
for example in set_2:
    if example in duplicates:
        assign labels from set_1 to example
    else:
        predict the labels

Edit

Since the first part of my question might be confusing: the pseudocode is what I am really looking for. So if you find my explanation of the problem above confusing, just have a look at my pseudocode for a summary of what I want to achieve:

The pseudocode

    duplicates = set_1.intersection(set_2)
    for example in set_2:
        if example in duplicates:
            assign labels from set_1 to example
        else:
            predict the labels

Unfortunately, I have to go now so I cannot respond to any comments right away, but when I get back I'll respond.

Update:

This is part of my actual code: first I read in the train and test sets as pandas data frame objects and convert them to numpy arrays so that I am able to access the individual columns.

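import os
import numpy as np
import pandas as pd

train = pd.read_csv(os.path.join(dir, "Train.csv"))
test = pd.read_csv(os.path.join(dir, "Test.csv"))

# getting the train and test sets and the labels
# (labels must be read before train is overwritten with its text column)
labels = np.array(train)[:, 3]
train = np.array(train)[:, 2]
test = np.array(test)[:, 2]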

My idea was to obtain a list of duplicates to check whether a test example is a duplicate, so I convert the train and test sets into sets to get the duplicates.

train = set(train)
test = set(test)
duplicates = train.intersection(test)

From this point I am not sure how I should proceed. My goal is to assign labels to the duplicate samples, and those labels should come from the train set. All other samples should get labels assigned by my estimator (machine learning algorithm).

So in short: once again, the data I am working with are text documents. Some text documents occur in both the train and the test set, and my train set has a label assigned to every example. For every duplicate in my test set I need to find the matching example in my train set and, more precisely, its corresponding label, and assign that label to the test example. All the non-duplicates in my test set should be predicted by my machine learning algorithm.

Roberto · Accepted Answer

Ok, I have edited, see if this is more or less what you need:

set_1 = [["yes", 1], ["maybe", 1], ["never", 0], ["nopes", 0], ["si", 1]]
set_2 = ["of course", "yes", "always", "never", "no way", "no"]

def predict_label(item):
    return 2 # just to check which items got predicted

dset_1 = dict(set_1)

labeled_set_2 = [[item, dset_1.get(item, predict_label(item))] for item in set_2]
print(labeled_set_2)

This will preserve the order in set_2 as you requested. But check if my assumptions on the structures of set_1 and set_2 are right.

It gives this result:

[['of course', 2], ['yes', 1], ['always', 2], ['never', 0], ['no way', 2], ['no', 2]]

This list comprehension creates a new list of pairs (lists here, but you can use tuples if you like). The key idea is to make a dictionary out of set_1, so you can use the dictionary's get method to look each item up. With get, if the key is not present, the result defaults to the value returned by predict_label(item). So the list comprehension goes through all the items in set_2 and checks whether they exist as keys in the dictionary. If they do, the second item in the pair is the dictionary value for that item. If they do not, the second item is the value returned by predict_label(item).
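One detail worth checking if the predictor is expensive: Python evaluates the arguments of get before making the call, so predict_label(item) runs for every item, duplicates included, and its result is simply discarded when the key exists. The sketch below counts the calls to make this visible. The names predict_label_counted and calls are made up for this illustration and are not part of the answer's code.

```python
set_1 = [["yes", 1], ["maybe", 1], ["never", 0], ["nopes", 0], ["si", 1]]
set_2 = ["of course", "yes", "always", "never", "no way", "no"]

calls = []  # hypothetical helper: records every item the predictor sees

def predict_label_counted(item):
    calls.append(item)
    return 2  # same placeholder prediction as in the answer

dset_1 = dict(set_1)

# Eager: the default argument is computed before get() is called,
# so the predictor runs even for items that are in the dictionary.
eager = [[item, dset_1.get(item, predict_label_counted(item))] for item in set_2]
eager_calls = len(calls)

# Lazy: the conditional expression only calls the predictor on a miss.
calls.clear()
lazy = [[item, dset_1[item] if item in dset_1 else predict_label_counted(item)]
        for item in set_2]
lazy_calls = len(calls)

print(eager == lazy)            # True: same labels either way
print(eager_calls, lazy_calls)  # 6 4
```

The labels come out identical; only the number of predictor calls differs.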

This other code does the same thing, with a for loop inside a function instead of a list comprehension:

set_1 = [["yes", 1], ["maybe", 1], ["never", 0], ["nopes", 0], ["si", 1]]
set_2 = ["of course", "yes", "always", "never", "no way", "no"]

def predict_label(item):
    return 2 # just to check which items got predicted

def labeled_set(set1, set2):
    dset_1 = dict(set1)
    labeled_set_2 = []
    for item in set2:
        if item in dset_1.keys():
            labeled_set_2.append([item, dset_1[item]])
        else:
            labeled_set_2.append([item, predict_label(item)])
    return labeled_set_2

print(labeled_set(set_1, set_2))

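This gives the same result. Here the membership test on dset_1.keys() replaces the get method (a plain "item in dset_1" works as well), and predict_label is only called for items that are not in set_1.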
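Applied to the data frames from the question, the same dictionary lookup can be sketched with pandas. This is only a sketch under assumptions: the column names text and label and the tiny in-memory frames stand in for the real Train.csv and Test.csv, and predict_labels is a placeholder for the fitted estimator.

```python
import pandas as pd

# Hypothetical stand-ins for Train.csv and Test.csv (column names are assumed).
train = pd.DataFrame({"text": ["yes", "maybe", "never", "nopes", "si"],
                      "label": [1, 1, 0, 0, 1]})
test = pd.DataFrame({"text": ["of course", "yes", "always", "never", "no way", "no"]})

def predict_labels(texts):
    # Placeholder for the real estimator: returns one label per text.
    return [2] * len(texts)

# Map each training text to its label (if a text repeats, the last label wins).
lookup = dict(zip(train["text"], train["label"]))

# map() keeps the row order of test; texts that are not in train become NaN.
known = test["text"].map(lookup)
missing = known.isna()

# Predict only the non-duplicates, in one batch, and align them by row index.
predicted = pd.Series(predict_labels(test.loc[missing, "text"].tolist()),
                      index=test.index[missing])
test["label"] = known.fillna(predicted).astype(int)

print(test)
```

Predicting the non-duplicates in one batch usually suits estimators that expect a whole array of samples rather than one document at a time.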
