Observer

Reputation: 651

Finding and grouping similar texts in a Python dataframe

Suppose I have a Python dataframe with a text column as follows:

data['text']

abc.google.com
d-2667808233512566908.ampproject.net
d-27973032622323999654.ampproject.net
def.google.com
d-28678547673442325000.ampproject.net
i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net
d-29763453703185417167.ampproject.net
poi.google.com
d-3064948553577027059.ampproject.net
i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net
d-2914631797784843280.ampproject.net
i1-j1-18-24-1-11326-1053733564-s.init.cedexis-radar.net

I want to find the common part of similar texts and group them by it. For example, abc.google.com, def.google.com, and poi.google.com should all map to google.com, and so on.

The required output is:

google.com
ampproject.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
s.init.cedexis-radar.net

This is essentially a data-cleaning exercise in which I want to strip the unwanted parts. One way is to inspect the data manually and write code for every possible group, but I will have millions of texts. Is there a way or a package in Python to do this?

Sorry for asking without having tried anything. I have researched this without much success and am not sure where to start. If anybody can suggest the approach to take, that would be helpful too.

Thanks

Upvotes: 0

Views: 154

Answers (1)

Wasi Ahmad

Reputation: 37691

For cleaning, you can use regular expressions if you know the specific format of the text segments in your dataset.
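
As a rough sketch of the regular-expression route: the pattern below is an assumption fitted to the sample in the question, not a general rule. It strips everything up to the last digit that is followed by "-" or ".", and if the text contains no such digit it drops the first dot-separated label instead. The function name strip_prefix is hypothetical.

```python
import re

# Assumed pattern, fitted to the sample data only:
#   1) greedily consume up to the last digit followed by '-' or '.', or
#   2) if there is no such digit, consume the first label and its dot.
PREFIX = re.compile(r"^(?:.*\d[-.]|[^.]+\.)")


def strip_prefix(text):
    """Remove the assumed 'noise' prefix from a single string."""
    return PREFIX.sub("", text, count=1)


sample = [
    "abc.google.com",
    "d-2667808233512566908.ampproject.net",
    "i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net",
]
for item in sample:
    print(strip_prefix(item))
```

Note that a text that is already clean, such as google.com, would lose its first label under this pattern, so it only makes sense if every row is known to carry a prefix.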

Another approach is to match common patterns. For example, many of your text segments contain google.com. You can use this information during pre-processing.
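
If the common patterns are already known, one possible sketch is a simple lookup against a hand-written list. The list KNOWN_SUFFIXES and the function map_to_known_suffix are hypothetical names introduced here for illustration, and the list contents are taken from the required output in the question.

```python
# Hypothetical list of patterns you already know occur in the data.
KNOWN_SUFFIXES = ["google.com", "ampproject.net", "s.init.cedexis-radar.net"]


def map_to_known_suffix(text, suffixes=KNOWN_SUFFIXES):
    """Return the longest known suffix that `text` ends with, else `text`."""
    candidates = [
        s for s in suffixes
        # Require a '.' or '-' boundary so 'notgoogle.com' is not matched.
        if text == s or text.endswith(("." + s, "-" + s))
    ]
    return max(candidates, key=len) if candidates else text


print(map_to_known_suffix("poi.google.com"))
print(map_to_known_suffix("i1-j1-18-24-1-11326-1053733564-s.init.cedexis-radar.net"))
print(map_to_known_suffix("unknown.example.org"))  # left unchanged
```

Texts that match nothing in the list are returned unchanged, which makes it easy to inspect what is left over and extend the list.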

Example

lines = ['abc.google.com',
         'd-2667808233512566908.ampproject.net',
         'd-27973032622323999654.ampproject.net',
         'def.google.com',
         'd-28678547673442325000.ampproject.net',
         'i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net',
         'd-29763453703185417167.ampproject.net',
         'poi.google.com',
         'd-3064948553577027059.ampproject.net',
         'i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net',
         'd-2914631797784843280.ampproject.net',
         'i1-j1-18-24-1-11326-1053733564-s.init.cedexis-radar.net']


def commonSubstringFinder(string1, string2):
    # Returns the shared trailing part of two dot-separated names, or ""
    # when fewer than two whole labels match.
    common_substring = ""
    split1 = string1.split('.')
    split2 = string2.split('.')
    index1 = len(split1) - 1
    index2 = len(split2) - 1
    size = 0  # number of whole labels matched so far
    # Compare labels from right to left while both names have labels left.
    # (`and` is required here; `&` binds tighter than `>=`.)
    while index1 >= 0 and index2 >= 0:
        if split1[index1] == split2[index2]:
            # Whole label matches: prepend it to the result.
            if common_substring:
                common_substring = split1[index1] + '.' + common_substring
            else:
                common_substring = split1[index1]
            size += 1
        else:
            # Labels differ: keep only the trailing letters they share, if any.
            ind1 = len(split1[index1]) - 1
            ind2 = len(split2[index2]) - 1
            tail = ""
            while ind1 >= 0 and ind2 >= 0:
                char = split1[index1][ind1]
                if char == split2[index2][ind2] and char.isalpha():
                    tail = char + tail
                else:
                    break
                ind1 -= 1
                ind2 -= 1
            # Add the separator only when some letters were actually shared.
            if tail:
                common_substring = tail + '.' + common_substring

            break
        index1 -= 1
        index2 -= 1

    # Require at least two whole labels so that a lone 'net' or 'com' never counts.
    if size > 1:
        return common_substring
    else:
        return ""

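output = []
for line in lines:
    flag = True  # stays True if the line matches no earlier entry
    for i in range(len(output)):
        result = commonSubstringFinder(output[i], line)
        if len(result) > 0:
            # Shorten the earlier entry to the shared part and use it for this line too.
            output[i] = result
            output.append(result)
            flag = False
            break
    if flag:
        # No group found yet: keep the line unchanged for now.
        output.append(line)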

for item in output:
    print(item)

This outputs:

google.com
ampproject.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
s.init.cedexis-radar.net
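
Since the question mentions millions of texts, comparing every line against all earlier entries may become slow. One alternative sketch, which swaps in a different technique from the pairwise comparison above, is to count every dot-separated suffix once and then give each text its longest suffix that also occurs in another text. The names label_suffixes and group_by_shared_suffix and the parameters min_labels and min_count are assumptions for illustration. Because it works on whole labels only, it would yield init.cedexis-radar.net rather than s.init.cedexis-radar.net for the cedexis rows.

```python
from collections import Counter


def label_suffixes(name, min_labels=2):
    """List the dot-separated suffixes of `name`, longest first."""
    labels = name.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - min_labels + 1)]


def group_by_shared_suffix(names, min_count=2):
    """Map each name to its longest suffix seen in at least `min_count` names."""
    counts = Counter(s for name in names for s in label_suffixes(name))
    result = []
    for name in names:
        shared = [s for s in label_suffixes(name) if counts[s] >= min_count]
        # Names that share nothing with any other name are kept unchanged.
        result.append(shared[0] if shared else name)
    return result


names = [
    "abc.google.com",
    "def.google.com",
    "d-1.ampproject.net",
    "d-2.ampproject.net",
    "solo.example.org",
]
for item in group_by_shared_suffix(names):
    print(item)
```

Exact duplicate texts count towards min_count, so a text that appears twice would map to itself; deduplicating first avoids that.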

Upvotes: 1
