Reputation: 651
Suppose I have a Python DataFrame with a column as follows:
data['text']
abc.google.com
d-2667808233512566908.ampproject.net
d-27973032622323999654.ampproject.net
def.google.com
d-28678547673442325000.ampproject.net
i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net
d-29763453703185417167.ampproject.net
poi.google.com
d-3064948553577027059.ampproject.net
i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net
d-2914631797784843280.ampproject.net
i1-j1-18-24-1-11326-1053733564-s.init.cedexis-radar.net
I want to find the common part shared by similar strings and group them by it. For example, abc.google.com, def.google.com, and poi.google.com should all map to google.com, and so on.
The required output is:
google.com
ampproject.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
s.init.cedexis-radar.net
It is more of a data-cleaning exercise in which I strip the unwanted parts. One way is to inspect the data manually and write code for every possible group, but I will have millions of strings. Is there a way, or a package, in Python to do this?
Sorry for asking without having tried anything. I have researched this without much success and am not sure where to start. If anybody could also describe the approach to take, that would be helpful.
Thanks
Upvotes: 0
Views: 154
Reputation: 37691
For cleaning, you can use regular expressions if you know the specific format of the text segments in your dataset.
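As a rough illustration of the regular-expression route, here is a minimal sketch. The pattern and the name strip_prefix are hypothetical and inferred only from the sample strings in the question: it assumes the unwanted part is either a prefix ending in a digit followed by a hyphen, or otherwise the first dot-separated label. Real data may well need a different pattern.

```python
import re

# Hypothetical pattern, inferred from the sample data only. It removes either:
#   1. everything up to the last "<digit>-" (e.g. "i1-j4-...-2081004232-"), or
#   2. failing that, the first dot-separated label (e.g. "abc." or "d-2667...908.").
PREFIX = re.compile(r'^(?:.*\d-|[^.]+\.)')

def strip_prefix(host):
    """Return host with the assumed unwanted prefix removed."""
    return PREFIX.sub('', host, count=1)

samples = ['abc.google.com',
           'd-2667808233512566908.ampproject.net',
           'i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net']

for host in samples:
    print(strip_prefix(host))
# google.com
# ampproject.net
# s.init.cedexis-radar.net
```

Note that this always strips something: a string that is already clean, such as google.com, would be reduced to com, so it only fits data where every entry carries a prefix.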
Another approach is to match common patterns. For example, many of your text segments contain google.com. You can use this information during pre-processing.
Example:
lines = ['abc.google.com',
         'd-2667808233512566908.ampproject.net',
         'd-27973032622323999654.ampproject.net',
         'def.google.com',
         'd-28678547673442325000.ampproject.net',
         'i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net',
         'd-29763453703185417167.ampproject.net',
         'poi.google.com',
         'd-3064948553577027059.ampproject.net',
         'i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net',
         'd-2914631797784843280.ampproject.net',
         'i1-j1-18-24-1-11326-1053733564-s.init.cedexis-radar.net']

def commonSubstringFinder(string1, string2):
    # Build the common suffix of the two strings, label by label.
    common_substring = ""
    split1 = string1.split('.')
    split2 = string2.split('.')
    index1 = len(split1) - 1
    index2 = len(split2) - 1
    size = 0  # number of whole labels that matched
    # Walk both label lists from the right while both indices are valid.
    while index1 >= 0 and index2 >= 0:
        if split1[index1] == split2[index2]:
            if common_substring:
                common_substring = split1[index1] + '.' + common_substring
            else:
                common_substring += split1[index1]
            size += 1
        else:
            # Labels differ: keep only their common trailing letters, then stop.
            ind1 = len(split1[index1]) - 1
            ind2 = len(split2[index2]) - 1
            if split1[index1][ind1] == split2[index2][ind2]:
                common_substring = '.' + common_substring
            while ind1 >= 0 and ind2 >= 0:
                if split1[index1][ind1] == split2[index2][ind2] and split1[index1][ind1].isalpha():
                    if common_substring:
                        common_substring = split1[index1][ind1] + common_substring
                    else:
                        common_substring += split1[index1][ind1]
                else:
                    break
                ind1 -= 1
                ind2 -= 1
            break
        index1 -= 1
        index2 -= 1
    # Require at least two matching labels, so a shared TLD alone is not a group.
    if size > 1:
        return common_substring
    else:
        return ""

output = []
for line in lines:
    flag = True  # True until the line is matched to an earlier entry
    for i in range(len(output)):
        result = commonSubstringFinder(output[i], line)
        if len(result) > 0:
            # Replace the earlier entry with the shared suffix and reuse it for this line.
            output[i] = result
            output.append(result)
            flag = False
            break
    if flag:
        output.append(line)

for item in output:
    print(item)
This outputs:
google.com
ampproject.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
s.init.cedexis-radar.net
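The loop above compares each new line against every earlier entry, which may become slow with millions of rows. As a possible alternative (a different technique from the code above, not a drop-in replacement), the sketch below counts dot-separated suffixes in a dictionary and keeps, for each string, the longest suffix shared by at least two distinct strings. The names label_suffixes and group_by_shared_suffix and the thresholds min_labels and min_count are hypothetical. Because it works on whole labels only, it yields init.cedexis-radar.net rather than s.init.cedexis-radar.net.

```python
from collections import Counter

def label_suffixes(host, min_labels=2):
    """Yield the dot-separated suffixes of host, shortest first."""
    labels = host.split('.')
    for n in range(min_labels, len(labels) + 1):
        yield '.'.join(labels[-n:])

def group_by_shared_suffix(hosts, min_count=2):
    """Map each host to its longest suffix shared by >= min_count distinct hosts."""
    # Count each suffix once per distinct host so duplicates do not inflate counts.
    counts = Counter(s for h in set(hosts) for s in label_suffixes(h))
    result = []
    for h in hosts:
        best = h  # fall back to the host itself if no suffix is shared
        for s in label_suffixes(h):
            if counts[s] >= min_count:
                best = s  # suffixes come shortest first, so the last hit is the longest
        result.append(best)
    return result

sample = ['abc.google.com',
          'd-2667808233512566908.ampproject.net',
          'def.google.com',
          'd-27973032622323999654.ampproject.net',
          'i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net',
          'i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net']

for item in group_by_shared_suffix(sample):
    print(item)
```

This makes two passes over the data instead of a nested loop. Assuming data is a pandas DataFrame, the result list could be assigned back as a new column.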
Upvotes: 1