Reputation: 20882
I have a site where users can put in a description about themselves.
Most users write something appropriate but some just copy/paste the same text a number of times (to create the appearance of a fair amount of text).
eg: "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace"
Is there a good method to detect repetitive text with PHP?
The only concept I currently have would be to break the text into separate words (delimited by space) and then look to see if the word is repeated more then a set limited. Note: I'm not 100% sure how I would code this solution.
Thoughts on the best way to detect duplicate text? Or how to code the above idea?
Upvotes: 23
Views: 14581
Reputation: 12092
You have a tricky problem on your hands, primarily because your requirements are somewhat unclear.
You indicate you want to disallow repeated text, because it's "bad".
Consider someone with who puts the last stanza of Robert Frosts Stopping by Woods on a Snowy Evening in their profile:
These woods are lovely, dark and deep
but I have promises to keep
and miles to go before I sleep
and miles to go before I sleep
You might consider this good, but it does have a repetition. So what's good, and what's bad? (note that this is not an implementation problem just yet, you're just looking for a way to define "bad repetitions")
Directly detecting duplicates thus proves tricky. So let's devolve to tricks.
Compression works by taking redundant data, and compressing it into something smaller. A very repetitive text would be very easily compressed. A trick you could perform, is to take the text, zip it, and take a look at the compression ratio. Then tweak the allowed ratio to something you find acceptable.
implementation:
$THRESHOLD = ???;
$bio = ???;
$zippedbio = gzencode($bio);
$compression_ratio = strlen($zippedbio) / strlen($bio);
if ($compression_ratio >= $THRESHOLD) {
//ok;
} else {
//not ok;
}
A couple of experimental results from examples found in this question/answers:
suggest a threshold value of around 0.6 before rejecting it as too repetitive.
Upvotes: 3
Reputation: 44833
You could use a regex, like this:
if (preg_match('/(.{10,})\\1{2,}/', $theText)) {
echo "The string is repeated.";
}
Explanation:
(.{10,})
looks for and captures a string that is at least 10 characters long\\1{2,}
looks for the first string at least 2 more timesPossible tweaks to suit your needs:
10
to a higher or lower number to match longer or shorter repeated strings. I just used 10
as an example.love and peace love and peace
), delete the {2,}
. If you want to catch a higher number of repetitions, increase the 2
.,
in {2,}
.Upvotes: 16
Reputation: 222501
I am not sure whether it is a good idea to combat such problem. If a person wants to put junk in aboutme field, they will always come up with the idea how to do it. But I will ignore this fact and combat the problem as an algorithmic challenge:
Having a string S, which consists of the substrings (which can appear many times and non-overlapping) find the substring it consist of.
The definition is louse and I assume that the string is already converted to lowercase.
First an easier way:
Use modification of a longest common subsequence which has an easy DP programming solution. But instead of finding a subsequence in two different sequences, you can find longest common subsequence of the string with respect to the same string LCS(s, s)
.
It sounds stupid at the beginning (surely LCS(s, s) == s
), but we actually do not care about the answer, we care about the DP matrix that it get.
Let's look at the example: s = "abcabcabc"
and the matrix is:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 2, 0, 0, 2, 0, 0, 2, 0]
[0, 0, 0, 3, 0, 0, 3, 0, 0, 3]
[0, 1, 0, 0, 4, 0, 0, 4, 0, 0]
[0, 0, 2, 0, 0, 5, 0, 0, 5, 0]
[0, 0, 0, 3, 0, 0, 6, 0, 0, 6]
[0, 1, 0, 0, 4, 0, 0, 7, 0, 0]
[0, 0, 2, 0, 0, 5, 0, 0, 8, 0]
[0, 0, 0, 3, 0, 0, 6, 0, 0, 9]
Note the nice diagonals there. As you see the first diagonal ends with 3, second with 6 and third with 9 (our original DP solution which we do not care).
This is not a coincidence. Hope that after looking in more details about how DP matrix is constructed you can see that these diagonals correspond to duplicate strings.
Here is an example for s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
and the very last row in the matrix is:
[0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 17, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 34, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 51, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 68]
.
As you see big numbers (17, 34, 51, 68) there correspond to the end of the diagonals (there is also some noise there just because I specifically added small duplicate letters like aaa
).
Which suggest that we can just find the gcd of biggest two numbers gcd(68, 51) = 17
which will be the length of our repeated substring.
Here just because we know that the the whole string consists of repeated substrings, we know that it starts at the 0-th position (if we do not know it we would need to find the offset).
And here we go: the string is "aaabasdfwasfsdtas"
.
P.S. this method allows you to find repeats even if they are slightly modified.
For people who would like to play around here is a python script (which was created in a hustle so feel free to improve):
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
else:
m[x][y] = 0
return m
s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
m = longest_common_substring(s, s)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
M = np.array(m)
print m[-1]
arr = np.asarray(M)
plt.imshow(arr, cmap = cm.Greys_r, interpolation='none')
plt.show()
I told about the easy way, and forgot to write about the hard way. It is getting late, so I will just explain the idea. The implementation is harder and I am not sure whether it will give you better results. But here it is:
Use the algorithm for longest repeated substring (you will need to implement trie or suffix tree which is not easy in php).
After this:
s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
s1 = largest_substring_algo1(s)
Took the implementation of largest_substring_algo1 from here. Actually it is not the best (just for showing the idea) as it does not use the above-mention data-structures. The results for s
and s1
are:
aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas
aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaa
As you see the difference between them is actually the substring which was duplicated.
Upvotes: 3
Reputation: 32912
Another idea would be to use substr_count iteration:
$str = "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace";
$rep = "";
$str = strtolower($str);
for($i=0,$len=strlen($str),$pattern=""; $i<$len; ++$i) {
$pattern.= $str[$i];
if(substr_count($str,$pattern)>1)
$rep = strlen($rep)<strlen($pattern) ? $pattern : $rep;
else
$pattern = "";
}
// warn if 20%+ of the string is repetitive
if(strlen($rep)>strlen($str)/5) echo "Repetitive string alert!";
else echo "String seems to be non-repetitive.";
echo " Longest pattern found: '$rep'";
Which would output
Repetitive string alert! Longest pattern found: 'love a and peace love a and peace love a and peace'
Upvotes: 7
Reputation: 21353
This is a basic text classification problem. There are lots of articles out there on how to determine if some text is spam/not spam which I'd recommend digging into if you really want to get into the details. A lot of it is probably overkill for what you need to do here.
Granted one approach would be to evaluate why you're requiring people to enter longer bios, but I'll assume you've already decided that forcing people to enter more text is the way to go.
Here's an outline of what I would do:
This approach would require you to figure out what's different between the two sets. Intuitively, I'd expect spam to show fewer unique words and if you plot the histogram values, a higher area under the curve concentrated toward the top words.
Here's some sample code to get you going:
$str = 'Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace';
// Build a histogram mapping words to occurrence counts
$hist = array();
// Split on any number of consecutive whitespace characters
foreach (preg_split('/\s+/', $str) as $word)
{
// Force all words lowercase to ignore capitalization differences
$word = strtolower($word);
// Count occurrences of the word
if (isset($hist[$word]))
{
$hist[$word]++;
}
else
{
$hist[$word] = 1;
}
}
// Once you're done, extract only the counts
$vals = array_values($hist);
rsort($vals); // Sort max to min
// Now that you have the counts, analyze and decide valid/invalid
var_dump($vals);
When you run this code on some repetitive strings, you'll see the difference. Here's a plot of the $vals
array from the example string you gave:
Compare that with the first two paragraphs of Martin Luther King Jr.'s bio from Wikipedia:
A long tail indicates lots of unique words. There's still some repetition, but the general shape shows some variation.
FYI, PHP has a stats package you can install if you're going to be doing lots of math like standard deviation, distribution modeling, etc.
Upvotes: 23
Reputation: 2071
I think the approach of finding duplicate words, will be messy. Most likely you'll get duplicate words in real descriptions "I really, really, really, like ice creme, especially vanilla ice creme".
A better approach, is to split the string to get the words, find all the unique words, add all the character counts of the unique words, and set that too some limit. Say, you require 100 character descriptions, require around 60 unique characters from words.
Copying @ficuscr's approach
$words = str_word_count("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace", 1);
$total = 0;
foreach ($words as $key => $count) { $total += strlen($key) }
Upvotes: 3
Reputation: 987
Here's a the code of the function you're looking for in the description:
<?php
function duplicate(){
$txt = strtolower("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace");
$strings = explode(" ",$txt);
$set = 2 ;
for($i=0;$i < sizeof($strings);$i++){
$count = 0;
$current = $strings[$i];
for($j=$i+1;$j < sizeof($strings);$j++){
if($strings[$j]!==$current){
continue;
}else if($count<$set){
$count++;
}else{
echo ("String ".$current." repeated more than ".$set." times\n");
}
}
}
}
echo("Hello World!\n");
duplicate();
?>
Upvotes: 3
Reputation: 611
// 3 examples of how you might detect repeating user input
// use preg_match
// pattern to match agains
$pattern = '/^text goes here$/';
// the user input
$input = 'text goes here';
// check if its match
$repeats = preg_match($pattern, $input);
if ($repeats) {
var_dump($repeats);
} else {
// do something else
}
// use strpos
$string = 'text goes here';
$input = 'text goes here';
$repeats = strpos($string, $input);
if ($repeats !== false) {
# code...
var_dump($repeats);
} else {
// do something else
}
// or you could do something like:
function repeatingWords($str)
{
$words = explode(' ', trim($str)); //Trim to prevent any extra blank
if (count(array_unique($words)) == count($words)) {
return true; //Same amount of words
}
return false;
}
$string = 'text goes here. text goes here. ';
if (repeatingWords($string)) {
var_dump($string);
} else {
// do something else
}
Upvotes: 4
Reputation: 7054
I think you are on the right track breaking down the string and looking at repeated words.
Here is some code though which does not use a PCRE and leverages PHP native string functions (str_word_count and array_count_values):
<?php
$words = str_word_count("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace", 1);
$words = array_count_values($words);
var_dump($words);
/*
array(5) {
["Love"]=>
int(1)
["a"]=>
int(6)
["and"]=>
int(6)
["peace"]=>
int(6)
["love"]=>
int(5)
}
*/
Some tweaks might be to:
Upvotes: 12