user3510922

Reputation: 67

Remove lines that share a specific number of words with another line

My question is easiest to explain with an example.

Suppose this is the text file, which contains these lines:

hello this is my word file and this is line number 1
hello this is second line and this is some text
hello this is third line and again some text
jhasg djgha sdgasjhgdjasgh jdkh
sdhgfkjg sdjhgf sjkdghf sdhf
s hdg fjhsgd fjhgsdj gfj ksdgh

For the example above, the output should be:

hello this is my word file and this is line number 1
jhasg djgha sdgasjhgdjasgh jdkh
sdhgfkjg sdjhgf sjkdghf sdhf
s hdg fjhsgd fjhgsdj gfj ksdgh

Because the first three lines share the words "hello this is line", which is more than 3 words, the later lines containing those words are deleted. Please note that the first line is not deleted, because it is still unique when it is read.

I tried to code this myself and made a mess: it produced a 200 MB text file containing the first line repeated endlessly. Anyway, here is the code. Don't execute it, or you may end up with a full hard disk.

<?php

$fileA = fopen("names.txt", "r");
$fileB = fopen("anothernames.txt", "r");
$fileC = fopen("uniquenames.txt", "w");
while(!feof($fileA))
{
    $line = fgets($fileA);
    $words = explode(" ", $line);
    $size = count($words);

    while(!feof($fileA))
    {
        $line1 = fgets($fileB);
        $words1 = explode(" ", $line1);
        $size1 = count($words1);

        $c=0;

        for($i=0; $i<$size; $i++)
        {
                for($j=0; $j<$size1; $j++)
            {
                    if($words[$i]==$words1[$j])
                        $c++;
            }
        }
        if($c<3)
            fwrite($fileC, $line);
    }
}

fclose($fileA);
fclose($fileB);
fclose($fileC);

?>

Thanks

Upvotes: 1

Views: 509

Answers (3)

dognose

Reputation: 20899

An easy approach would be the following:

  • read all the lines, using file()
  • create an array indexed by word, where each entry lists the sentences containing that word.
  • build a blacklist of every sentence that is listed under any word with 3 or more entries.
  • finally, print every line except the blacklisted ones:

Example:

<?php
$lines = array("hello this is my word file and this is line number 1",
  "hello this is second line and this is some text",
  "hello this is third line and again some text",
  "jhasg djgha sdgasjhgdjasgh jdkh",
  "sdhgfkjg sdjhgf sjkdghf sdhf",
  "s hdg fjhsgd fjhgsdj gfj ksdgh");

//$lines = file("path/to/file");

$result = array();
//build "count-per-word" array
foreach ($lines AS $line){
   $words = explode(" ", $line);
   foreach ($words AS $word){
       $word = strtolower($word);
       if (isset($result[$word]))
           $result[$word][] = $line;
       else
           $result[$word] = array($line);  
   }
}

//Blacklist each sentence containing a word that has 3 or more entries.
$blacklist = array();
foreach ($result AS $word => $entries){
   if (count($entries) >= 3){
     foreach($entries AS $entry){
       $blacklist[] = $entry;
     }
   }
}

//list all not blacklisted. 
foreach ($lines AS $line){
  if (!in_array($line, $blacklist))
      echo $line."<br />";
}

?>

Output:

jhasg djgha sdgasjhgdjasgh jdkh
sdhgfkjg sdjhgf sjkdghf sdhf
s hdg fjhsgd fjhgsdj gfj ksdgh

Note that this will also blacklist a single sentence that contains the same word 3 times, such as "Foo Foo Foo bar".

To avoid this, check whether the line is already "known" for a certain word before pushing it to the array:

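foreach ($words AS $word){
   $word = strtolower($word); // same normalization as in the example above
   if (isset($result[$word])){
       // only add the line if it is not already listed for this word
       if (!in_array($line, $result[$word])){
          $result[$word][] = $line;
       }
   }else
       $result[$word] = array($line);
}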
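
As a variation on the fix above, here is a minimal sketch that records line indexes per word instead of the line text, so a word repeated within one line is only counted once and no in_array() lookup is needed. The function name linesWithoutCommonWords and the $minLines parameter are made up for illustration, and the threshold of 3 is an assumption taken from the example:

```php
<?php
// Hypothetical helper (not from the original answer): drops every line that
// contains a word occurring in at least $minLines different lines.
function linesWithoutCommonWords(array $lines, int $minLines = 3): array
{
    // word => [lineIndex => true]; using the index as key counts each line once per word
    $linesPerWord = [];
    foreach ($lines as $i => $line) {
        foreach (explode(" ", strtolower($line)) as $word) {
            if ($word === '') {
                continue; // ignore empty pieces caused by repeated spaces
            }
            $linesPerWord[$word][$i] = true;
        }
    }

    // Collect the indexes of all lines that contain a "too common" word.
    $blacklist = [];
    foreach ($linesPerWord as $indexes) {
        if (count($indexes) >= $minLines) {
            $blacklist += $indexes; // array union keeps the line indexes as keys
        }
    }

    // Keep only the lines whose index is not blacklisted.
    return array_values(array_diff_key($lines, $blacklist));
}

$lines = array("hello this is my word file and this is line number 1",
  "hello this is second line and this is some text",
  "hello this is third line and again some text",
  "jhasg djgha sdgasjhgdjasgh jdkh",
  "sdhgfkjg sdjhgf sjkdghf sdhf",
  "s hdg fjhsgd fjhgsdj gfj ksdgh");

foreach (linesWithoutCommonWords($lines) as $line) {
    echo $line . "\n";
}
```

Like the example above, this removes all three "hello" lines, including the first one, and a lone "Foo Foo Foo bar" line is kept.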

Upvotes: 1

Marc B

Reputation: 360742

Why not just array_intersect?

php > $l1 = 'hello this is my word file and this is line number 1';
php > $l2 = 'hello this is second line and this is some text';
php > $a1 = explode(" ", $l1);
php > $a2 = explode(" ", $l2);
php > var_dump(array_intersect($a1, $a2));
array(7) {
  [0]=>
  string(5) "hello"
  [1]=>
  string(4) "this"
  [2]=>
  string(2) "is"
  [6]=>
  string(3) "and"
  [7]=>
  string(4) "this"
  [8]=>
  string(2) "is"
  [9]=>
  string(4) "line"
}


// inside the loop over the lines:
if (count(array_intersect($a1, $a2)) >= 3) {
    continue; // skip this line
}

Or am I reading your "matching" too loosely?
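
To make the idea concrete, here is a rough sketch of a full loop built on array_intersect. It assumes that "matching" means sharing at least 3 distinct words with a line that was already kept, which is one possible reading of the question. The function name filterSimilarLines and the $threshold parameter are invented for this example:

```php
<?php
// Hypothetical helper: keeps a line only if it shares fewer than $threshold
// distinct words with every line that has been kept so far.
function filterSimilarLines(array $lines, int $threshold = 3): array
{
    $kept = array();      // lines that survive the filter
    $keptWords = array(); // distinct words of each kept line

    foreach ($lines as $line) {
        // Split on whitespace and drop duplicates, so "this is ... this is"
        // contributes each word only once to the intersection count.
        $words = array_unique(preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY));

        $isSimilar = false;
        foreach ($keptWords as $other) {
            if (count(array_intersect($words, $other)) >= $threshold) {
                $isSimilar = true;
                break;
            }
        }

        if (!$isSimilar) {
            $kept[] = $line;
            $keptWords[] = $words;
        }
    }

    return $kept;
}

$lines = array("hello this is my word file and this is line number 1",
  "hello this is second line and this is some text",
  "hello this is third line and again some text",
  "jhasg djgha sdgasjhgdjasgh jdkh",
  "sdhgfkjg sdjhgf sjkdghf sdhf",
  "s hdg fjhsgd fjhgsdj gfj ksdgh");

foreach (filterSimilarLines($lines) as $line) {
    echo $line . "\n";
}
```

With the sample lines this keeps the first "hello" line and drops the second and third, which is the output asked for in the question. Every line is compared against all kept lines, so the cost grows quadratically with the number of lines.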

Upvotes: 0

statelessMind

Reputation: 54

# the second
while(!feof($fileA))
# should be
while(!feof($fileB))

and

if($c<3)
    fwrite($fileC, $line);
# should be
if($c<3){
    fwrite($fileC, $line);
    continue 2; // jump to the next line of $fileA
}

but

then compare that array which contains words of that line WITH all the words of next lines

only makes sense when you compare the file with itself!

EDIT: my post makes no sense at all; read the note in the previous post!

Upvotes: 0
