Simo

Reputation: 49

Remove specific duplicate lines without sorting

I have a text file with around 5000 lines. I have to delete specific duplicate lines (those that don't contain the word "Niveau" or "stime"), keeping the first occurrence and without sorting. The text pattern looks like this:

vide vide Time: stime 3:30 PM vide vide  
NN NN NP stime LS NP NN NN  
 ----------Niveau 1--------------  
Time: | 0 | 263.0 | 266.0 | 0,0113  
NP | 0 | 0.0 | 24885.0 | 1  
3:30 | -0 | 104.0 | 120.0 | 0,1333  
LS | -0 | 0.0 | 13134.0 | 1  
PM | -1 | 134.0 | 238.0 | 0,437  
NP | -1 | 0.0 | 24885.0 | 1  
 ----------Niveau 2--------------  
3:30 PM | -0 | 30.0 | 41.0 | 0,2683  
3:30 NP | -0 | 133.0 | 55.0 | -1,4182  
LS PM | -0 | 42.0 | 237.0 | 0,8228  
LS NP | -0 | 0.0 | 2456.0 | 1  
 ----------Niveau 3--------------  


vide vide Time: stime 3:30 pm vide vide   
NN NN NP stime LS NN NN NN   
 ----------Niveau 1--------------  
Time: | 0 | 263.0 | 266.0 | 0,0113  
NP | 0 | 0.0 | 24885.0 | 1  
3:30 | -0 | 104.0 | 120.0 | 0,1333  
LS | -0 | 0.0 | 13134.0 | 1  
pm | -1 | 38.0 | 54.0 | 0,2963  
NN | -1 | 0.0 | 59511.0 | 1  
 ----------Niveau 2--------------  
3:30 pm | -0 | 9.0 | 9.0 | 0  
3:30 NN | -0 | 36.0 | 24.0 | -0,5  
LS pm | -0 | 22.0 | 52.0 | 0,5769  
LS NN | -0 | 0.0 | 2658.0 | 1  
 ----------Niveau 3--------------  

Expected results:

vide vide Time: stime 3:30 PM vide vide  
NN NN NP stime LS NP NN NN  
 ----------Niveau 1--------------  
Time: | 0 | 263.0 | 266.0 | 0,0113  
NP | 0 | 0.0 | 24885.0 | 1  
3:30 | -0 | 104.0 | 120.0 | 0,1333  
LS | -0 | 0.0 | 13134.0 | 1  
PM | -1 | 134.0 | 238.0 | 0,437  
NP | -1 | 0.0 | 24885.0 | 1  
 ----------Niveau 2--------------  
3:30 PM | -0 | 30.0 | 41.0 | 0,2683  
3:30 NP | -0 | 133.0 | 55.0 | -1,4182  
LS PM | -0 | 42.0 | 237.0 | 0,8228  
LS NP | -0 | 0.0 | 2456.0 | 1  
 ----------Niveau 3--------------  


vide vide Time: stime 3:30 pm vide vide   
NN NN NP stime LS NN NN NN   
 ----------Niveau 1--------------     
pm | -1 | 38.0 | 54.0 | 0,2963  
NN | -1 | 0.0 | 59511.0 | 1  
 ----------Niveau 2--------------  
3:30 pm | -0 | 9.0 | 9.0 | 0  
3:30 NN | -0 | 36.0 | 24.0 | -0,5  
LS pm | -0 | 22.0 | 52.0 | 0,5769  
LS NN | -0 | 0.0 | 2658.0 | 1  
 ----------Niveau 3--------------  

Using Notepad++ with the TextFX plugin, I hide the lines containing the words "Niveau" and "stime", and then I use the regex ^(.*?)$\s+?^(?=.*^\1$) in the search-and-replace dialog, as suggested in the second solution in this post. When I click Replace All, it removes all the lines and I get a blank text file. Am I doing something wrong?

Upvotes: 4

Views: 1914

Answers (3)

shaiki siegal

Reputation: 392

Using awk:

  awk '(a[$0]++==0)||(/Niveau|stime/)' file
  1. (a[$0]++==0) - a[$0] is an entry in the associative array a, keyed by the full text of the line. ++ increments it by 1 (an uninitialized entry counts as 0), and ==0 checks that the line $0 is being seen for the first time (the value is incremented after the comparison is evaluated).

  2. (/Niveau|stime/) - the line contains at least one of the words "Niveau" or "stime".

  3. || - if either 1 or 2 is true, the line being analyzed is printed to the screen.
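
For readers without awk, the same logic can be sketched in Python. This is only an illustration under assumptions: the function name drop_repeated_lines and the shortened sample lines are made up for the example, and lines are compared exactly, as awk does with $0.

```python
def drop_repeated_lines(lines, keep_words=("Niveau", "stime")):
    """Keep a line if it is new or if it contains one of keep_words."""
    seen = set()   # plays the role of the awk array a
    kept = []
    for line in lines:
        protected = any(word in line for word in keep_words)
        if protected or line not in seen:
            kept.append(line)
        # Like a[$0]++ in awk, every line is recorded, protected or not.
        seen.add(line)
    return kept


# Hypothetical sample, shortened from the question's data.
sample = [
    "NN NN NP stime LS NP NN NN",
    " ----------Niveau 1--------------",
    "Time: | 0 | 263.0 | 266.0 | 0,0113",
    "NP | 0 | 0.0 | 24885.0 | 1",
    " ----------Niveau 2--------------",
    "NN NN NP stime LS NN NN NN",
    " ----------Niveau 1--------------",
    "Time: | 0 | 263.0 | 266.0 | 0,0113",
    "pm | -1 | 38.0 | 54.0 | 0,2963",
]
result = drop_repeated_lines(sample)
print("\n".join(result))
```

Because the comparison is exact, lines that differ only in trailing whitespace count as different, and repeated empty lines are dropped after the first one, which is also what the awk one-liner does.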

Upvotes: 2

Gurmanjot Singh

Reputation: 10360

The regex below works, BUT you have to click the Replace button once for every repeated line. For example, in the sample shared by the OP there are 4 lines that need to be removed, so you have to click the Replace button 4 times. I understand that this may not be an efficient solution for large files, but it's my best attempt at this question.

^(?!(?:\s*$|.*(?:Niveau|stime)))(.*$)([\s\S]*?)(\1\s*)

Replace the matches with \1\2

Here is the regex demo, which illustrates the replacement of only the first repeated line. You have to repeat this replacement multiple times to get rid of all except the first occurrence of each repeated line.

Regex Explanation:

  • ^ - asserts the start of the line
  • ^(?!(?:\s*$|.*(?:Niveau|stime))) - negative lookahead to make sure that the line is not empty (or whitespace-only) and does not contain the words Niveau or stime
  • (.*$) - matches and captures the contents of a line in group 1. In Group 1, we attempt to capture the line which may have repetitions somewhere later in the file.
  • ([\s\S]*?) - matches 0+ occurrences of any character, as few as possible and captures it as Group 2
  • (\1\s*) - matches the contents of Group 1 followed by 0+ occurrences of whitespace. If such a match is present, capture it in group 3. We need to discard the group 3 contents from the file as it is nothing but a repetition of line captured in group 1.
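
The click-by-click behaviour can be simulated outside Notepad++. The following Python sketch applies the same pattern one replacement at a time until nothing matches. The function name remove_duplicates_by_clicks and the sample text are hypothetical, and Python's re module may differ in details from the regex engine in Notepad++.

```python
import re

# The pattern from this answer; re.MULTILINE makes ^ and $ work per line.
PATTERN = re.compile(
    r"^(?!(?:\s*$|.*(?:Niveau|stime)))(.*$)([\s\S]*?)(\1\s*)",
    re.MULTILINE,
)


def remove_duplicates_by_clicks(text):
    """Apply one replacement per iteration, like one click on Replace."""
    clicks = 0
    while True:
        # count=1 replaces only the first match, keeping groups 1 and 2.
        text, replaced = PATTERN.subn(r"\1\2", text, count=1)
        if replaced == 0:
            return text, clicks
        clicks += 1


# Hypothetical sample, shortened from the question's data.
sample_text = "\n".join([
    "NN NN NP stime LS NP NN NN",
    " ----------Niveau 1--------------",
    "Time: | 0 | 263.0 | 266.0 | 0,0113",
    "NP | 0 | 0.0 | 24885.0 | 1",
    " ----------Niveau 2--------------",
    "NN NN NP stime LS NN NN NN",
    " ----------Niveau 1--------------",
    "Time: | 0 | 263.0 | 266.0 | 0,0113",
    "NP | 0 | 0.0 | 24885.0 | 1",
    "pm | -1 | 38.0 | 54.0 | 0,2963",
]) + "\n"

cleaned, clicks = remove_duplicates_by_clicks(sample_text)
print(clicks)
print(cleaned, end="")
```

Note that \1 is not anchored to whole lines in this pattern, so a line whose text also appears inside a longer later line would be matched too.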

Step by step, with four repeated lines to remove (labelled A, B, C and D in the original screenshots):

  • After the 1st click on Replace, line A is removed and only B, C and D are left.
  • After the 2nd click, line B is also removed and only C and D are left.
  • After the 3rd click, line C is also removed and only D is left.
  • After the 4th click, line D is also removed and no repeated lines are left.

Upvotes: 2

user557597

Reputation:

You'd need a scripting capability because there is no way to remove
the duplicate line without advancing the match position up to that line.

Therefore, you'd have to sit in a loop, restarting from the beginning of the
string until all duplicates are removed.

Example in Perl: while ( $str =~ s/regex/$1/g ) {}

It can be done. Might take a little extra time, but it's doable.

Anyway, this is the regex you'd need to do it.

Globally:
Find (?m)((^[^\S\r\n]*?(?=\S)(?:(?!Niveau|stime).)+$)[\S\s]*?)^\2$(?:\r?\n)?
Replace $1

Repeat this until a global pass produces no more matches (i.e. no replacements).

Explained:

 (?m)                          # Multi-line mode
 (                             # (1 start), To be written back
      (                             # (2 start), The line to test
           ^                             # BOL begin of line
           [^\S\r\n]*?                   # Spurious horizontal whitespace
           (?= \S )                      # Must be a non-whitespace ahead
           (?:                           # Skip lines containing these
                (?! Niveau | stime )
                . 
           )+
           $                             # EOL end of line
      )                             # (2 end)
      [\S\s]*?                      # Anything up to the duplicate
 )                             # (1 end)
 ^ \2 $                        # The actual duplicate line    
 (?: \r? \n )?                 # Optional linebreak (if last line, then ok)

Note that, as the regex stands, there is no trimming of horizontal whitespace
at the BOL and EOL, so the text has to match exactly.
It's easy, however, to add some extra trimming if needed.
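
The "loop until there are no more replacements" idea can be tried out with the first regex above. Below is a minimal Python sketch, assuming the pattern behaves the same under Python's re module as under the engine this answer targets. The names dedupe_until_stable and sample_text are invented for the illustration.

```python
import re

# The first regex from this answer; (?m) is expressed as re.MULTILINE.
DUPLICATE = re.compile(
    r"((^[^\S\r\n]*?(?=\S)(?:(?!Niveau|stime).)+$)[\S\s]*?)^\2$(?:\r?\n)?",
    re.MULTILINE,
)


def dedupe_until_stable(text):
    """Run global replacements until a pass changes nothing."""
    passes = 0
    while True:
        # Writing back group 1 drops the duplicate line and its linebreak.
        text, replaced = DUPLICATE.subn(r"\1", text)
        if replaced == 0:
            return text, passes
        passes += 1


# Hypothetical sample, shortened from the question's data.
sample_text = "\n".join([
    "NN NN NP stime LS NP NN NN",
    " ----------Niveau 1--------------",
    "Time: | 0 | 263.0 | 266.0 | 0,0113",
    "NP | 0 | 0.0 | 24885.0 | 1",
    " ----------Niveau 2--------------",
    "NN NN NP stime LS NN NN NN",
    " ----------Niveau 1--------------",
    "Time: | 0 | 263.0 | 266.0 | 0,0113",
    "NP | 0 | 0.0 | 24885.0 | 1",
    "pm | -1 | 38.0 | 54.0 | 0,2963",
]) + "\n"

cleaned, passes = dedupe_until_stable(sample_text)
print(passes)
print(cleaned, end="")
```

A single global pass is not enough here because each match consumes everything from the first occurrence up to its duplicate, so other duplicates inside that span are only found on a later pass.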


update

A quicker version of the above regex uses the \K construct to simplify
the replacement.

Globally:

Find (?m)(^[^\S\r\n]*?(?=\S)(?:(?!Niveau|stime).)+$)[\S\s]*?\K^\1$(?:\r?\n)?
Replace '' (nothing)

Explained

 (?m)                          # Multi-line mode
 (                             # (1 start), The line to test
      ^                             # BOL begin of line
      [^\S\r\n]*?                   # Spurious horizontal whitespace
      (?= \S )                      # Must be a non-whitespace ahead
      (?:                           # Skip lines containing these
           (?! Niveau | stime )
           . 
      )+
      $                             # EOL end of line
 )                             # (1 end)
 [\S\s]*?                      # Anything up to the duplicate
 \K                            # Disregard the match up to here
 ^ \1 $                        # The actual duplicate line to be deleted   
 (?: \r? \n )?                 # Optional linebreak (if last line, then ok)

Upvotes: 3
