rcubefather

Reputation: 1564

TCL-REGEX: How to filter lines that appear multiple times in a text file using TCL regexp

Input file (resultnew.txt):

www.maannews.net.

www.maannews.net.

 ################################################# 

attach2.mobile01.com.

www.google-analytics.

attach2.mobile01.com.

attach2.mobile01.com.

www.google-analytics.

attach2.mobile01.com.

attach2.mobile01.com.

attach2.mobile01.com.

attach2.mobile01.com.

attach2.mobile01.com.

www.google.com.

attach2.mobile01.com.

attach2.mobile01.com.

attach2.mobile01.com.

 ################################################# 

cdn-img.mocospace.com

cdn-img.mocospace.com

www.mocospace.com.

cdn-img.mocospace.com

cdn-img.mocospace.com

cdn-img.mocospace.com

www.mocospace.com.

cdn-img.mocospace.com

www.mocospace.com.

www.google-analytics.

www.google-analytics.

fonts.gstatic.com.

cdn-img.mocospace.com

cdn-img.mocospace.com

fonts.gstatic.com.

fonts.gstatic.com.

 ################################################# 

My TCL Script:

set a [open resultnew.txt r]
set b [open balu_output.txt w]

while {[gets $a a1] >= 0} {
    if {[regexp {[a-zA-Z\.]} $a1]} {
        puts $b $a1
    }
}

My Requirement:

  1. From the above text file, I want to remove the lines that appear multiple times and print each of them only once into a new file.
  2. Point 1 should happen between each pair of "#################" dividers; the "#################" lines themselves should still appear in the output file.

Please help me with your ideas. Thanks in advance.

Thanks,

Balu P.

Upvotes: 1

Views: 1379

Answers (2)

narendra

Reputation: 1278

What I understand from your question is that you need distinct values between the divider lines (i.e. the hash lines). Below is the solution you are looking for: the script uses array keys to keep track of unique values, and re-initializes the array whenever the next divider line (your hash comment line) is seen.

I printed the values on STDOUT; you can redirect them to another file.

#!/usr/bin/tclsh
set a [open resultnew.txt r]

# Array used to keep the unique records seen in the current section
array set myarray {}

# For each line in the input
while {[gets $a a1] >= 0} {

    # Get rid of extra spaces
    set a1 [string trim $a1]

    # If a divider line is found, print it (i.e. the #### line)
    if {[string match "#*" $a1]} {
        puts $a1
        # Reset the array for the next set of entries
        array unset myarray
    } elseif {$a1 ne ""} {
        # Print only if the line does not already exist in the array
        if {![info exists myarray($a1)]} {
            puts $a1
        }
        set myarray($a1) 1
    }
}

Output of the script using your input file:

$tclsh main.tcl
www.maannews.net.
#################################################
attach2.mobile01.com.
www.google-analytics.
www.google.com.
#################################################
cdn-img.mocospace.com
www.mocospace.com.
www.google-analytics.
fonts.gstatic.com.
#################################################

Upvotes: 1

Donal Fellows

Reputation: 137787

You need a different way to check whether to ignore lines, and an array is great for doing the uniqueness check. Here's an annotated version:

# For each line in the input
while {[gets $a a1] >= 0} {
    # Get rid of extra spaces
    set a1 [string trim $a1]
    # Ignore empty and comment lines; [string match] is great for this!
    if {$a1 eq "" || [string match "#*" $a1]} {
        continue
    }
    # See if this is the first time we've seen a line
    if {[incr occurrences($a1)] == 1} {
        # It is! Print it now
        puts $b $a1
    }
}

If you have a horribly large file you might eventually run into problems with memory usage. But for files with (up to) just a few million lines, you should be fine.
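Since the snippet above drops the divider lines entirely, here is a sketch (not the answer's code) that combines the `incr` idiom with the question's per-section requirement: dividers are kept in the output and the counters are reset at each one. The sample file (`sample2.txt`) is a made-up stand-in, and `tclsh` is assumed to be on PATH:

```shell
# Hypothetical sample input in the question's divider-separated format
printf '%s\n' x.net x.net '###' y.org x.net > sample2.txt

tclsh <<'EOF' > uniq2.txt
set fh [open sample2.txt r]
while {[gets $fh line] >= 0} {
    set line [string trim $line]
    if {[string match "#*" $line]} {
        puts $line
        unset -nocomplain occurrences   ;# new section: forget what we've seen
        continue
    }
    # [incr] auto-creates the element, returning 1 the first time a key is seen
    if {$line ne "" && [incr occurrences($line)] == 1} {
        puts $line
    }
}
close $fh
EOF

cat uniq2.txt
```

Here `x.net` appears once before the divider and once after it, since each section starts with an empty `occurrences` array.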

Upvotes: 1
