chomp

Record the lines on which each word in a given file appears, using awk

I'm having a few problems doing this. The output needs to be in the following format: each line starts with a word, followed by a colon “:”, a space, and then the comma-separated list of line numbers on which the word appears. If a word appears multiple times on a line, that line number should be reported only once.

Command line: index.awk test1.txt > new.output.txt

My code (currently):

    #!/bin/awk -f

    Begin {lineCount=1}                    # start line count at 1

    {
        for (i = 1; i <= NF; i++)          # loop through starting with position 1
           for ( j = 2; j <= NF; j++)      # have something to compare
              if ( $i == $j )              # see if they match
                  print $i ":" lineCount   # if they do print the word and line number
                  lineCount++              # increment the line number

    }

You'll notice in the sample output below that it completely skips over the first line of the input text file. It counts correctly from there. How can I print every line a word occurs on if it appears more than once? Also, is there a function native to awk that can handle erroneous characters such as punctuation, numbers, [], (), etc.?

(EDIT: gsub(regexp, replacement, target) can remove these erroneous characters from the text.)

Sample INPUT: I would like to print out each word, and the corresponding lines which the word occurs on. I need to make sure I omit the punctuation's from the strings when printing them out. As well, I need to make sure if the word occurs more than once on a line not to print the line number twice.

SAMPLE OUTPUT: 

I:
would:
like:
to:
print:
out:
each:
word:
and,:
the:1
corresponding:
lines:
which:
the:
word:
occurs:
on.:
I:1
need:1
to:1
make:1
sure:1
... etc. (outputs the line numbers correctly from here)

Answers (1)

John1024

awk '{delete u;for (i=1;i<=NF;i++) u[$i]=1; for (i in u) cnt[i]=cnt[i]NR","} END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}' input

As an example (somewhat shorter text than your example):

$ cat file
I and I and I went
here and here and there
and then home

$ awk '{delete u;for (i=1;i<=NF;i++) u[$i]=1; for (i in u) cnt[i]=cnt[i]NR","} END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}' file
there: 2
went: 1
here: 2
and: 1,2,3
then: 3
I: 1
home: 3

How it works

The program uses three variables: i, u and cnt. u is used to create a unique list of words on each line. cnt is used to track the line numbers for each word. i is used as a temporary variable in loops.

This code relies on the fact that awk implicitly loops over every line in a file. After the last line is read, the END clause is executed, which displays the results.

Considering each command in turn:

  • delete u

    At the start of each line, we want the array u to be empty.

  • for (i=1;i<=NF;i++) u[$i]=1

    Create an entry in array u for each word on the line.

  • for (i in u) cnt[i]=cnt[i]NR","

    For each word on the line, add the current line number to the array cnt.

  • END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}

    After processing the last line, print out each entry in array cnt. Each entry in cnt has an extra trailing comma. That comma is removed with the sub command. Then printf formats the output.
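
The steps above can also be laid out as a multi-line program. This is only a sketch of the same logic, not a different method. The function name word_index, the loop variable w (used instead of reusing i) and the final LC_ALL=C sort are additions for illustration. The sort is there because awk does not specify the order in which for (i in cnt) visits the keys, so unsorted output can differ between awk implementations.

```shell
# word_index is a hypothetical wrapper name; it reads the files given as
# arguments, or standard input when no file is given.
word_index() {
    awk '
    {
        delete u                         # forget the words seen on the previous line
        for (i = 1; i <= NF; i++)        # visit every field (word) on this line
            u[$i] = 1                    # duplicates collapse into a single key
        for (w in u)                     # each distinct word on this line
            cnt[w] = cnt[w] NR ","       # append the current line number
    }
    END {
        for (w in cnt) {
            sub(/,$/, "", cnt[w])        # drop the trailing comma
            printf "%s: %s\n", w, cnt[w]
        }
    }' "$@" | LC_ALL=C sort              # added: make the output order reproducible
}

# Same three-line sample as in the example above, fed through standard input.
printf '%s\n' 'I and I and I went' 'here and here and there' 'and then home' | word_index
```

With the three-line sample, the entries are the same as in the example output above, only sorted: I: 1 comes first because uppercase sorts before lowercase in the C locale.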

Refinements

Suppose that we want to ignore differences in case. To do that, we can convert all words to lower case:

$0=tolower($0)

If we also want to ignore punctuation, we can remove it:

gsub(/[-.,"!?/]/," ")
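
To see what these two refinements do on their own, here is a small sketch that applies them to a single made-up line and prints the resulting words. The function name normalize and the sample sentence are assumptions for illustration. The slash inside the bracket expression is written as \/ here, which should be the more portable spelling across awk implementations.

```shell
# normalize is a hypothetical helper: lower-case each line, replace the
# listed punctuation with spaces, then print the words awk sees afterwards.
normalize() {
    awk '{
        $0 = tolower($0)                 # ignore differences in case
        gsub(/[-.,"!?\/]/, " ")          # punctuation becomes a word separator
        for (i = 1; i <= NF; i++)        # fields are re-split after $0 changes
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }'
}

printf '%s\n' 'Hello, World! "Well-known" and/or new?' | normalize
# prints: hello world well known and or new
```

Because the punctuation is replaced by spaces rather than deleted, "well-known" and "and/or" are split into two words each.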

Putting it all together:

awk '{delete u;$0=tolower($0);gsub(/[-.,"!?/]/," ");for (i=1;i<=NF;i++) u[$i]=1; for (i in u) cnt[i]=cnt[i]NR","} END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}' file
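
Since the question runs the program from a script file, the combined command could also be stored in index.awk and run with awk -f. The following is a sketch under a few assumptions: the temporary directory, the two-line sample input and the trailing LC_ALL=C sort are only there for the demonstration (the sort makes the otherwise unspecified output order reproducible), and the slash in the bracket expression is escaped as \/ for portability.

```shell
# Demo directory so that no existing files are overwritten.
dir=$(mktemp -d)

# The program itself; the file name index.awk follows the question.
cat > "$dir/index.awk" <<'EOF'
{
    delete u                             # start each line with an empty word set
    $0 = tolower($0)                     # ignore case
    gsub(/[-.,"!?\/]/, " ")              # turn the listed punctuation into spaces
    for (i = 1; i <= NF; i++) u[$i] = 1  # one key per distinct word on this line
    for (w in u) cnt[w] = cnt[w] NR ","  # record the line number once per word
}
END {
    for (w in cnt) {
        sub(/,$/, "", cnt[w])            # drop the trailing comma
        printf "%s: %s\n", w, cnt[w]
    }
}
EOF

# Hypothetical sample input, named test1.txt as in the question.
printf '%s\n' 'The cat sat.' 'the dog, the cat!' > "$dir/test1.txt"

awk -f "$dir/index.awk" "$dir/test1.txt" | LC_ALL=C sort
# cat: 1,2
# dog: 2
# sat: 1
# the: 1,2
```

"The" and "the" are merged into one entry, and "the" is reported once for line 2 even though it occurs there twice.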
