Mildred Shimz

Reputation: 617

Finding the longest word in a text file

I am trying to write a simple bash script that finds the longest word in a text file and reports its length. I know this is simple and straightforward with awk, but I want to try a different method. For example, if a=wmememememe, I can get its length with echo ${#a} and the word itself with echo ${a}. I want to apply that to the loop below:

for i in `cat so.txt`; do

Here so.txt contains the words. I hope that makes sense.

Upvotes: 21

Views: 19663

Answers (8)

BlessedKey

Reputation: 1635

A bash one-liner:

sed 's/ /\n/g' YOUR_FILENAME | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
  1. read file and split the words (via sed)
  2. remove duplicates (via sort | uniq)
  3. prefix each word with its length (awk)
  4. sort the list by the word length
  5. print the single word with greatest length.

Yes, this will be slower than some of the other solutions, but it also doesn't require remembering the semantics of bash for loops.
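The five steps above can be sketched as a small function. This is a variant under stated assumptions, not the exact command from the answer: it swaps `sed 's/ /\n/g'` for `tr -s '[:space:]' '\n'` so that tabs and runs of spaces also split words (`\n` in a sed replacement is a GNU extension), and the function name `longest_word` is made up for illustration.

```shell
# longest_word FILE: print "<length> <word>" for the longest word in FILE.
# Hypothetical helper name; the stages mirror the numbered steps.
longest_word() {
    tr -s '[:space:]' '\n' < "$1" |          # 1. one word per line
        sort -u |                            # 2. drop duplicates
        awk 'NF { print length($0), $0 }' |  # 3. prefix each word with its length
        sort -nr |                           # 4. longest first
        head -n 1                            # 5. keep the top entry
}
```

Called as `longest_word so.txt`, it prints the length and the word separated by a space. When several words share the maximum length, which one is printed depends on how `sort` orders the tied lines.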

Upvotes: 33

agc

Reputation: 8406

  1. Relatively speedy bash function using no external utils:

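    # Usage: longcount < textfile   (one word per line)
    longcount ()
    {
        local -a c
        local x
        while read -r x; do
            c[${#x}]=$x        # index by length; equal-length words overwrite
        done
        # Take the element at the highest index. The element count ${#c[@]}
        # only equals that index when every shorter length also occurs.
        x=${c[@]: -1}
        echo "${#x}" "$x"
    }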
    

    Example:

    longcount < /usr/share/dict/words
    

    Output:

    23 electroencephalograph's
    
  2. Modified POSIX shell version of jimis' xargs-based answer; still very slow, taking two or three minutes:

    tr "'" '_'  < /usr/share/dict/words |
    xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' | 
    sort -n | tail | tr '_' "'"
    

    Note the leading and trailing tr commands, which work around GNU xargs' difficulty with single quotes.

Upvotes: 1

jimis

Reputation: 902

Slow because of the gazillion forks, but pure shell; it does not require awk or special bash features:

$ cat /usr/share/dict/words | \
    xargs -n1 -I '{}' -d '\n'   sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
    sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize

You can easily parallelize, e.g. to 4 CPUs by providing -P4 to xargs.

EDIT: modified to work with the single quotes that some dictionaries have. It now requires GNU xargs because of the -d option.

EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores:

cat /usr/share/dict/words | tr '\n' '\0' | \
    xargs -0 -I {} -n1 -P4  sh -c  'echo ${#1} "$1"'  wordcount {} | \
    sort -n | tail
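The EDIT2 command copes with special characters because the word is never pasted into the script text; it is passed as an argument. With `sh -c 'script' name arg`, the child shell sets `$0` to `name` and `$1` to `arg`, so `wordcount` in the command above only fills the `$0` slot. A minimal sketch of that mechanism without xargs (the helper name `len_and_word` is hypothetical, and `printf` is used here instead of `echo`):

```shell
# len_and_word WORD: print "<length> <word>" from a child sh,
# the way the xargs command does once per word.
len_and_word() {
    # 'wordcount' becomes $0 of the child shell; our "$1" becomes its $1.
    sh -c 'printf "%s %s\n" "${#1}" "$1"' wordcount "$1"
}

len_and_word "it's"    # prints: 4 it's
len_and_word 'a"b'     # prints: 3 a"b
```

Embedding `{}` directly in the script string, as the first version does, lets quotes inside the word terminate the shell string early, which is why that form needed the extra workarounds.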

Upvotes: -1

jaypal singh

Reputation: 77145

awk script:

#!/usr/bin/awk -f

# Initialize two variables
BEGIN {
  maxlength=0;
  maxword=""
}

# Loop through each word on the line
{
  for(i=1;i<=NF;i++)
  {
    # If this word is longer than the longest seen so far, record its
    # length in maxlength and the word itself in maxword.
    if (length($i)>maxlength)
    {
      maxlength=length($i);
      maxword=$i;
    }
  }
}

# Print out the maxword and the maxlength  
END {
  print maxword,maxlength;
}

Textfile:

[jaypal:~/Temp] cat textfile 
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language 
consisting of a set of actions to be taken against textual data (either in files or data streams) 
for the purpose of producing formatted reports. 
The language used by awk extensively uses the string datatype, 
associative arrays (that is, arrays indexed by key strings), and regular expressions.

Test:

[jaypal:~/Temp] ./script.awk textfile 
data_extraction 15

Upvotes: 3

jbleners

Reputation: 1043

for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1
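Note that `paste - so.txt` pairs the Nth length with the Nth line of the file, so the one-liner above assumes exactly one word per line. A sketch that prints each length next to its own word avoids that assumption; the function name `longest_pair` is hypothetical:

```shell
# longest_pair FILE: print "<length><TAB><word>" for the longest word in FILE.
longest_pair() {
    tr -s '[:space:]' '\n' < "$1" |
        # "|| [ -n ... ]" keeps a last word that has no trailing newline.
        while IFS= read -r word || [ -n "$word" ]; do
            # Skip the empty line produced when FILE starts with whitespace.
            [ -n "$word" ] && printf '%d\t%s\n' "${#word}" "$word"
        done |
        sort -n | tail -n 1
}
```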

Upvotes: 0

Fritz G. Mehner

Reputation: 17198

Another solution:

for item in  $(cat "$infile"); do
  length[${#item}]=$item          # use word length as index
done
maxword=${length[@]: -1}          # select last array element

printf "longest word '%s', length %d\n" "${maxword}" "${#maxword}"
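The snippet above relies on two properties of bash indexed arrays: they may be sparse, and a negative offset in `${array[@]: -1}` counts back from the highest index in use rather than from the number of elements. A small sketch with made-up words shows the difference (the space before `-1` is required, otherwise bash reads `:-` as a default-value expansion):

```shell
#!/bin/bash
length=()
for item in pear fig watermelon; do
    length[${#item}]=$item      # fills indices 4, 3 and 10 only
done

echo "${#length[@]}"            # number of elements: prints 3
echo "${!length[@]}"            # indices in use, ascending: prints 3 4 10
echo "${length[@]: -1}"         # element at the highest index: prints watermelon
```

Because words of equal length share an index, the array keeps only the last word seen for each length.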

Upvotes: 8

Dennis Williamson

Reputation: 360345

Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.

#!/bin/bash
longest=0
for word in $(<so.txt)
do
    len=${#word}
    if (( len > longest ))
    then
        longest=$len
        longword=$word
    fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"
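For comparison, here is a minimal sketch of the `while read` form mentioned above. It assumes bash and uses `read -a` to split each line into an array, which avoids the pathname expansion that an unquoted `$(<so.txt)` performs on words such as `*`. The function name `longest_word_read` is made up for illustration.

```shell
#!/bin/bash
# longest_word_read FILE: print the longest word in FILE and its length.
longest_word_read() {
    local longest=0 longword='' word
    local -a words
    # The "|| (( ... ))" part keeps a final line that lacks a trailing newline.
    while read -r -a words || (( ${#words[@]} > 0 )); do
        for word in "${words[@]}"; do
            if (( ${#word} > longest )); then
                longest=${#word}
                longword=$word
            fi
        done
    done < "$1"
    printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"
}
```

As in the script above, the strict `>` comparison means the first of several equally long words is the one reported.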

Upvotes: 14

Rob Wouters

Reputation: 16327

longest=""
for word in $(cat so.txt); do
    if [ ${#word} -gt ${#longest} ]; then
        longest=$word
    fi
done

echo "$longest"

Upvotes: 5
