Reputation: 61

How to delete all characters starting from the nth position for every word using bash?

I have a file containing 1,700,000 words and I want to do naive stemming on them: if a word is longer than 6 characters, I delete every character after the 6th position. For example:

Input:
Everybody is around
Everyone keeps talking 

Output: 
Everyb is around
Everyo keeps talkin

I wrote the following script:

INPUT=train.txt
while read line; do
    for word in $line; do
        new="$(echo $word | awk '{print substr($0,1,6);exit}')"
        echo -n $new >> train_stem_6.txt
        echo -n ' ' >> train_stem_6.txt
    done
    echo   ' ' >> train_stem_6.txt
done < "$INPUT"

This produces the correct output, but it is extremely slow, and with 1,700,000 words it takes forever. Is there a faster way to do this in a bash script?

Thanks a lot.

Upvotes: 1

Answers (4)

agc

Reputation: 8406

Pure bash (i.e. not POSIX sh), as a one-liner:

while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done < train.txt

...and the same code reformatted for clarity:

while read x ; do
    set -- $x
    for f in $* ; do 
        echo -n ${f:0:6}" "
    done
    echo
done < train.txt

Note: repeated whitespace becomes a single space, and every output line ends with a trailing space.

Test run: first wrap the code above in a function that reads standard input:

len6() { while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done ; }

Invoke:

COLUMNS=90 man bash | tail | head -n 5 | len6

Output:

gracef when proces suspen is attemp When a proces is stoppe the 
shell immedi execut the next comman in the sequen It suffic to 
place the sequen of comman betwee parent to force it into a subshe 
which may be stoppe as a unit.
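
Because $x and $* are unquoted in the loop above, a word such as * would be expanded against the files in the current directory, and read without -r treats backslashes specially. A minimal sketch of a more defensive variant, still pure bash, is shown here. The function name len6_safe is made up for this example; it also avoids the trailing space at the end of each line.

```shell
#!/usr/bin/env bash
# len6_safe: hypothetical helper, reads lines on stdin and writes them to
# stdout with every word cut to at most 6 characters.
len6_safe() {
    local -a words
    local i
    # read -r keeps backslashes literal; -a splits the line into an array
    # without pathname expansion. The "||" part handles a final line that
    # has no trailing newline.
    while read -r -a words || [ "${#words[@]}" -gt 0 ]; do
        for i in "${!words[@]}"; do
            words[i]=${words[i]:0:6}
        done
        # "${words[*]}" joins the elements with a single space (default IFS).
        printf '%s\n' "${words[*]}"
    done
}
```

Invoke it the same way, e.g. len6_safe < train.txt. It is still a shell loop, so expect it to be much slower than the awk and sed answers on a large file.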

Upvotes: 0

anubhava

Reputation: 785058

You can use GNU awk with a custom RS:

awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file

Everyb is around
Everyo keeps talkin
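
To spell out how the command above works: with RS set to a regex, every whitespace character ends a record, so each word becomes its own record. RT holds the text that actually matched RS, and assigning it to ORS writes the original separator back after the truncated word. A small sketch follows, assuming GNU awk is installed as gawk (RT is a gawk extension); the wrapper name stem_rt is only for illustration.

```shell
# stem_rt: hypothetical wrapper around the gawk command shown above.
# Requires GNU awk, because RT does not exist in other awk implementations.
stem_rt() {
    gawk -v RS='[[:space:]]' '{ ORS = RT; print substr($0, 1, 6) }' "$@"
}

# Example call (note the two spaces, which survive in the output):
#   printf 'Everybody  is,around\n' | stem_rt
```

Unlike the field-loop approach, this keeps runs of whitespace exactly as they were in the input. Punctuation still counts as part of the word, so is,around is cut to is,aro.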

Timings of 3 commands on an 11 MB input file:

sed:

time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' file >/dev/null

real    0m2.913s
user    0m2.878s
sys     0m0.020s

awk command by @andlrc:

time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' file >/dev/null

real    0m1.191s
user    0m1.174s
sys     0m0.011s

My suggested awk command:

time awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file >/dev/null

real    0m1.926s
user    0m1.905s
sys     0m0.013s

So both awk commands finish in roughly the same time (the field-loop version is somewhat faster), and sed tends to be slower on bigger files.

Timings of the 3 commands on a 167 MB file:

$ time awk -v RS='[[:space:]]+' 'RT{ORS=RT} {$1=substr($1, 1, 6)} 1' test > /dev/null

real    0m29.070s
user    0m28.898s
sys     0m0.060s

$ time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' test >/dev/null

real    0m13.897s
user    0m13.805s
sys     0m0.036s

$ time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' test > /dev/null

real    0m40.525s
user    0m40.323s
sys     0m0.064s

Upvotes: 4

Andreas Louv

Reputation: 47099

You can use awk for this:

awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' train.txt

Breakdown:

{                          
  for(i=1;i<=NF;i++) {      # Iterate over each word
    $i = substr($i, 1, 6);  # Shrink it to a maximum of 6 characters
  }                         
}                           
1                           # Print the row

Note, however, that this treats "Awesome," (including the trailing comma) as one word, so it is cut to "Awesom" and both the "e" and the comma are removed.
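
If trailing punctuation should survive, one option is to cut only the leading run of letters in each field and re-attach whatever follows it. This is a sketch under that assumption, not part of the original command; the name stem_keep_punct is invented, and only ASCII letters are treated as word characters.

```shell
# stem_keep_punct: hypothetical variant of the field loop above.
# Uses only POSIX awk features (match, RLENGTH, substr).
stem_keep_punct() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # match() sets RLENGTH to the length of the leading letter run.
            if (match($i, /^[A-Za-z]+/) && RLENGTH > 6)
                $i = substr($i, 1, 6) substr($i, RLENGTH + 1)
        }
        print
    }' "$@"
}

# Example call:
#   printf 'Awesome, everybody!\n' | stem_keep_punct
```

As with the original loop, assigning to a field makes awk rebuild the line with single spaces, so repeated whitespace collapses on every line where a word was shortened.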

Upvotes: 3

gudok

Reputation: 4179

Have you considered using sed?

sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g'
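
For reference, the expression captures the first six letters of a word in group 1, matches the remaining letters with the trailing bracket expression, and replaces the whole match with group 1 only. Words of six letters or fewer never match and are left alone, as are whitespace and punctuation. Below is a sketch of the same idea using -E, which both GNU and BSD sed accept, and [[:alpha:]] in place of a-zA-Z. The function name stem_sed is hypothetical, and what [[:alpha:]] matches depends on the current locale.

```shell
# stem_sed: hypothetical wrapper; reads stdin or the files given as arguments.
stem_sed() {
    # ([[:alpha:]]{6}) keeps the first six letters, [[:alpha:]]+ consumes
    # the rest of the word, and \1 writes back only the kept part.
    sed -E 's/([[:alpha:]]{6})[[:alpha:]]+/\1/g' "$@"
}

# Example call:
#   stem_sed train.txt > train_stem_6.txt
```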

Upvotes: 3
