Reputation: 61
I have a file containing 1,700,000 words. I want to do naive stemming of the words: if a word is longer than 6 characters, I delete all characters after the 6th position. For example:
Input:
Everybody is around
Everyone keeps talking
Output:
Everyb is around
Everyo keeps talkin
I wrote the following script:
INPUT=train.txt
while read line; do
for word in $line; do
new="$(echo $word | awk '{print substr($0,1,6);exit}')"
echo -n $new >> train_stem_6.txt
echo -n ' ' >> train_stem_6.txt
done
echo ' ' >> train_stem_6.txt
done < "$INPUT"
This produces the correct output, but it is extremely slow, and since I have 1,700,000 words, it takes forever. Is there a faster way to do this using a bash script?
Thanks a lot,
Upvotes: 1
Views: 112
Reputation: 8406
Pure bash, (i.e. not POSIX), as a one-liner:
while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done < train.txt
...and the same code reformatted for clarity:
while read x ; do
set -- $x
for f in $* ; do
echo -n ${f:0:6}" "
done
echo
done < train.txt
Note: repeated whitespace becomes a single space.
Test run: first wrap the above code in a function that reads standard input:
len6() { while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done ; }
Invoke:
COLUMNS=90 man bash | tail | head -n 5 | len6
Output:
gracef when proces suspen is attemp When a proces is stoppe the
shell immedi execut the next comman in the sequen It suffic to
place the sequen of comman betwee parent to force it into a subshe
which may be stoppe as a unit.
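One caveat with the loop above: it relies on unquoted expansion, so a word such as * would be glob-expanded against the current directory, and read without -r consumes backslashes. The following is a minimal sketch of a variant that avoids both, assuming bash (arrays and ${w:0:6} are not POSIX sh). The function name stem6 is only an illustration, and it also drops the trailing space at the end of each line.

```shell
# Hypothetical helper (name chosen for illustration): truncate every
# whitespace-separated word on stdin to at most 6 characters.
stem6() {
  local -a words
  local w out
  # read -r keeps backslashes; -a splits the line into an array without
  # globbing. The || part handles a last line that lacks a trailing newline.
  while read -r -a words || (( ${#words[@]} > 0 )); do
    out=""
    for w in "${words[@]}"; do
      out+="${w:0:6} "
    done
    # Strip the single trailing space added by the loop.
    printf '%s\n' "${out% }"
  done
}
```

Usage would look like stem6 < train.txt > train_stem_6.txt. As with the original loop, runs of whitespace are collapsed to a single space.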
Upvotes: 0
Reputation: 785058
You can use this GNU awk command with a custom RS:
awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file
Everyb is around
Everyo keeps talkin
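To make the role of RT more visible, here is a small sketch that wraps the same command in a function. It assumes gawk is installed under that name (RT and a regex RS are GNU extensions); the function name stem6_ws is only an illustration. Because each record is a single word and ORS is set to the separator that was actually matched, runs of spaces and tabs should come through unchanged.

```shell
# Hypothetical wrapper (name chosen for illustration) around the gawk command.
# RS='[[:space:]]' makes every whitespace character a record separator;
# RT holds the separator that was matched, so printing with ORS=RT
# reproduces the original whitespace after each truncated word.
stem6_ws() {
  gawk -v RS='[[:space:]]' '{ ORS = RT; print substr($0, 1, 6) }' "$@"
}
```

For example, printf 'Everybody  is\taround\n' | stem6_ws keeps the double space and the tab between the truncated words.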
Timings of 3 commands on 11 MB input file:
sed:
time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' file >/dev/null
real 0m2.913s
user 0m2.878s
sys 0m0.020s
awk command by @andlrc:
time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' file >/dev/null
real 0m1.191s
user 0m1.174s
sys 0m0.011s
My suggested awk command:
time awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file >/dev/null
real 0m1.926s
user 0m1.905s
sys 0m0.013s
So both awk commands finish in roughly comparable time (about 1.2 s and 1.9 s), while sed is slower, which also holds on bigger files.
Timings of 3 commands on a 167 MB file:
$ time awk -v RS='[[:space:]]+' 'RT{ORS=RT} {$1=substr($1, 1, 6)} 1' test > /dev/null
real 0m29.070s
user 0m28.898s
sys 0m0.060s
$ time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' test >/dev/null
real 0m13.897s
user 0m13.805s
sys 0m0.036s
$ time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' test > /dev/null
real 0m40.525s
user 0m40.323s
sys 0m0.064s
Upvotes: 4
Reputation: 47099
You can use awk for this:
awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' train.txt
Breakdown:
{
for(i=1;i<=NF;i++) { # Iterate over each word
$i = substr($i, 1, 6); # Shrink it to a maximum of 6 characters
}
}
1 # Print the row
This will, however, treat "Awesome," as a single word and therefore remove the trailing "e,".
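If trailing punctuation should survive, one possible workaround is to split each field into a head and a trailing run of non-alphanumeric characters, truncate only the head, and glue the tail back on. This is a sketch under assumptions: it treats only ASCII letters and digits as word characters, it handles trailing punctuation only (not leading or embedded), and the function name stem6_punct is made up for illustration.

```shell
# Hypothetical helper (name chosen for illustration): truncate each word to
# 6 characters but keep any trailing run of non-alphanumeric characters.
stem6_punct() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if (match($i, /[^A-Za-z0-9]+$/)) {
        tail = substr($i, RSTART)         # trailing punctuation, e.g. ","
        head = substr($i, 1, RSTART - 1)  # the word itself
      } else {
        tail = ""
        head = $i
      }
      $i = substr(head, 1, 6) tail
    }
  } 1' "$@"
}
```

With this, the input line "Awesome, everybody" becomes "Awesom, everyb". As with the command above, assigning to $i rebuilds the line, so repeated whitespace is collapsed to single spaces.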
Upvotes: 3