Reputation: 5281

splitting a text file based on number of words in bash

I know to extract a subset of lines from a file I can do:

sed -n 2208202,2218201p file >>new

Is there a way in bash to extract a subset of a file (preserving the exact order) based on words? For example to extract the top 10k words of a file, or words from 10000 to 20000?

Upvotes: 3

Answers (3)

John1024

Reputation: 113824

With this as the test file:

$ cat file
one two
three four five
six seven
eight nine
ten eleven twelve
thirteen
fourteen

Using GNU awk (gawk), let's select words 4 through 10:

$ awk -v RS='[[:space:]]+' '4<=NR && NR<=10{ printf "%s%s",$0,RT } END{print""}' file
four five
six seven
eight nine
ten

Note that this preserves the whitespace and line breaks of the original file.

How it works

-v RS='[[:space:]]+'

This sets awk's record separator to any run of whitespace, so each word becomes its own record.

4<=NR && NR<=10{ printf "%s%s",$0,RT }

For records 4 through 10, this prints the record followed by whatever whitespace followed it in the input file. RT is a GNU awk extension, not part of POSIX.

END{print""}

This prints a final newline, which is needed if the final word was not the last on a line.
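If the range should not be hard-coded, the same idea can be wrapped in a small shell function. This is only a sketch: the function name words_keep_ws and the variables from, to and sep are made up for illustration, and it calls gawk explicitly because RT is a GNU extension. Unlike the one-liner above, it stops reading once the range is complete and writes the saved whitespace before each word rather than after it, so the output does not end in stray blanks.

```shell
# Hypothetical helper: print words FROM..TO of FILE, keeping the original
# whitespace between them. Needs GNU awk (gawk) because it relies on RT.
words_keep_ws() {
  gawk -v from="$1" -v to="$2" -v RS='[[:space:]]+' '
    NR > to    { exit }                  # past the range: stop reading (END still runs)
    NR >= from { printf "%s%s", sep, $0  # whitespace that followed the previous word
                 sep = RT }              # remember the whitespace after this word
    END        { print "" }              # finish the output with a newline
  ' "$3"
}
```

For the test file above, words_keep_ws 4 10 file should give the same four lines as the one-liner.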

Upvotes: 3

mklement0

Reputation: 437208

Assuming that:

you define word as any run of non-whitespace characters
you use GNU Awk or Mawk

try:

awk -v from=10000 -v to=20000 -v RS='[[:space:]]+' 'NR < from {next} NR > to {exit} 1' file

- Simply omit -v from=... to start with the first word.

- This solution prints each word on its own line on output; if, by contrast, you want to preserve the original whitespace between words, see John1024's helpful answer.

- RS='[[:space:]]+' defines the input-record separator (RS) as any run of whitespace, which effectively makes each run of non-whitespace characters its own record.
  - It is the use of a multi-character RS value that makes this solution non-POSIX-compliant; BSD awk, as also used on OS X, stays close to the POSIX spec and therefore doesn't support such an RS value.
- NR < from {next} skips input records as long as their 1-based record index NR is below the start index of the range.
- NR > to {exit} exits altogether once the record index exceeds the end index of the range. This can be an important optimization with large input files.
- 1, a common shorthand for { print }, prints each word on its own line, because print prints each input record followed by the value of ORS, the output-record separator, which defaults to \n.
- Caveat: A run of whitespace preceding the first word is reported as an empty word (record).
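If a multi-character RS is not available, one possible workaround is to keep awk's default field splitting and count the fields yourself. The sketch below uses only POSIX awk features; the function name words_range and the counter n are made up for illustration. Because default field splitting ignores leading whitespace, the empty-first-record caveat does not apply here.

```shell
# Hypothetical helper: print words FROM..TO of FILE, one word per line.
# Uses only POSIX awk features (default field splitting, single-char RS).
words_range() {
  awk -v from="$1" -v to="$2" '
    {
      for (i = 1; i <= NF; i++) {
        n++                       # running 1-based word count across lines
        if (n >= from) print $i   # inside the range: emit the word
        if (n >= to) exit         # range complete: stop reading input
      }
    }
  ' "$3"
}
```

Usage mirrors the command above, for example words_range 10000 20000 file.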

Upvotes: 1

karakfa

Reputation: 67467

awk to the rescue!

This should work with other awks too:

$ awk -v n=15 'sum<n && p{print p} 
                         {p=$0; sum+=NF} 
                   sum>=n{exit} 
                      END{for(i=1;i<=n-sum+NF;i++) printf "%s ", $i; 
                          print ""}' file.txt

This script prints the first n words; a range can be implemented similarly.
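One way the range variant might look, as a rough sketch rather than karakfa's own code: the function name words_range_lines and the variables from, to, out and sep are made up for illustration. It keeps the input's line breaks but joins the words of each line with a single space, and lines that contain no words from the range (including blank lines) are dropped.

```shell
# Hypothetical helper: print words FROM..TO of FILE, keeping line breaks.
# Words taken from the same input line are joined by a single space.
words_range_lines() {
  awk -v from="$1" -v to="$2" '
    {
      out = ""; sep = ""
      for (i = 1; i <= NF; i++) {
        n++                                  # running 1-based word count
        if (n >= from && n <= to) { out = out sep $i; sep = " " }
        if (n >= to) break                   # nothing more to collect
      }
      if (sep != "") print out               # this line contributed words
      if (n >= to) exit                      # range complete: stop reading
    }
  ' "$3"
}
```

With the test file from John1024's answer, words_range_lines 4 10 file should print four lines: "four five", "six seven", "eight nine" and "ten".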

Upvotes: 1
