Reputation: 5281
I know that to extract a subset of lines from a file I can do:
sed -n 2208202,2218201p file >>new
Is there a way in bash to extract a subset of a file (preserving the exact order) based on words? For example, to extract the first 10,000 words of a file, or words 10000 through 20000?
Upvotes: 3
Views: 71
Reputation: 113824
With this as the test file:
$ cat file
one two
three four five
six seven
eight nine
ten eleven twelve
thirteen
fourteen
Using GNU awk (gawk), let's select words 4 through 10:
$ awk -v RS='[[:space:]]+' '4<=NR && NR<=10{ printf "%s%s",$0,RT } END{print""}' file
four five
six seven
eight nine
ten
Note that this preserves the whitespace and line breaks of the original file.
-v RS='[[:space:]]+'
This sets awk's record separator to any combination of white space.
4<=NR && NR<=10{ printf "%s%s",$0,RT }
For records 4 through 10, this prints the record followed by whatever whitespace came after it in the input file. RT, which holds the text that matched RS, is a GNU awk extension and is not POSIX.
END{print""}
This prints a final newline which is needed if the final word was not the last on a line.
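As a sketch of how the same idea might be parameterized for the ranges in the question, the function below takes the first and last word numbers and stops reading once the range has been passed. The name word_range and its argument order are made up for illustration, and it assumes GNU awk is installed as gawk, since it relies on RT:

```shell
# Hypothetical helper: word_range FROM TO [FILE...]
# Requires GNU awk, because RT is a gawk extension.
word_range() {
  from=$1 to=$2
  shift 2
  gawk -v from="$from" -v to="$to" -v RS='[[:space:]]+' '
    NR > to    { exit }                    # past the range: stop reading
    NR >= from { printf "%s%s", $0, RT }   # the word plus the whitespace after it
    END        { print "" }                # finish with a newline
  ' "$@"
}
```

Something like word_range 10000 20000 file >>new would then cover the second case from the question. With no file argument the function reads standard input.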
Upvotes: 3
Reputation: 437208
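Assuming that words are runs of non-whitespace characters and that your awk accepts a regular expression as RS (GNU awk does), try: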
awk -v from=10000 -v to=20000 -v RS='[[:space:]]+' 'NR < from {next} NR > to {exit} 1' file
- Simply omit -v from=... to start with the first word.
- This solution prints each word on its own line on output; if, by contrast, you want to preserve the original whitespace between words, see John1024's helpful answer.
RS='[[:space:]]+' defines the input-record separator (RS) as any run of whitespace, which effectively makes each run of non-whitespace characters its own record.
It is this regex-based RS value that makes this solution non-POSIX-compliant; BSD awk, as also used on OS X, stays close to the POSIX spec and therefore doesn't support such an RS value.
NR < from {next} skips input records as long as their 1-based record index NR is below the start index of the range.
NR > to {exit} exits altogether once the record index exceeds the end index of the range. This can be an important optimization with large input files.
1, a common shorthand for { print }, prints each word on its own line, because print prints each input record followed by the value of ORS, the output-record separator, which defaults to \n.
Caveat: A run of whitespace preceding the first word is reported as an empty word (record).
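If a regex RS is not available, one portable alternative (a sketch, not part of the solution above) is to let awk's default field splitting find the words and keep a running count across lines. The function name word_range_posix is hypothetical. Because default field splitting ignores leading blanks, this variant does not report an empty first word:

```shell
# Hypothetical helper: word_range_posix FROM TO [FILE...]
# Uses only POSIX awk features; prints one word per line.
word_range_posix() {
  from=$1 to=$2
  shift 2
  awk -v from="$from" -v to="$to" '
    {
      for (i = 1; i <= NF; i++) {
        n++                      # running word count across all lines
        if (n > to) exit         # past the range: stop reading
        if (n >= from) print $i  # inside the range: emit the word
      }
    }
  ' "$@"
}
```

As with the original command, the early exit avoids scanning the rest of a large file.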
Upvotes: 1
Reputation: 67467
awk to the rescue!
This should work with other awks too.
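$ awk -v n=15 'sum<n && NR>1{print p}   # print the previous line while still short of n words
{p=$0; sum+=NF}
sum>=n{exit}
END{m=n-sum+NF; if(m>NF) m=NF;          # words still needed from the last line read
for(i=1;i<=m;i++) printf "%s ", $i;
print ""}' file.txt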
This script prints the first n words. A range can be implemented similarly.
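One way the range variant might look, as a rough sketch rather than a tested drop-in: count words across lines and, for each line, print only the words whose number falls inside the range. The function name word_range_lines is hypothetical. Line breaks are kept, but whitespace within a line is normalized to single spaces and lines that contribute no words are skipped:

```shell
# Hypothetical helper: word_range_lines FROM TO [FILE...]
# POSIX awk only; keeps the line structure of the selected words.
word_range_lines() {
  from=$1 to=$2
  shift 2
  awk -v from="$from" -v to="$to" '
    {
      out = ""; sep = ""
      for (i = 1; i <= NF; i++) {
        n++                                   # running word count
        if (n > to) break                     # past the end of the range
        if (n >= from) { out = out sep $i; sep = " " }
      }
      if (sep != "") print out                # this line contributed words
      if (n >= to) exit                       # range complete: stop reading
    }
  ' "$@"
}
```

For example, word_range_lines 10000 20000 file.txt would print words 10000 through 20000 with their original line breaks.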
Upvotes: 1