Reputation: 9071
I'm trying to extract a certain field (the fourth) from a column-based, space-adjusted text stream. I'm trying to use the cut
command in the following manner:
cat text.txt | cut -d " " -f 4
Unfortunately, cut
doesn't treat several spaces as one delimiter. I could have piped through awk
awk '{ printf $4; }'
or sed
sed -E "s/[[:space:]]+/ /g"
to collapse the spaces, but I'd like to know if there is any way to make cut
handle several delimiters natively?
Upvotes: 340
Views: 175096
Reputation: 289505
As you comment in your question, awk
is really the way to go. Using cut
is also possible, together with tr -s
to squeeze spaces, as kev's answer shows.
Let me nevertheless go through all the possible combinations for future readers. Explanations are in the Test section below.
tr -s ' ' < file | cut -d' ' -f4
awk '{print $4}' file
while read -r _ _ _ myfield _
do
echo "fourth field: $myfield"
done < file
sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' file
shopt -s extglob
file_contents="$(< file)"
cut -d' ' -f4 <<< "${file_contents// +( )/ }"
Given this file, let's test the commands:
$ cat a
this   is   line   1   more   text
this   is   line   2   more   text
this   is   line   3   more   text
this   is   line   4   more   text
$ cut -d' ' -f4 a
is
is
is
is
# it does not show what we want!
$ tr -s ' ' < a | cut -d' ' -f4
1
2
3
4
# this makes it!
$ awk '{print $4}' a
1
2
3
4
This reads the fields sequentially. By using _
we indicate a throwaway ("junk") variable, so those fields are ignored. This way, $myfield
stores the 4th field of the line, no matter how many spaces are in between.
$ while read -r _ _ _ a _; do echo "4th field: $a"; done < a
4th field: 1
4th field: 2
4th field: 3
4th field: 4
This matches three groups of non-spaces followed by spaces with ([^ ]*[ ]*){3}
. Then it captures everything up to the next space as the 4th field, which is finally printed with \2
.
$ sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' a
1
2
3
4
This is practically equivalent to the tr
solution; however, it eliminates the additional process by using bash built-ins. It first reads the contents of the file into the parameter file_contents
, then uses shell parameter expansion to replace two or more consecutive spaces with a single space, and finally uses a here string to supply the expansion to cut
's standard input.
The pattern used, " +( )" (a space followed by the extended glob +( )
), matches one space and then one or more further spaces, i.e. two or more spaces in total. Note that the +()
construct only works when the extglob shell option is enabled. Furthermore, $(< f)
is a shorter and faster way to write $(cat f)
.
$ shopt -s extglob
$ file_contents="$(< a)"
$ cut -d' ' -f4 <<< "${file_contents// +( )/ }"
Upvotes: 107
Reputation: 390
I've implemented a patch that adds a new -m
command-line option to cut(1)
, which works in the field mode and treats multiple consecutive delimiters as a single delimiter. This basically solves the OP's question in a rather efficient way, by treating several spaces as one delimiter right within cut(1)
.
In particular, with my patch applied, the following command performs the desired operation; it's as simple as adding -m
to the command line:
cat text.txt | cut -d ' ' -m -f 4
I also submitted this patch upstream, and let's hope that it will eventually be accepted and merged into the coreutils project.
There are some further thoughts about adding even more whitespace-related features to cut(1)
, and having some feedback on all that from different people would be great, preferably on the coreutils mailing list. I'm willing to implement more patches for cut(1)
and submit them upstream, which would make this utility more versatile and more usable in various real-world scenarios.
Upvotes: 2
Reputation: 161614
Try:
tr -s ' ' <text.txt | cut -d ' ' -f4
From the tr
man page:
-s, --squeeze-repeats
       replace each input sequence of a repeated character that is listed in SET1 with a single occurrence of that character
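For a quick concrete illustration of the squeeze step, here is a hypothetical pipeline (the input string below is made up, not the OP's text.txt):

```shell
# Runs of spaces are squeezed to one, so field numbering becomes predictable:
printf 'one  two   three four\n' | tr -s ' ' | cut -d' ' -f3
# prints "three"
```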
Upvotes: 597
Reputation: 3451
This Perl one-liner shows how closely Perl is related to awk:
perl -lane 'print $F[3]' text.txt
However, the @F
autosplit array starts at index $F[0]
, while awk fields start with $1.
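The off-by-one between the two is easy to see with a quick comparison (the input here is made up for illustration, not the OP's text.txt):

```shell
# Perl's @F is 0-based, awk's fields are 1-based:
printf 'a  b  c\n' | perl -lane 'print $F[2]'   # prints "c"
printf 'a  b  c\n' | awk '{print $3}'           # prints "c" (same field)
```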
Upvotes: 4
Reputation: 5952
After becoming frustrated with the many limitations of cut
, I wrote my own replacement, which I called cuts
for "cut on steroids".
cuts provides what is likely the most minimalist solution to this and many other related cut/paste problems.
One example, out of many, addressing this particular question:
$ cat text.txt
0 1 2 3
0 1 2 3 4
$ cuts 2 text.txt
2
2
cuts
supports many additional features (including paste-like column operations), none of which is provided by standard cut
.
See also: https://stackoverflow.com/a/24543231/1296044
Source and documentation (free software): http://arielf.github.io/cuts/
Upvotes: 27
Reputation: 79165
With versions of cut
I know of, no, this is not possible. cut
is primarily useful for parsing files where the separator is not whitespace (for example /etc/passwd
) and that have a fixed number of fields. Two separators in a row mean an empty field, and that goes for whitespace too.
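The empty-field behavior is easy to demonstrate with a small made-up input; the same logic applies whether the delimiter is a colon or a space:

```shell
# Two delimiters in a row produce an empty field between them:
printf 'a::b\n' | cut -d: -f2    # prints an empty line
printf 'a::b\n' | cut -d: -f3    # prints "b"
# Same with spaces: in "a  b", field 2 is empty and "b" is field 3
printf 'a  b\n' | cut -d' ' -f2  # prints an empty line
printf 'a  b\n' | cut -d' ' -f3  # prints "b"
```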
Upvotes: 3