mbaitoff

Reputation: 9071

How to make the 'cut' command treat the same sequential delimiters as one?

I'm trying to extract a certain (the fourth) field from the column-based, 'space'-adjusted text stream. I'm trying to use the cut command in the following manner:

cat text.txt | cut -d " " -f 4

Unfortunately, cut doesn't treat several spaces as one delimiter. I could pipe the text through awk

awk '{ print $4 }'

or sed

sed -E "s/[[:space:]]+/ /g"

to collapse the spaces, but I'd like to know if there is any way to handle several consecutive delimiters natively with cut.

Upvotes: 340

Views: 175096

Answers (6)

fedorqui

Reputation: 289505

As you note in your question, awk is really the way to go. Using cut is also possible if you combine it with tr -s to squeeze the spaces, as kev's answer shows.

Let me, however, go through all the possible combinations for future readers. Explanations are in the Tests section.

tr | cut

tr -s ' ' < file | cut -d' ' -f4

awk

awk '{print $4}' file

bash

while read -r _ _ _ myfield _
do
   echo "fourth field: $myfield"
done < file

sed

sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' file

bash (Parameter Expansion)

shopt -s extglob
file_contents="$(< file)"
cut -d' ' -f4 <<< "${file_contents// +( )/ }"

Tests

Given this file, let's test the commands:

$ cat a
this   is    line     1 more text
this      is line    2     more text
this    is line 3     more text
this is   line 4            more    text

tr | cut

$ cut -d' ' -f4 a
is
                        # it does not show what we want!


$ tr -s ' ' < a | cut -d' ' -f4
1
2                       # this makes it!
3
4
$

awk

$ awk '{print $4}' a
1
2
3
4
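
As a small sketch of why awk is the more robust choice (assuming a POSIX awk and a made-up sample line): the default field splitting also ignores leading blanks and treats tabs like spaces, which the tr -s ' ' | cut pipeline does not.

```shell
# Made-up sample line: leading spaces and a tab between the first two words.
sample="$(printf '   this \t is   line    1 more text')"

# awk's default field separator splits on runs of blanks (spaces and tabs)
# and skips leading blanks, so $4 is still the fourth word.
printf '%s\n' "$sample" | awk '{ print $4 }'
```

With this input the command prints 1.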

bash

This reads the fields sequentially. The _ is a throwaway ("junk") variable that receives the fields we want to ignore. This way, $myfield holds the 4th field of each line, no matter how many spaces separate the fields.

$ while read -r _ _ _ a _; do echo "4th field: $a"; done < a
4th field: 1
4th field: 2
4th field: 3
4th field: 4
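
The same loop can be wrapped in a function so it works on any stream. This is only a sketch: fourth_field is a hypothetical helper name, and it relies on the default IFS (space, tab, newline).

```shell
# fourth_field is a hypothetical helper name, not part of any tool.
# It reads standard input line by line and prints the fourth
# whitespace-separated field of each line.
fourth_field() {
    while read -r _ _ _ field _; do
        printf '%s\n' "$field"
    done
}

printf 'this   is    line     1 more text\n' | fourth_field
```

Lines with fewer than four fields produce an empty output line, because read leaves the unmatched variable empty.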

sed

This matches three groups of non-space characters followed by spaces with ([^ ]*[ ]*){3}. Then it captures everything up to the next space as the 4th field, which is finally printed with \2.

$ sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' a
1
2
3
4
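
Note that -r is the GNU spelling of the extended-regex flag. As a hedged sketch, the same substitution can be written with -E, which both GNU sed and BSD sed accept; the input line is the first line of the test file above.

```shell
# Same substitution as above, using -E instead of the GNU-only -r.
# The first group consumes three "word plus spaces" runs; \2 is the 4th word.
printf 'this   is    line     1 more text\n' |
    sed -E 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/'
```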

bash (Parameter Expansion)

This is practically equivalent to the tr solution, but it avoids the additional process by using bash built-ins. It first reads the contents of the file into the variable "file_contents", then uses shell parameter expansion to replace two or more consecutive spaces with one space, and finally uses a here string to supply the result to cut's standard input.

The pattern used, a space followed by +( ), matches one space and then one or more spaces, so it effectively matches two or more spaces. Note that the +() construct only works when the extglob shell option is enabled. Furthermore, $(< f) is a shorter and faster way to write $(cat f).

$ shopt -s extglob
$ file_contents="$(< file)"
$ cut -d' ' -f4 <<< "${file_contents// +( )/ }"

Upvotes: 107

dsimic

Reputation: 390

I've implemented a patch that adds a new -m command-line option to cut(1), which works in the field mode and treats multiple consecutive delimiters as a single delimiter. This basically solves the OP's question in a rather efficient way, by treating several spaces as one delimiter right within cut(1).

In particular, with my patch applied, the following command will perform the desired operation. It's as simple as adding -m to the command line:

cat text.txt | cut -d ' ' -m -f 4

I also submitted this patch upstream, and let's hope that it will eventually be accepted and merged into the coreutils project.

There are some further thoughts about adding even more whitespace-related features to cut(1), and having some feedback on all that from different people would be great, preferably on the coreutils mailing list. I'm willing to implement more patches for cut(1) and submit them upstream, which would make this utility more versatile and more usable in various real-world scenarios.

Upvotes: 2

kev

Reputation: 161614

Try:

tr -s ' ' <text.txt | cut -d ' ' -f4

From the tr man page:

-s, --squeeze-repeats   replace each input sequence of a repeated character
                        that is listed in SET1 with a single occurrence
                        of that character
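
One caveat worth sketching: tr -s only collapses runs of spaces, it does not remove them, so a line that starts with spaces still has an empty first field as far as cut is concerned. The input below is made up, and stripping the leading spaces with sed is just one possible workaround, not something this answer prescribes.

```shell
# Made-up input with leading spaces.
line='   alpha   beta   gamma'

# After squeezing, the line is " alpha beta gamma": field 1 is empty,
# so -f2 returns the first word ("alpha").
printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f2

# One possible workaround: drop leading spaces before squeezing,
# so -f2 returns the second word ("beta").
printf '%s\n' "$line" | sed 's/^ *//' | tr -s ' ' | cut -d ' ' -f2
```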

Upvotes: 597

Chris Koknat

Reputation: 3451

This Perl one-liner shows how closely Perl is related to awk:

perl -lane 'print $F[3]' text.txt

However, the @F autosplit array starts at index 0 ($F[0]), while awk fields start at $1.

Upvotes: 4

arielf

Reputation: 5952

shortest/friendliest solution

After becoming frustrated with the many limitations of cut, I wrote my own replacement, which I called cuts for "cut on steroids".

cuts provides what is likely the most minimalist solution to this and many other related cut/paste problems.

One example, out of many, addressing this particular question:

$ cat text.txt
0   1        2 3
0 1          2   3 4

$ cuts 2 text.txt
2
2

cuts supports:

  • auto-detection of most common field-delimiters in files (+ ability to override defaults)
  • multi-char, mixed-char, and regex matched delimiters
  • extracting columns from multiple files with mixed delimiters
  • offsets from end of line (using negative numbers) in addition to start of line
  • automatic side-by-side pasting of columns (no need to invoke paste separately)
  • support for field reordering
  • a config file where users can change their personal preferences
  • great emphasis on user friendliness & minimalist required typing

and much more, none of which is provided by standard cut.

See also: https://stackoverflow.com/a/24543231/1296044

Source and documentation (free software): http://arielf.github.io/cuts/

Upvotes: 27

Benoit

Reputation: 79165

With the versions of cut I know of, no, this is not possible. cut is primarily useful for parsing files where the separator is not whitespace (for example /etc/passwd) and that have a fixed number of fields. Two separators in a row mean an empty field, and that goes for whitespace too.
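
To illustrate the empty-field behaviour, here is a minimal sketch with made-up input. The passwd-style record is a hypothetical line fed through printf, not read from a real /etc/passwd.

```shell
# Two spaces between "a" and "b" mean there is an empty second field.
printf 'a  b\n' | cut -d ' ' -f2    # prints an empty line
printf 'a  b\n' | cut -d ' ' -f3    # prints "b"

# Where cut fits well: a non-whitespace delimiter and a fixed field count.
# Hypothetical passwd-style record; the empty 5th field is preserved.
printf 'alice:x:1000:1000::/home/alice:/bin/sh\n' | cut -d: -f1,7
```

The last command prints alice:/bin/sh, because the empty field between the two colons still counts as field 5.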

Upvotes: 3
