SIO

Reputation: 305

Replacing/removing excess white space between columns in a file

I am trying to parse a file with similar contents:

I am a string         12831928  
I am another string           41327318   
A set of strings      39842938  
Another string           3242342  

I want the output file to be tab-delimited:

I am a string\t12831928  
I am another string\t41327318   
A set of strings\t39842938  
Another string\t3242342 

I have tried the following:

sed 's/\s+/\t/g' filename > outfile

I have also tried cut, and awk.

Upvotes: 0

Views: 3768

Answers (7)

Gyre

Reputation: 41

Simple and without invisible semantic characters in the code:

    perl -lpe 's/\s+$//; s/\s\s+/\t/' filename

Explanation:

    Options:
      -l: strip the trailing newline from each input line and add it back on print
      -p: loop over records (like awk) and print
      -e: code follows

    Code:
      remove trailing whitespace
      change the first run of two or more whitespace characters to a tab

Tested on OP data. The trailing spaces are removed for consistency.

Upvotes: 0

dawg

Reputation: 103834

You have trailing spaces on each line. So you can do two sed expressions in one go like so:

$ sed -E -e 's/ +$//' -e $'s/  +/\t/' /tmp/file  
I am a string   12831928
I am another string 41327318
A set of strings    39842938
Another string  3242342

Note the $'s/  +/\t/': the $'...' quoting tells bash to replace \t with an actual tab character before invoking sed.

To show that these deletions and \t insertions are in the right place you can do:

$ sed -E -e 's/ +$/X/' -e $'s/  +/Y/' /tmp/file  
I am a stringY12831928X
I am another stringY41327318X
A set of stringsY39842938X
Another stringY3242342X
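
If you want to confirm that a real tab byte ends up in the output, here is a minimal sketch. The function name to_tab is just a hypothetical wrapper around the command above, and od -c is used only to make the tab visible:

```shell
#!/bin/bash
# Hypothetical wrapper around the two sed expressions from this answer.
# $'...' (ANSI-C quoting) makes bash expand \t itself, so sed receives
# a literal tab byte in the replacement.
to_tab() {
    sed -E -e 's/ +$//' -e $'s/  +/\t/'
}

# Sample line copied from the question (note the trailing spaces).
line='I am a string         12831928  '

# od -c shows the tab as \t and the newline as \n.
printf '%s\n' "$line" | to_tab | od -c
```

Because bash does the expansion, this does not depend on whether your sed understands \t on its own.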

Upvotes: 0

Ed Morton

Reputation: 203502

Just use awk:

$ awk -F'  +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1' file
I am a string   12831928
I am another string     41327318
A set of strings        39842938
Another string  3242342

Breakdown:

-F'  +'           # tell awk that input fields (FS) are separated by 2 or more blanks
-v OFS='\t'       # tell awk that output fields are separated by tabs
'{sub(/ +$/,"");  # remove all trailing blank spaces from the current record (line)
$1=$1}            # recompile the current record (line) replacing FSs by OFSs
1'                # idiomatic: any true condition invokes the default action of "print"
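
To see what the $1=$1 step contributes, a small side-by-side sketch may help. The helper names retab and no_rebuild are hypothetical, made up for this illustration:

```shell
# Hypothetical helper: the awk command from this answer, reading stdin.
retab() {
    awk -F'  +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1'
}

# Same command without $1=$1: the record is never rebuilt, so OFS is
# not applied and the original run of spaces is printed unchanged.
no_rebuild() {
    awk -F'  +' -v OFS='\t' '{sub(/ +$/,"")}1'
}

# Sample line copied from the question (note the trailing spaces).
line='A set of strings      39842938  '

# od -c makes the difference visible: \t in the first dump, spaces in the second.
printf '%s\n' "$line" | retab | od -c
printf '%s\n' "$line" | no_rebuild | od -c
```

Assigning to any field forces awk to rebuild $0 from its fields joined by OFS, which is why the otherwise no-op assignment matters here.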

I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

Upvotes: 4

karakfa

Reputation: 67497

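Another approach, with GNU sed and rev. After rev, the trailing spaces sit at the start of each line, so strip them first and then replace the first remaining run of spaces with a tab:

$ rev file | sed -r 's/^ +//; s/ +/\t/' | rev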

Upvotes: 0

Benjamin W.

Reputation: 52132

Your input has spaces at the end of each line, which makes things a little more difficult than without. This sed command would replace the spaces before that last column with a tab:

$ sed 's/[[:blank:]]*\([^[:blank:]]*[[:blank:]]*\)$/\t\1/' infile | cat -A
I am a string^I12831928  $
I am another string^I41327318   $
A set of strings^I39842938  $
Another string^I3242342  $

This matches – anchored at the end of the line – blanks, non-blanks and again blanks, zero or more of each. The last column and the optional blanks after it are captured.

The blanks before the last column are then replaced by a single tab, and the rest stays the same – see output piped to cat -A to show explicit line endings and ^I for tab characters.

If there are no blanks at the end of each line, this simplifies to

sed 's/[[:blank:]]*\([^[:blank:]]*\)$/\t\1/' infile

Note that some sed implementations, notably BSD sed as found on macOS, do not interpret \t as a tab in a substitution. In that case, you have to splice in a literal tab with either '$'\t'' or '"$(printf '\t')"' instead.
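
As a rough sketch of the printf workaround, you can keep the tab in a shell variable and let the shell splice it into the sed script. The names tab and last_col_tab are hypothetical, chosen for this example:

```shell
# Store a literal tab in a variable so the script never relies on
# sed itself understanding \t in the replacement.
tab=$(printf '\t')

# Hypothetical helper: replace the blanks before the last column with one tab.
# Double quotes let the shell expand ${tab}; \$ keeps the end-of-line anchor
# literal, and \( \) \1 reach sed unchanged.
last_col_tab() {
    sed "s/[[:blank:]]*\([^[:blank:]]*[[:blank:]]*\)\$/${tab}\1/"
}

# Sample line copied from the question (note the trailing spaces).
# od -c shows the inserted tab as \t.
printf '%s\n' 'Another string           3242342  ' | last_col_tab | od -c
```

The trailing blanks are kept, as in the first command of this answer.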

Upvotes: 0

ChuckB

Reputation: 638

sed -E 's/[ ][ ]+/\t/g' filename > outfile

NOTE: [ ] is an open bracket, a space, and a close bracket.

-E enables extended regular expression support.

The doubled bracket expression [ ][ ]+ matches only runs of two or more consecutive spaces, so the single spaces between words are left alone.

Tested on MacOS and Ubuntu versions of sed.

Upvotes: 0

David C. Rankin

Reputation: 84561

The difficulty comes from the varying number of words per line. While you can handle this with awk, a simple script that reads each word of a line into an array and then tab-delimits the last word will work as well:

#!/bin/bash

fn="${1:-/dev/stdin}"

while read -r line || test -n "$line"; do
    read -r -a arr <<< "$line"    # split on whitespace, no glob expansion
    nword=${#arr[@]}
    for ((i = 0; i < nword - 1; i++)); do
        test "$i" -eq '0' && word="${arr[i]}" || word=" ${arr[i]}"
        printf "%s" "$word"
    done
    printf "\t%s\n" "${arr[i]}"
done < "$fn"

Example Use/Output

(using your input file)

$ bash rfmttab.sh < dat/tabfile.txt
I am a string   12831928
I am another string     41327318
A set of strings        39842938
Another string  3242342

Each number is tab-delimited from the rest of the string. Look it over and let me know if you have any questions.

Upvotes: 0
