Reputation: 45
Hi, I have a file that I need to put into a format that I can pull into an Excel spreadsheet. I do not know how to do this and I would appreciate it if you could help me out.
Here is the input sample
#1
Indiana University—Bloomington (Kelley)
Bloomington, IN
90 58 82 86
#1
Temple University (Fox)
Philadelphia, PA
95 66 97 95
#1
University of North Carolina—Chapel Hill (Kenan-Flagler)
Chapel Hill, NC
73 58 100 75
#4
Here is the desired output:
#1, Indiana University—Bloomington (Kelley) Bloomington, IN, 90, 58, 82, 86,
#1, Temple University (Fox) Philadelphia, PA, 95, 66, 97, 95,
I'm using shell scripting in Linux.
Thanks
Upvotes: 1
Views: 175
Reputation: 47099
Here is an alternative way of doing it with awk by only manipulating the output field separator (OFS) and the output record separator (ORS):
grep -v '^$' infile | # remove empty lines
awk 'NR%4 { ORS=", "; OFS=" " } NR%4 == 0 { ORS="\n"; OFS=", " } $1=$1'
Output:
#1, Indiana University—Bloomington (Kelley), Bloomington, IN, 90, 58, 82, 86
#1, Temple University (Fox), Philadelphia, PA, 95, 66, 97, 95
#1, University of North Carolina—Chapel Hill (Kenan-Flagler), Chapel Hill, NC, 73, 58, 100, 75
#4,
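For readability, the same program can be written out with comments; $1=$1 is a pattern whose side effect is to reassign the first field, which makes awk rebuild the record with the current OFS before the default print action fires:

grep -v '^$' infile |
awk 'NR%4      { ORS=", "; OFS=" "  }   # lines 1-3 of each record: terminate them with ", "
     NR%4 == 0 { ORS="\n"; OFS=", " }   # 4th line (the numbers): join fields with ", " and end the record
     $1=$1                              # rebuild the line with the current OFS, then print it'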
Upvotes: 0
Reputation: 10039
sed '#n
/[0-9 ]/ s/  */, /g;/^ *$/d;H;$!b;g;s/.//;s/\n\([^#]\)/, \1/g;p' YourFile
The s/\n\([^#]\)/, \1/g replaces each newline that is not followed by a # by a , plus the character that followed it, so each record ends up on one line. If the last , is mandatory (normally not in a csv/excel file), adapt the /[0-9 ]/ s/  */, /g with this:
/[0-9 ]/ {s/  */, /g; s/$/,/;}
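Read top to bottom, the commands do the following:

#n                     on a line of its own, suppresses automatic printing (same as sed -n)
/[0-9 ]/ s/  */, /g    on lines containing digits or spaces, turn runs of spaces into ", "
/^ *$/d                delete blank lines
H                      append the current line to the hold space
$!b                    if this is not the last line, stop here and read the next one
g                      on the last line, copy the accumulated hold space into the pattern space
s/.//                  drop the leading newline that H left at the front
s/\n\([^#]\)/, \1/g    join lines: a newline followed by a non-# character becomes ", " plus that character
p                      print the assembled result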
Upvotes: 0
Reputation: 2236
Awk script to solve the problem:
/^#[0-9]/         { current = $0 }                      # "#1" starts a new record
/\([-A-Za-z ]+\)/ { current = current "," $0 }          # school name: the line containing "(...)"
/[A-Z]+$/         { current = current " " $0 }          # city line: ends with the state code
/^[0-9]+/         { current = current "," $1 "," $2 "," $3 "," $4; print current }  # numbers: finish the record and print it
Usage:
cat yourdatafile | awk -f script.awk > output.csv
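The cat is not strictly necessary, since awk accepts file names directly:

awk -f script.awk yourdatafile > output.csv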
Explanation:
Each of the regexes matches one of the line types and runs the action beside it for that line.
Upvotes: 1
Reputation: 36337
While it's entirely possible to do that with a bit of awk scripting, I'd recommend you don't.
Actually, awk is handy for anything that's not too complex, but here, since you're already planning to use Excel, you might as well import the plain file and then process it in Excel: pivoting, reshaping and splitting it there.
However, I hate Excel's complexity, so here's my python2 approach (saved as program.py and made executable with chmod 755 program.py):
#!/usr/bin/python
import sys

wholefile = open(sys.argv[1], "r").read()
parts = wholefile.split("#")
for item in parts:
    if not item.strip():
        continue  # skip the empty chunk before the first "#"
    lines = item.split("\n")
    # lines[0] is the rank, lines[2] the school, lines[3] the city/state and
    # lines[5] the numbers; the blank lines of the file occupy the other slots
    output = [lines[0], lines[2], lines[3]] + lines[5].split()
    print ";".join(output)
and run this as
./program.py input.txt > output.csv
EDIT: typo, and:
I tend to say this too often, but doing something in a shell script that isn't really centred on invoking a lot of external commands is often far less effective than using a general-purpose scripting language. Python is so abundant everywhere that I seldom find myself writing bash scripts.
EDIT2: Ok, so no python on your host. Scary ;P. Use bash's built-in read function (man read).
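For reference, a rough sketch of that read-based loop, assuming the record layout shown in the question (rank, school, city, numbers; blank lines are skipped); the variable names are only illustrative:

#!/bin/bash
while IFS= read -r line; do
    [ -z "$line" ] && continue            # skip blank lines
    case $line in
        '#'*)   rank=$line ;;             # "#1" starts a record
        [0-9]*) nums=${line// /, }        # "90 58 82 86" -> "90, 58, 82, 86"
                printf '%s, %s %s, %s\n' "$rank" "$name" "$city" "$nums" ;;
        *,*)    city=$line ;;             # a city line such as "Bloomington, IN"
        *)      name=$line ;;             # the school name
    esac
done < input.txt > output.csv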
Upvotes: 0
Reputation: 44023
This is rather simple with GNU awk and mawk if you don't try to use it in a line-based manner. We'll use a # at the beginning of a line as the record separator and a newline as the field separator. Then:
awk -v RS='(^|\n)#' -F'\n' 'NR > 1 { gsub(/ +/, ", ", $6); print "#" $1 ", " $3 " " $4 ", " $6 }' filename
That is:
NR > 1 {                                    # the first record is the empty bit before
                                            # the first separator, so we skip it
    gsub(/ +/, ", ", $6)                    # then: insert commas in the number row
    print "#" $1 ", " $3 " " $4 ", " $6     # and reassemble the record in the right
                                            # format for printing.
}
The use of a regex as record separator is not strictly POSIX-conforming, but between gawk and mawk, you'll have most bases covered.
Upvotes: 3