Reputation: 353

Shell script to extract data from text file

I have made a shell script that is supposed to extract data with certain field names and put them in a CSV file.

An example input file may have the following lines:

                  user_name: [email protected]
                      EMAIL: [email protected]
                 FIRST_NAME: jonathan
                  LAST_NAME: doestein
              CREATION_DATE: 2013-08-01 01:08:52
        REGISTRATION_STATUS: Y
                     VENDOR: vendorname

This will repeat itself 'n' times.

This is an excerpt of the script I wrote so far:

#!/bin/sh

echo "Please enter input file name."
read input_variable
echo "You entered: $input_variable"

echo "Please enter a name of the new output file."
read output_file
touch $output_file
echo "The output file name is going to be $output_file"

echo "Extracting files..."  ;

awk '$1 ~ /^(user_name:|EMAIL:|FIRST_NAME:|LAST_NAME:|CREATION_DATE:|REGISTRATION_STATUS:)$/{printf "%s,",$2} $1 ~ /REGISTRATION_STATUS:/{print $2}' $input_variable >> $output_file.ib ;

Data does print to my output file, which must have a .csv extension so that a GUI can open it. However, when I open the file in a GUI such as OpenOffice Calc, many records are concatenated into a single row, while other records start on a new line as they are supposed to.

For example, one line might look like the following:

[email protected],noreally51,noway,username,username...x40 or so

username,username,username... That is, it lists about 40-50 usernames in one row, then finally moves to the next line and prints the remaining information.

I would like to add column names to the output file:

VENDOR,user_name,FIRST_NAME,LAST_NAME,CREATION_DATE,REGISTRATION_STATUS

I can't figure out how to do that.

Thank you for your time and all of your support!

I edited my script as follows:

#!/bin/sh

echo "Please enter input file name."
read input_variable
echo "You entered: $input_variable"

echo "Please enter a name of the new output file."
touch output_file
read $output_file
echo "The output file name is going to be $output_file"

echo "Processing data extraction..." ;

awk -F": " n=25 -v 'NR<=n {h[NR-1]=$1} {a[NR%n-1]=$2} $1~/VENDOR/ && !hp{for(k=0;k<n;k++) printf "%s ", h[k] $input_variable && print "";hp=1} $1~/VENDOR/{for(k=0;k<n;k++) printf "%s ", a[k] && print ""}' data | column -t $input_variable ;

echo "Done."

This at least prints data to the $output_file. However, the data in the $output_file looks like:

??ࡱ?;?? ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????Root Entry????????????????????????????????????????????????????????????????

@karakfa

This is the contents of the script I have. I noticed that more than the first line of your script in your answer changed. So, I amended my script to the following:

#!/bin/sh

echo "Please enter input file name."
read input_variable
echo "You entered: $input_variable"

echo "Please enter a name of the new output file."
touch output_file
read $output_file
echo "The output file name is going to be ${output_file}"

echo "Processing data extraction..." ;

cat $input_variable | awk -F": " -v OFS="," -v n=25
  'NR<=n{sub(/^ */,"",$1);h[NR-1]=$1}
        {a[(NR-1)%n]=$2}
$1~/VENDOR/ && !hp{line=h[0];
                  for(k=1;k<n;k++) line=line OFS h[k];
                  print line;hp=1
                 }
      $1~/VENDOR/{line=a[0];
                  for(k=1;k<n;k++) line=line OFS a[k];
                  print line}' $input_variable ;
echo "Done."

The output was:

Please enter input file name.
inputfile.txt
You entered: allgmail.com_accounts.txt
Please enter a name of the new output file.
outputfile.csv
The output file name is going to be 
Processing data extraction...
awk: no program given

./scriptname: line 23: NR<=n{sub(/^ */,"",$1);h[NR-1]=$1} 
          {a[(NR-1)%n]=$2} 
  $1~/VENDOR/ && !hp{line=h[0]; 
                    for(k=1;k<n;k++) line=line OFS h[k];
                    print line;hp=1
                   }  
        $1~/VENDOR/{line=a[0];
                    for(k=1;k<n;k++) line=line OFS a[k];
                    print line}: No such file or directory
Done.

I did not find any articles about the 'awk: no program given' error. Do you know what I am doing wrong?

The error message mentions 'line 23', which is the following:

 print line}' $input_variable ;

Then, I noticed that it also says the following on the last line:

print line}: No such file or directory

This occurs with or without 'cat $input_variable |' before awk. Normally, awk works fine on my OS, which is Mac OS X 10.11.1 (15B42). Is #!/bin/sh incorrect?

I look forward to your thoughts. Thank you!

Upvotes: 0

Answers (2)

karakfa

Reputation: 67467

If all your fields are always present, you can try the following awk script. The number of fields per record is set as a variable (7 in this case), and "VENDOR" is used as the marker for the last field of a record.

UPDATE: I didn't notice the CSV output requirement at first.

$ awk -F": " -v OFS="," -v n=7 \
    'NR<=n{sub(/^ */,"",$1);h[NR-1]=$1} 
          {a[(NR-1)%n]=$2} 
 $1~/VENDOR/ && !hp{line=h[0]; 
                    for(k=1;k<n;k++) line=line OFS h[k];
                    print line;hp=1
                   }  
        $1~/VENDOR/{line=a[0];
                    for(k=1;k<n;k++) line=line OFS a[k];
                    print line}' inputfilename


user_name,EMAIL,FIRST_NAME,LAST_NAME,CREATION_DATE,REGISTRATION_STATUS,VENDOR
[email protected],[email protected],jonathan,doestein,2013-08-01 01:08:52,Y,vendorname

The script builds the header from the first n lines, prints it once when the first record is complete, and then prints each record when its final field is seen.
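As a rough sketch of how this could sit inside a script, the following is self-contained and builds its own two-record input. The file names `sample_input.txt` and `sample_output.csv` and all sample values are made up for illustration. Note the trailing backslash after `-v n=7`: without it the shell runs awk with no program and then tries to execute the program text as a separate command.

```shell
#!/bin/sh
# Sketch only: file names and sample values below are made up for illustration.
input_file=sample_input.txt
output_file=sample_output.csv

# Two sample records in the same layout as the input in the question.
cat > "$input_file" <<'EOF'
                  user_name: jdoe
                      EMAIL: jdoe@example.com
                 FIRST_NAME: jonathan
                  LAST_NAME: doestein
              CREATION_DATE: 2013-08-01 01:08:52
        REGISTRATION_STATUS: Y
                     VENDOR: vendorname
                  user_name: asmith
                      EMAIL: asmith@example.com
                 FIRST_NAME: anna
                  LAST_NAME: smith
              CREATION_DATE: 2014-02-03 10:11:12
        REGISTRATION_STATUS: N
                     VENDOR: othervendor
EOF

# The trailing backslash joins the options and the program into one command.
awk -F": " -v OFS="," -v n=7 \
    'NR<=n { sub(/^ */, "", $1); h[NR-1] = $1 }   # first record: collect field names
           { a[(NR-1)%n] = $2 }                   # every line: store value by position
     $1 ~ /VENDOR/ && !hp {                       # end of first record: print header once
         line = h[0]
         for (k = 1; k < n; k++) line = line OFS h[k]
         print line; hp = 1
     }
     $1 ~ /VENDOR/ {                              # end of each record: print its values
         line = a[0]
         for (k = 1; k < n; k++) line = line OFS a[k]
         print line
     }' "$input_file" > "$output_file"

cat "$output_file"
```

If the file names come from `read`, the variable name is given without a dollar sign (`read output_file`), and quoting `"$output_file"` keeps names with spaces intact.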

To move the last field to the front, you can change the code to

line=h[n-1]; 
for(k=0;k<n-1;k++) line=line OFS h[k];

for both occurrences (change the array name from "h" to "a" in the second instance).
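Put together, the reordered version might look like the sketch below. It assumes one made-up record and a hypothetical output name, `vendor_first.csv`. The loop starts at k=0 so that the first field (user_name) is kept after VENDOR.

```shell
#!/bin/sh
# Sketch only: one made-up record; vendor_first.csv is a hypothetical output name.
printf '%s\n' \
    'user_name: jdoe' \
    'EMAIL: jdoe@example.com' \
    'FIRST_NAME: jonathan' \
    'LAST_NAME: doestein' \
    'CREATION_DATE: 2013-08-01 01:08:52' \
    'REGISTRATION_STATUS: Y' \
    'VENDOR: vendorname' |
awk -F": " -v OFS="," -v n=7 \
    'NR<=n { sub(/^ */, "", $1); h[NR-1] = $1 }
           { a[(NR-1)%n] = $2 }
     $1 ~ /VENDOR/ && !hp {
         line = h[n-1]                                   # last field name goes first
         for (k = 0; k < n-1; k++) line = line OFS h[k]  # then fields 0 .. n-2
         print line; hp = 1
     }
     $1 ~ /VENDOR/ {
         line = a[n-1]                                   # same reordering for the values
         for (k = 0; k < n-1; k++) line = line OFS a[k]
         print line
     }' > vendor_first.csv

cat vendor_first.csv
```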

Upvotes: 2

user3131905

Reputation:

Why don't you use echo before awk?

echo VENDOR,user_name,FIRST_NAME,LAST_NAME,CREATION_DATE,REGISTRATION_STATUS > file
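A minimal sketch of that idea, assuming the header is written first with `>` and awk then appends with `>>`. The awk part here is my own illustration, not part of the suggestion above: it stores values by field name so only the six requested columns are printed. The output name `header_demo.csv` and the sample record are made up.

```shell
#!/bin/sh
# Sketch only: one made-up record; header_demo.csv is a hypothetical output name.
out=header_demo.csv

# Write the header first; ">" truncates any previous content of the file.
echo "VENDOR,user_name,FIRST_NAME,LAST_NAME,CREATION_DATE,REGISTRATION_STATUS" > "$out"

printf '%s\n' \
    '                  user_name: jdoe' \
    '                      EMAIL: jdoe@example.com' \
    '                 FIRST_NAME: jonathan' \
    '                  LAST_NAME: doestein' \
    '              CREATION_DATE: 2013-08-01 01:08:52' \
    '        REGISTRATION_STATUS: Y' \
    '                     VENDOR: vendorname' |
awk -F": " -v OFS="," '
    { sub(/^ */, "", $1); v[$1] = $2 }     # remember each value under its field name
    $1 == "VENDOR" {                       # VENDOR closes a record
        print v["VENDOR"], v["user_name"], v["FIRST_NAME"], v["LAST_NAME"], v["CREATION_DATE"], v["REGISTRATION_STATUS"]
        split("", v)                       # clear the array so a missing field does not leak into the next record
    }' >> "$out"                           # ">>" appends below the header

cat "$out"
```

Looking fields up by name rather than by position means a record with a missing line yields an empty column instead of shifting the later values.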

Upvotes: 2
