Francisco Sadras

Reputation: 33

Outputting values into CSVs - command line

New to this site and programming in general (biologist by background).

Anyway, I have a task: obtain a text file's name, count its unique lines, count its total lines, and output all of this into a CSV file. This is the code I am using in Cygwin:

#!/bin/bash
file=./data/*.txt
name= ls ./data > output.csv
unique= sort $file | uniq | wc -l >> output.csv
total= cat $file | wc -l >> output.csv
nano output.csv

I get all the correct outputs. My questions are:

  1. Can I choose in which column each value is entered? At the moment they are added directly underneath each other.

  2. Is there a more efficient way of adding the outputs to the output file?

Thank you!

Fran

Upvotes: 2

Views: 719

Answers (2)

ghoti

Reputation: 46856

Nobody can compete with Jonathan Leffler, but the following gawk script also handles your requirements. It's a little more code, but in cases with multiple files, it may work more efficiently than a shell script.

#!/usr/local/bin/gawk -f

# Print the stats for the file just finished, then reset for the next one.
function show() {
  print last, length(unique), total;
  last = FILENAME;
  delete unique;
  total = 0;
}

BEGIN {
  OFS = ",";          # comma-separate the output fields (CSV)
}

NR == 1 {
  last = FILENAME;    # remember the name of the first file
}

FILENAME != last {
  show();             # input has moved on to the next file
}

{
  total++;            # count every line
  unique[$0];         # referencing the element marks this line as seen
}

END {
  show();             # emit the stats for the final file
}

The only novel thing here is the use of the unique[] array. Since awk's arrays are all associative, using $0 as a key makes an array whose length is the number of unique lines. And merely making reference to an array element causes it to exist, so you don't actually need to set unique[$0] to anything.
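
You can see this behaviour in a quick one-liner (the sample input here is just for illustration):

$ printf 'a\nb\na\n' | gawk '{ u[$0] } END { print length(u) }'
2

Three input lines, two distinct values, so length(u) reports 2.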

To run the script, you'd use a command line like the following:

$ ./script.sh one.txt two.txt > output.csv

Or, alternatively, something like:

$ ./script.sh *.txt > output.csv

Note that in Cygwin, you may need to install the gawk package explicitly, and you will need to adjust the path to gawk in the first line of the script. You can type which gawk to see if it is already installed, and if so, where it lives on your system.
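
For example, a typical installation reports something like this (the exact path varies from system to system):

$ which gawk
/usr/bin/gawk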

Upvotes: 2

Jonathan Leffler

Reputation: 753870

There are numerous improvements to make to the existing code, which is:

#!/bin/bash
file=./data/*.txt
name= ls ./data > output.csv
unique= sort $file | uniq | wc -l >> output.csv
total= cat $file | wc -l >> output.csv
nano output.csv

The three lines that write to output.csv carefully set the environment variables name, unique and total to empty strings in the environment of the commands they run, and then run those commands. That isn't precisely wrong, but it really isn't what you had in mind, either. The sort | uniq can be simplified to sort -u. There's no need for cat $file | wc -l when wc -l < $file will do the same job with one fewer process. The ls line generates the same names as the wildcard expansion does. You've also got an issue with one file at a time vs all files together: because the unquoted $file expands to every .txt file in ./data, sort and wc see all the files combined rather than each file separately.
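
You can see the variable-assignment problem in a quick session (data.txt here is a hypothetical sample file, and the counts are illustrative):

$ unique= sort data.txt | uniq | wc -l    # sets unique to "" in sort's environment only
42
$ echo "unique is '$unique'"              # the shell variable was never set; the count went to stdout
unique is ''
$ unique=$(sort -u data.txt | wc -l)      # command substitution captures the count instead
$ echo "unique is '$unique'"
unique is '42'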

If you want a CSV file with name, unique lines, and total lines for each file, then we'd expect to see a loop in the code.

for file in ./data/*.txt
do
    unique=$(sort -u "$file" | wc -l)
    total=$(wc -l < "$file")
    echo "$file,$unique,$total"
done

This runs sort -u to sort uniquely (no need for the explicit uniq) and captures the output from wc -l via command substitution. For the total line count, it runs wc -l with its standard input redirected from the file; the redirection stops wc from printing the file name. Quoting "$file" keeps any file names containing spaces intact. The echo then prints the data. If you only want the base name of the file (just xyz.txt and not ./data/xyz.txt), then you can fix that in the echo:

echo "$(basename $file),$unique,$total"

or:

echo "${file##*/},$unique,$total"

The only possible downside to this is that it runs the commands once per file, which could be a bit of a problem if there are lots of files. However, this will work: get it right first, and only then, if there is a speed problem, spend time optimizing it.
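
Putting it all together, a sketch of the whole script might look like this (the header row and its column names are my suggestion, not part of your original requirements; adjust to taste):

#!/bin/bash
# Write a header row, then one CSV row (name,unique,total) per file.
echo "name,unique,total" > output.csv
for file in ./data/*.txt
do
    unique=$(sort -u "$file" | wc -l)
    total=$(wc -l < "$file")
    echo "${file##*/},$unique,$total"
done >> output.csv

Redirecting the whole loop once at the end, instead of appending inside the loop body, means output.csv is opened just once after the header is written.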

Upvotes: 3
