Reputation: 33
I'm new to this site and to programming in general (biologist by background).
Anyway, I have a task: obtain a text file's name, count its unique lines, count its total lines, and write all of this to a CSV file. This is the code I am using in Cygwin:
#!/bin/bash
file=./data/*.txt
name= ls ./data > output.csv
unique= sort $file | uniq | wc -l >> output.csv
total= cat $file | wc -l >> output.csv
nano output.csv
I get all the correct outputs; my questions are:
Can I choose in which column each value is entered? At the moment they are added directly underneath each other.
Is there a more efficient way of adding the outputs to the output file?
Thank you!
Fran
Upvotes: 2
Views: 719
Reputation: 46856
Nobody can compete with Jonathan Leffler, but the following gawk script also handles your requirements. It's a little more code, but in cases with multiple files, it may work more efficiently than a shell script.
#!/usr/local/bin/gawk -f

# Print name, unique-line count and total-line count for the file just
# finished, then reset the counters for the next file.
function show() {
    print last, length(unique), total;
    last = FILENAME;
    delete(unique);
    total = 0;
}

BEGIN {
    OFS = ",";          # comma-separate the output fields
}

NR == 1 {
    last = FILENAME;    # remember the name of the first file
}

FILENAME != last {
    show();             # a new file has started; report the previous one
}

{
    total++;            # count every input line
    unique[$0];         # referring to the element marks this line as seen
}

END {
    show();             # report the last file
}
The only novel thing here is the use of the unique[] array. Since awk's arrays are all associative, using $0 as a key makes an array whose length is the number of unique lines. And merely making reference to an array element causes it to exist, so you don't actually need to set unique[$0] to anything.
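If you want to see that behavior in isolation, here's a quick one-liner (with made-up sample input, not taken from the question) that counts unique lines the same way:
$ printf 'a\nb\na\n' | gawk '{ seen[$0] } END { print length(seen) }'
2
Merely mentioning seen[$0] creates an element for each distinct line, and gawk's length() on an array reports the number of keys.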
To use the script, you'd first make it executable (chmod +x script.sh) and then run a command line like the following:
$ ./script.sh one.txt two.txt > output.csv
or, alternatively, something like:
$ ./script.sh *.txt > output.csv
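Each input file produces one CSV row in the form name,unique,total, so the three values land in separate columns when the file is opened in a spreadsheet. With hypothetical contents for one.txt and two.txt, output.csv might look like:
one.txt,3,5
two.txt,2,2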
Note that in Cygwin, you may need to install the gawk package explicitly, and you will need to adjust the path to gawk in the first line of the script. You can type which gawk to see if it is already installed and, if so, where it lives on your system.
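For example, if which gawk reports /usr/bin/gawk (a common location under Cygwin, though yours may differ), you'd adjust the script to match:
$ which gawk
/usr/bin/gawk
and then change the script's first line to #!/usr/bin/gawk -f.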
Upvotes: 2
Reputation: 753870
There are numerous improvements to make to the existing code, which is:
#!/bin/bash
file=./data/*.txt
name= ls ./data > output.csv
unique= sort $file | uniq | wc -l >> output.csv
total= cat $file | wc -l >> output.csv
nano output.csv
The three lines that write to output.csv carefully set the environment variables name, unique and total to empty strings and then run commands; that isn't precisely wrong, but it really isn't what you had in mind, either. The sort | uniq can be simplified to sort -u. There's no need for the cat $file | wc -l when wc -l < $file will do the same job with one fewer process. The ls line generates the same names as the wildcard expansion does. You've also got some issues with processing one file at a time vs. all the files together.
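To see what that name= prefix actually does, here's a small demonstration (a hypothetical session, with made-up file names): the prefix only sets an empty variable in the command's environment, whereas command substitution with $(...) actually captures the output:
$ name= ls ./data        # runs ls with name="" in its environment
one.txt  two.txt
$ echo "name is '$name'" # name is still empty in the current shell
name is ''
$ name=$(ls ./data)      # command substitution captures the output
$ echo "$name"
one.txt
two.txt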
If you want a CSV file with name, unique lines, and total lines for each file, then we'd expect to see a loop in the code.
for file in ./data/*.txt
do
    unique=$(sort -u "$file" | wc -l)   # number of distinct lines
    total=$(wc -l < "$file")            # total line count, no file name printed
    echo "$file,$unique,$total"
done
This runs sort -u to sort uniquely (no need for the explicit uniq), and captures the output from wc -l. It runs wc -l with its standard input redirected from the file for the total line count; using the I/O redirection stops wc from printing the file name. The echo then prints the data. If you only want the base name of the file (just xyz.txt and not ./data/xyz.txt), then you can fix that in the echo:
echo "$(basename $file),$unique,$total"
or:
echo "${file##*/},$unique,$total"
The only possible downside to this is that it runs the commands once per file, which could be a bit of a problem if there are lots of files. However, this will work: get it right first, and only then, if there is a speed problem, spend time on optimizing it.
Upvotes: 3