Reputation: 33
I'm new to this site and to programming in general (biologist by background).
Anyway, I have a task: obtain a text file's name, count its unique lines, count its total lines, and write all of this to a CSV file. This is the code I am using in Cygwin:
#!/bin/bash
file=./data/*.txt
name= ls ./data > output.csv
unique= sort $file | uniq | wc -l >> output.csv
total= cat $file | wc -l >> output.csv
nano output.csv
I get all the correct outputs; my questions are:
Can I choose in which column each value is entered? At the moment they are added directly underneath each other.
Is there a more efficient way of adding the outputs to the output file?
Thank you!
Fran
Upvotes: 2
Views: 719
Reputation: 46856
Nobody can compete with Jonathan Leffler, but the following gawk script also handles your requirements. It's a little more code, but in cases with multiple files, it may work more efficiently than a shell script.
#!/usr/local/bin/gawk -f

# Print name, unique-line count and total-line count for the file just
# finished, then reset the counters for the next file.
function show() {
    print last, length(unique), total;
    last = FILENAME;
    delete(unique);
    total = 0;
}

BEGIN {
    OFS = ",";          # comma-separate the output fields
}

NR == 1 {
    last = FILENAME;    # remember the name of the first file
}

FILENAME != last {
    show();             # a new file has started; report the previous one
}

{
    total++;            # count every input line
    unique[$0];         # referring to the element marks this line as seen
}

END {
    show();             # report the last file
}
The only novel thing here is the use of the unique[] array. Since awk's arrays are all associative, using $0 as a key makes an array whose length is the number of unique lines. And merely making reference to an array element causes it to exist, so you don't actually need to set unique[$0] to anything.
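If you want to see that behavior in isolation, here's a quick one-liner (with made-up sample input, not taken from the question) that counts unique lines the same way:
$ printf 'a\nb\na\n' | gawk '{ seen[$0] } END { print length(seen) }'
2
Merely mentioning seen[$0] creates an element for each distinct line, and gawk's length() on an array reports the number of keys.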
To use the script, you'd first make it executable (chmod +x script.sh) and then run a command line like the following:
$ ./script.sh one.txt two.txt > output.csv
or, alternatively, something like:
$ ./script.sh *.txt > output.csv
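Each input file produces one CSV row in the form name,unique,total, so the three values land in separate columns when the file is opened in a spreadsheet. With hypothetical contents for one.txt and two.txt, output.csv might look like:
one.txt,3,5
two.txt,2,2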
Note that in Cygwin, you may need to install the gawk package explicitly, and you will need to adjust the path to gawk in the first line of the script. You can type which gawk to see if it is already installed and, if so, where it lives on your system.
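For example, if which gawk reports /usr/bin/gawk (a common location under Cygwin, though yours may differ), you'd adjust the script to match:
$ which gawk
/usr/bin/gawk
and then change the script's first line to #!/usr/bin/gawk -f.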
Upvotes: 2
Reputation: 753870
There are numerous improvements to make to the existing code, which is:
#!/bin/bash
file=./data/*.txt
name= ls ./data > output.csv
unique= sort $file | uniq | wc -l >> output.csv
total= cat $file | wc -l >> output.csv
nano output.csv
The three lines that write to output.csv carefully set the environment variables name, unique and total to empty strings and then run commands; that isn't precisely wrong, but it really isn't what you had in mind, either. The sort | uniq can be simplified to sort -u. There's no need for the cat $file | wc -l when wc -l < $file will do the same job with one fewer process. The ls line generates the same names as the wildcard expansion does. You've also got some issues with processing one file at a time vs. all the files together.
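To see what that name= prefix actually does, here's a small demonstration (a hypothetical session, with made-up file names): the prefix only sets an empty variable in the command's environment, whereas command substitution with $(...) actually captures the output:
$ name= ls ./data        # runs ls with name="" in its environment
one.txt  two.txt
$ echo "name is '$name'" # name is still empty in the current shell
name is ''
$ name=$(ls ./data)      # command substitution captures the output
$ echo "$name"
one.txt
two.txt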
If you want a CSV file with name, unique lines, and total lines for each file, then we'd expect to see a loop in the code.
for file in ./data/*.txt
do
    unique=$(sort -u "$file" | wc -l)   # number of distinct lines
    total=$(wc -l < "$file")            # total line count, no file name printed
    echo "$file,$unique,$total"
done
This runs sort -u to sort uniquely (no need for the explicit uniq), and captures the output from wc -l. It runs wc -l with its standard input redirected from the file for the total line count; using the I/O redirection stops wc from printing the file name. The echo then prints the data. If you only want the base name of the file (just xyz.txt and not ./data/xyz.txt), then you can fix that in the echo:
echo "$(basename $file),$unique,$total"
or:
echo "${file##*/},$unique,$total"
The only possible downside to this is that it runs the commands once per file, which could be a bit of a problem if there are lots of files. However, this will work: get it right first, and only then, if there is a speed problem, spend time on optimizing it.
Upvotes: 3