Reputation: 2829
I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56 AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56 AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56 AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length including spaces. The following command doesn't include spaces, is there a way to modify it so it will work for me?
cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
Upvotes: 214
Views: 138573
Reputation: 5113
< testfile awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
< testfile awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
In both cases, we have solved your stated problem by moving away from awk for your final cut.
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s
(--stable
) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.
(Those who want more control of sorting these ties might look at sort's --key
option.)
It is interesting to note the difference between:
echo "hello awk world" | awk '{print}'
echo "hello awk world" | awk '{$1="hello"; print}'
They yield respectively
hello awk world
hello awk world
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
$1 = $1 # force record to be reconstituted
print $0 # or whatever else with $0
"This forces awk to rebuild the record."
aa A line with MORE spaces
bb The very longest line in the file
ccb
9 dd equal len. Orig pos = 1
500 dd equal len. Orig pos = 2
ccz
cca
ee A line with some spaces
1 dd equal len. Orig pos = 3
ff
5 dd equal len. Orig pos = 4
g
Upvotes: 317
Reputation: 32315
In the vain of "prefix line with its length and feed it to sort -n
", here's a "bash native" solution (no awk, perl, or shudder python):
cat FILE | while read line;
do echo "${#line}:$line";
done | sort -n | cut -d: -f2-
Upvotes: 2
Reputation: 1165
Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
Benchmark:
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real 0m5.308s
user 0m3.733s
sys 0m1.490s
perl -e 'print sort { length($a) <=> length($b) } <>'
real 0m8.840s
user 0m7.117s
sys 0m2.279s
Upvotes: 9
Reputation: 3451
Below are the results of a benchmark across solutions from other answers to this question.
perl
solution took 11.2 secondsperl
solution took 11.6 secondsawk
solution #1 took 20 secondsawk
solution #2 took 23 secondsawk
solution took 24 secondsawk
solution took 25 secondsbash
solution takes 400x longer than the awk
solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.perl
solutionperl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Upvotes: 23
Reputation: 535
Revisiting this one. This is how I approached it (count length of LINE and store it as LEN, sort by LEN, keep only the LINE):
cat test.csv | while read LINE; do LEN=$(echo ${LINE} | wc -c); echo ${LINE} ${LEN}; done | sort -k 2n | cut -d ' ' -f 1
Upvotes: 2
Reputation: 2289
using Raku (formerly known as Perl6)
~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
To reverse the sort, add .reverse
in the middle of the chain of method calls--immediately after .sort()
. Here's code showing that .chars
includes spaces:
~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0
Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:
~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
real 0m1.308s
user 0m1.213s
sys 0m0.173s
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- > /dev/null
real 0m1.189s
user 0m1.170s
sys 0m0.050s
HTH.
Upvotes: 3
Reputation: 107
1) pure awk solution. Let's suppose that line length cannot be more > 1024 then
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) one liner bash solution assuming all lines have just 1 word, but can reworked for any case where all lines have same number of words:
LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1
Upvotes: 3
Reputation: 10516
Here is a multibyte-compatible method of sorting lines by length. It requires:
wc -m
is available to you (macOS has it).LC_ALL=UTF-8
. You can set this either in your .bash_profile, or simply by prepending it before the following command.testfile
has a character encoding matching your locale (e.g., UTF-8).Here's the full command:
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part:
l=$0; gsub(/\047/, "\047\"\047\"\047", l);
← makes of a copy of each line in awk variable l
and double-escapes every '
so the line can safely be echoed as a shell command (\047
is a single-quote in octal notation).cmd=sprintf("echo \047%s\047 | wc -m", l);
← this is the command we'll execute, which echoes the escaped line to wc -m
.cmd | getline c;
← executes the command and copies the character count value that is returned into awk variable c
.close(cmd);
← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.sub(/ */, "", c);
← trims white space from the character count value returned by wc
.{ print c, $0 }
← prints the line's character count value, a space, and the original line.| sort -ns
← sorts the lines (by prepended character count values) numerically (-n
), and maintaining stable sort order (-s
).| cut -d" " -f2-
← removes the prepended character count values.It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk
(as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
Upvotes: 2
Reputation: 1
With POSIX Awk:
{
c = length
m[c] = m[c] ? m[c] RS $0 : $0
} END {
for (c in m) print m[c]
}
Upvotes: 3
Reputation: 5428
The AWK solution from neillb is great if you really want to use awk
and it explains why it's a hassle there, but if what you want is to get the job done quickly and don't care what you do it in, one solution is to use Perl's sort()
function with a custom caparison routine to iterate over the input lines. Here is a one liner:
perl -e 'print sort { length($a) <=> length($b) } <>'
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat
or a shell redirect) or just give the filename to perl as another argument and let it open the file.
In my case I needed the longest lines first, so I swapped out $a
and $b
in the comparison.
Upvotes: 67
Reputation: 8741
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort
the -g
(general-numeric-sort) flag instead of -n
(numeric-sort):
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
Upvotes: 3
Reputation: 785068
Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
Upvotes: 18
Reputation: 753665
The length()
function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'
The sed
command directly removes the digits and colon added by the awk
command. Alternatively, keeping your formatting from awk
:
awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
Upvotes: 6
Reputation: 17188
Pure Bash:
declare -a sorted
while read line; do
if [ -z "${sorted[${#line}]}" ] ; then # does line length already exist?
sorted[${#line}]="$line" # element for new length
else
sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
fi
done < data.csv
for key in ${!sorted[*]}; do # iterate over existing indices
echo -e "${sorted[$key]}" # echo lines with equal length
done
Upvotes: 8