Madivad

Reputation: 3337

How to count newlines using grep in Ubuntu

(One note up front: I solved this before posting; scroll to the end for the answer.)

I'm in the process of trying to parse a large file, and before I make changes I thought I would run some "simple" tests to confirm I was getting the desired output, but I'm coming up short.

Here is a hexdump capture showing the file format:

00000030  32 2e 31 2e 30 65 2c 0d  0a 43 4c 49 45 4e 54 5f  |2.1.0e,..CLIENT_|
00000040  44 45 4d 4f 2c 31 2c 31  2c 22 4c 4b 44 55 41 32  |DEMO,1,1,"LKDUA2|

What I want to do is convert every newline (\x0d\x0a, i.e. \r\n) into something else. I was using \x09 (\t) for this purpose, so that I could re-parse the file and convert only SOME of the tabs back to newlines.

I realise there are probably better ways to do this, but I was trying to work with what I already (thought I) knew.

First I ran some trials:

tr -s '\r\n' '\t' < orig > o.rnt
tr -s '\n' '\t' < orig > o.nt
tr -s '\r' '\t' < orig > o.rt

and compared the file sizes:

$ ls -l o*
-rw-r----- 1 madivad madivad 620519 Oct 30 09:41 orig
-rw-rw-r-- 1 madivad madivad 620519 Oct 30 09:26 o.nt
-rw-rw-r-- 1 madivad madivad 620519 Oct 30 09:26 o.rt
-rw-rw-r-- 1 madivad madivad 615271 Oct 30 09:40 o.rnt

These results are as expected: o.rnt is 5248 bytes smaller than orig, which is the number of newlines, because each two-byte \r\n became a single \t. So far, so good.

What happened to the extra tab?

I added one more test and things weren't as expected:

tr -s '\r\n' '\t\t' < orig > o.rntt

-rw-rw-r-- 1 madivad madivad 615271 Oct 30 09:40 o.rntt

I was expecting 620519, but a hexdump confirms that only one \t was written per newline:

00000030  32 2e 31 2e 30 65 2c 09  43 4c 49 45 4e 54 5f 44  |2.1.0e,.CLIENT_D|

(Note: this (Q1) is more of an incidental question. I only discovered it while confirming everything before posting; my REAL questions are below.)
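
On the incidental question, the likely cause is the -s flag: tr translates first and then squeezes every run of a character from the last set into a single copy, so the two tabs produced from \r\n collapse back into one. GNU tr (the one shipped with Ubuntu) also pads a shorter second set with its last character, which would make '\t' and '\t\t' behave identically here. A minimal sketch, assuming GNU tr and using a hypothetical three-line sample in place of the real file:

```shell
# Hypothetical sample: three short lines with CRLF endings (9 bytes).
sample=$(mktemp)
printf 'a\r\nb\r\nc\r\n' > "$sample"

# With -s: \r and \n both become \t, then each run of tabs is squeezed to one.
squeezed=$(( $(tr -s '\r\n' '\t\t' < "$sample" | wc -c) ))

# Without -s: a plain one-for-one translation, so both tabs survive.
plain=$(( $(tr '\r\n' '\t\t' < "$sample" | wc -c) ))

echo "with -s: $squeezed bytes, without -s: $plain bytes"
rm -f "$sample"
```

Dropping -s keeps both tabs, at the cost of no longer collapsing each \r\n into a single character.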

How to correctly test for or count 'newline'

In running my tests, I wanted to count the occurrences of newlines, and I confirmed the count several ways, getting the correct 5248... for SOME of them. It seems the \n is not parsed correctly.

$ grep -c ^ orig
5248
$ grep -c -P '\r' orig
5248
$ grep -c -P '\r' o.rt
5248
$ grep -c -P '\x0d' o.rt
5248
$ grep -c -P '\t' o.rnt
1
$ grep -c -P '\n' orig
0
$ grep -c -P '\x0a' orig
0
$ grep -c -P '\r\n' orig
0

Confirmation of conversion and testing

$ hexdump -C -s 48 -n 32 orig
00000030  32 2e 31 2e 30 65 2c 0d  0a 43 4c 49 45 4e 54 5f  |2.1.0e,..CLIENT_|

$ hexdump -C -s 48 -n 32 o.rt
00000030  32 2e 31 2e 30 65 2c 09  0a 43 4c 49 45 4e 54 5f  |2.1.0e,..CLIENT_|

$ hexdump -C -s 48 -n 32 o.nt
00000030  32 2e 31 2e 30 65 2c 0d  09 43 4c 49 45 4e 54 5f  |2.1.0e,..CLIENT_|

$ hexdump -C -s 48 -n 32 o.rnt
00000030  32 2e 31 2e 30 65 2c 09  43 4c 49 45 4e 54 5f 44  |2.1.0e,.CLIENT_D|

In the case of the output files, tr -s '\r\n' '\t' < orig > o.rnt seems to do the job right, but my grep for testing it is wrong:

$ hexdump -C -n 600 o.rnt | grep -P ' 09 '
00000030  32 2e 31 2e 30 65 2c 09  43 4c 49 45 4e 54 5f 44  |2.1.0e,.CLIENT_D|
00000110  2c 22 22 2c 31 2c 2c 09  43 4c 49 45 4e 54 5f 41  |,"",1,,.CLIENT_A|
000001a0  22 22 2c 30 2c 22 22 2c  09 43 4c 49 45 4e 54 5f  |"",0,"",.CLIENT_|
00000200  73 65 2c 46 61 6c 73 65  2c 30 2c 09 43 4c 49 45  |se,False,0,.CLIE|
00000230  31 2c 09 43 4c 49 45 4e  54 5f 43 4e 53 4e 54 2c  |1,.CLIENT_CNSNT,|

$ grep -c -P '\t' o.rnt
1

Where I used tr -s '\n' '\t' < orig > o.nt, it also appears to have worked; again, my test is wrong:

$ hexdump -C -n 600 o.nt | grep -P ' 09 '
00000030  32 2e 31 2e 30 65 2c 0d  09 43 4c 49 45 4e 54 5f  |2.1.0e,..CLIENT_|
00000110  30 2c 22 22 2c 31 2c 2c  0d 09 43 4c 49 45 4e 54  |0,"",1,,..CLIENT|
000001a0  22 2c 22 22 2c 30 2c 22  22 2c 0d 09 43 4c 49 45  |","",0,"",..CLIE|
00000200  46 61 6c 73 65 2c 46 61  6c 73 65 2c 30 2c 0d 09  |False,False,0,..|
00000230  2c 31 32 30 31 2c 0d 09  43 4c 49 45 4e 54 5f 43  |,1201,..CLIENT_C|

$ grep -c -P '\t' o.nt
1

Thanks

I don't want to move ahead until I understand where I'm going wrong, so that I don't further exacerbate the problem :)

I worked it out

As stated above, I actually worked it out, but can now ask:

1. Is there a better way?

This is the test I came up with; I would be happy to hear of any improvements:

$ grep -o -P '\t' o.nt | wc -l
5249

Oh yeah, and the count is one higher than 5248 because there IS actually one extra tab in the file (long story).

Looking at it in retrospect, how would I count it using hexdump while being mindful of line crossings? That is, how would I count or display every 0D 0A pair, even one that straddles two rows of hexdump output?
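
One way to sidestep line crossings is to dump one byte per field and join the dump into a single line before searching it. The sketch below uses od rather than hexdump (an assumption made for portability; the idea is the same) on a hypothetical sample whose first \r\n straddles the 16-byte row boundary:

```shell
# Hypothetical sample: 15 digits, CRLF, "abc", CRLF. The first 0d is byte 16
# and its 0a is byte 17, so a row-based hex dump splits the pair.
sample=$(mktemp)
printf '123456789012345\r\nabc\r\n' > "$sample"

# od prints space-separated hex bytes; tr turns row breaks into spaces and
# squeezes them, then grep -o emits each "0d 0a" match on its own line.
crlf_pairs=$(( $(od -A n -t x1 -v "$sample" | tr -s '\n' ' ' | grep -o '0d 0a' | wc -l) ))

echo "CRLF pairs: $crlf_pairs"
rm -f "$sample"
```

Because every byte is a separate two-digit field, the pattern '0d 0a' can only match a real \r followed by a real \n, never the tail of one byte and the head of the next.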

Upvotes: 1

Views: 59

Answers (1)

Madivad

Reputation: 3337

In doing my FINAL test, I FINALLY GOT IT

I was all set to post this question and, as has happened to me numerous times in the past, writing up a Stack Exchange question resulted in me finding the answer before I posted it.

I've been at this for over an hour now, but I see the error of my ways. I am still posting this because it took me all this time to learn, and maybe it could save others the same trouble :/

I was forgetting that grep -c COUNTS THE NUMBER OF MATCHING LINES, not the number of matches, and by removing the newline characters I was left with only one line in the file :(

I came up with this test:

$ grep -o -P '\t' o.nt | wc -l
5249
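
The difference can be reproduced on a tiny file. A minimal sketch, using a hypothetical one-line sample and a shell variable holding a literal tab in place of -P '\t'; the tr | wc -c variant is only one possible alternative for counting a single byte value:

```shell
# Hypothetical sample: a single line containing three tabs and no newline.
sample=$(mktemp)
printf 'a\tb\tc\td' > "$sample"
tab=$(printf '\t')

# grep -c reports how many LINES matched, so one line means a count of 1.
matching_lines=$(grep -c "$tab" "$sample")

# grep -o prints every match on its own line, so wc -l counts matches.
matches=$(( $(grep -o "$tab" "$sample" | wc -l) ))

# Alternative: delete everything except tabs and count the bytes left.
tab_bytes=$(( $(tr -cd '\t' < "$sample" | wc -c) ))

echo "lines: $matching_lines, matches: $matches, bytes: $tab_bytes"
rm -f "$sample"
```

The same reasoning explains the zero counts for '\n' in the question: grep splits its input on newlines before matching, so the pattern never sees one.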

Upvotes: 1
