Mike

Reputation: 91

How to remove nonnumeric junk from a file

Here's some output from less:

487451
487450<A3><BA>1<A3><BA>1
487449<A3><BA>1<A3><BA>1
487448<A3><BA>1<A3><BA>1
487447<A3><BA>1<A3><BA>1
487446<A3><BA>1<A3><BA>1
487445<A3><BA>1<A3><BA>1
484300<A3><BA>1<A3><BA>1
484299<A3><BA>1<A3><BA>1
484297<A3><BA>1<A3><BA>1
484296<A3><BA>1<A3><BA>1
484295<A3><BA>1<A3><BA>1
484294<A3><BA>1<A3><BA>1
484293<A3><BA>1<A3><BA>1
483496
483495
483494
483493
483492
483491

I see a bunch of nonprintable characters here. How do I remove them using sed/tr?

My attempt was 's/\([0-9][0-9]*\)/\1/g', but it doesn't work.

EDIT: Okay, let's go further down the source. The numbers are extracted from this file:

487451"><img src="Manage/pic/20100901/Adidas running-429.JPG" alt="Adidas running-429" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
487450"><img src="Manage/pic/20100901/Adidas fs 1<A3><BA>1-060.JPG" alt="Adidas fs 1<A3><BA>1-060" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>

The first line is perfectly normal, and most lines look like it. The second is "corrupted". I'd just like to extract the number at the beginning (using 's/\([0-9][0-9]*\).*/\1/g'), but somehow the nonprintables end up in the result, even though the captured number should stop at the first ".

EDIT II: Here's a clarification: There are no brackets in the text file. These are character codes of nonprintable characters. The brackets are there because I copied the file from less. Mac's Terminal, on the other hand, uses ?? to represent such characters. I bet xterm on my Ubuntu would print that white oval with a question mark.
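
One way to confirm what those bytes really are is to dump them in hex rather than rely on how less or a terminal renders them. A minimal sketch, assuming a sample line built with printf (the octal escapes \243\272 stand in for the bytes displayed as <A3><BA>) and a hypothetical helper called show_bytes:

```shell
# show_bytes is a hypothetical helper: print stdin as hex bytes, one space apart.
# "od -A n -t x1" drops the offset column and prints one byte per field;
# "tr -s ' '" squeezes the padding, which differs between od implementations.
show_bytes() {
    od -A n -t x1 | tr -s ' '
}

# Assumed sample: the digits, then the two raw bytes 0xA3 0xBA, then "1".
printf '487450\243\2721\n' | show_bytes
# The dump should contain "a3 ba" right after the six digit bytes.
```

On the real data the same helper could be fed a single line, for example with head -n 2 file | show_bytes.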

Upvotes: 2

Views: 5904

Answers (5)

user2461982

Reputation: 1

If the data always looks like the sample, deleting from the less-than sign to the end of the line would work fine:

sed -i 's/<.*$//' file

Upvotes: -2

josh.trow

Reputation: 4901

If you know the crap will always be inside brackets, why not delete that crap?

sed 's/<[^>]*>//g'

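EDIT: Thanks, Mike, that makes sense. In that case, how about this (the -E switches sed to extended regular expressions, which the unescaped parentheses and + require):

sed -E 's/([0-9]+).*/\1/g'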

Upvotes: 0

anubhava

Reputation: 785146

Try this sed command:

sed 's/^\([0-9][0-9]*\).*$/\1/' file.txt

Output (running the above command on the input file you provided):

487451
487450
487449
487448
487447
487446
487445
484300
484299
484297
484296
484295
484294
484293
483496
483495
483494
483493
483492
483491
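
If the junk still survives on your machine, the locale is a likely culprit. That is an assumption on my part, not something confirmed in the question: in a UTF-8 locale, GNU sed's . does not match bytes that are not valid UTF-8, so .* can stop at the first <A3>. A minimal sketch that forces the C locale so every byte counts as one character; leading_number is a hypothetical helper name and the sample line is a shortened stand-in for the real one:

```shell
# leading_number is a hypothetical helper: keep only the digits that start each line.
# LC_ALL=C makes sed treat input as single bytes, so "." also matches 0xA3 and 0xBA.
leading_number() {
    LC_ALL=C sed 's/^\([0-9][0-9]*\).*$/\1/'
}

# Assumed sample: \243\272 are the octal escapes for the bytes shown as <A3><BA>.
printf '487450"><img src="Adidas fs 1\243\2721-060.JPG" />\n' | leading_number
# prints 487450
```

On the real file this would be leading_number < file.txt.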

Upvotes: 0

Jonathan Leffler

Reputation: 753725

Classic job for either sed or tr.

sed 's/[^0-9]//g' "$file"

(Anything that is not a digit - or newline - is deleted.)

tr -cd '0-9\012' < "$file" > "$file.1"

Delete (-d) the complement (-c) of the digits and newline...
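
To see what the tr variant does to the sample, here is a small sketch. The input is an assumed two-line stand-in for the listing from less (\243\272 are the octal escapes for <A3><BA>), digits_only is a hypothetical helper name, and LC_ALL=C is my addition so that tr handles the input as raw bytes:

```shell
# digits_only is a hypothetical helper: delete every byte that is not 0-9 or newline.
digits_only() {
    LC_ALL=C tr -cd '0-9\012'
}

# Assumed sample: one clean line and one corrupted line.
printf '487451\n487450\243\2721\243\2721\n' | digits_only
# prints:
# 487451
# 48745011
```

Note that the 1s after the junk are digits, so they are kept. This removes the nonnumeric bytes as asked; if only the leading number is wanted, an anchored substitution that discards the rest of the line is the better fit.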

Upvotes: 7

deong

Reputation: 3870

You missed the bit where you match the rest of the line.

sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g' 
                      ^^^^^^^

Upvotes: 2
