Reputation: 91

How to remove nonnumeric junk from a file

Here's an output from less:

487451
487450<A3><BA>1<A3><BA>1
487449<A3><BA>1<A3><BA>1
487448<A3><BA>1<A3><BA>1
487447<A3><BA>1<A3><BA>1
487446<A3><BA>1<A3><BA>1
487445<A3><BA>1<A3><BA>1
484300<A3><BA>1<A3><BA>1
484299<A3><BA>1<A3><BA>1
484297<A3><BA>1<A3><BA>1
484296<A3><BA>1<A3><BA>1
484295<A3><BA>1<A3><BA>1
484294<A3><BA>1<A3><BA>1
484293<A3><BA>1<A3><BA>1
483496
483495
483494
483493
483492
483491

I see a bunch of nonprintable characters here. How do I remove them using sed/tr?

My try was 's/\([0-9][0-9]*\)/\1/g', but it doesn't work.

EDIT: Okay, let's go further down the source. The numbers are extracted from this file:

487451"><img src="Manage/pic/20100901/Adidas running-429.JPG" alt="Adidas running-429" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
487450"><img src="Manage/pic/20100901/Adidas fs 1<A3><BA>1-060.JPG" alt="Adidas fs 1<A3><BA>1-060" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>

The first line is perfectly normal, and most of the lines look like it. The second is "corrupted". I'd just like to extract the number at the beginning (using 's/\([0-9][0-9]*\).*/\1/g'), but somehow the nonprintables still get through, even though the number match should stop at the ".

EDIT II: Here's a clarification: There are no brackets in the text file. These are character codes of nonprintable characters. The brackets are there because I copied the file from less. Mac's Terminal, on the other hand, uses ?? to represent such characters. I bet xterm on my Ubuntu would print that white oval with a question mark.

Upvotes: 2

Answers (5)

user2461982

Reputation: 1

If the data always looks like the sample, deleting everything from the less-than sign to the end of the line would work fine:

sed -i "s/<.*$//" file

Upvotes: -2

josh.trow

Reputation: 4901

If you know the crap will always be inside brackets, why not delete that crap?

sed 's/<[^>]*>//g'

EDIT: Thanks, Mike, that makes sense. In that case, how about:

sed -E 's/([0-9]+).*/\1/g'
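
As the question's second edit notes, the brackets are only how less displays the bytes, so a bracket pattern has nothing to match. A minimal sketch of deleting the raw bytes instead, assuming the junk consists only of bytes outside printable ASCII. The function names and the shortened sample line are made up for illustration:

```shell
# Hypothetical sample: \243 and \272 are the octal forms of the bytes
# that less shows as <A3> and <BA>.
make_junk_line() {
    printf 'Adidas fs 1\243\2721-060.JPG\n'
}

# Delete every byte that is neither printable ASCII nor a newline.
# LC_ALL=C makes tr work on single bytes rather than on characters.
strip_nonprintable() {
    LC_ALL=C tr -cd '[:print:]\n'
}

make_junk_line | strip_nonprintable
```

This keeps the rest of each line intact, so the sample becomes "Adidas fs 11-060.JPG". It only removes the junk and does not isolate the number.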

Upvotes: 0

anubhava

Reputation: 785146

Try this sed command:

sed 's/^\([0-9][0-9]*\).*$/\1/' file.txt

OUTPUT (running above command on the input file you provided)
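
If the stray bytes still survive a command like this, the locale is a likely reason: in a UTF-8 locale, GNU sed's "." may refuse to match bytes that are not valid UTF-8, so ".*" stops short of them. A hedged sketch that forces the C locale so every byte counts as a character. The sample line is abbreviated from the one in the question, and the function names are hypothetical:

```shell
# Hypothetical sample modelled on the "corrupted" line in the question;
# \243\272 are the bytes that less displays as <A3><BA>.
make_line() {
    printf '487450"><img src="Manage/pic/20100901/Adidas fs 1\243\2721-060.JPG" height="120" /></a></td>\n'
}

# Keep only the leading run of digits. With LC_ALL=C sed treats the input
# as single bytes, so "." also matches the stray bytes.
leading_number() {
    LC_ALL=C sed 's/^\([0-9][0-9]*\).*$/\1/'
}

make_line | leading_number
```

With the pattern anchored at the start of the line and no g flag, the substitution is applied once per line, and the sample line reduces to 487450.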

Upvotes: 0

Jonathan Leffler

Reputation: 753725

Classic job for either sed or tr.

sed 's/[^0-9]//g' "$file"

(Anything that is not a digit - or newline - is deleted.)

tr -cd '0-9\012' < "$file" > "$file.1"

Delete (-d) the complement (-c) of the digits and newline...
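
To try the tr variant without the original file, here is a small sketch on made-up input shaped like the less output in the question. The helper names are hypothetical, and LC_ALL=C is an added precaution so that tr works byte by byte:

```shell
# Hypothetical three-line sample; \243 and \272 are the octal forms of
# the bytes that less shows as <A3> and <BA>.
make_sample() {
    printf '487451\n487450\243\2721\243\2721\n483496\n'
}

# Delete every byte that is not a digit or a newline (octal 012).
strip_nondigits() {
    LC_ALL=C tr -cd '0-9\012'
}

make_sample | strip_nondigits
```

Note that the digits embedded in the junk are kept, so the second line comes out as 48745011 rather than 487450. If only the leading number is wanted, a pattern anchored at the start of the line is the safer choice.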

Upvotes: 7

deong

Reputation: 3870

You missed the bit where you match the rest of the line.

sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g' 
                      ^^^^^^^

Upvotes: 2
