Reputation: 91
Here's an output from less
:
487451
487450<A3><BA>1<A3><BA>1
487449<A3><BA>1<A3><BA>1
487448<A3><BA>1<A3><BA>1
487447<A3><BA>1<A3><BA>1
487446<A3><BA>1<A3><BA>1
487445<A3><BA>1<A3><BA>1
484300<A3><BA>1<A3><BA>1
484299<A3><BA>1<A3><BA>1
484297<A3><BA>1<A3><BA>1
484296<A3><BA>1<A3><BA>1
484295<A3><BA>1<A3><BA>1
484294<A3><BA>1<A3><BA>1
484293<A3><BA>1<A3><BA>1
483496
483495
483494
483493
483492
483491
I see a bunch of nonprintable characters here. How do I remove them using sed
/tr
?
My try was 's/\([0-9][0-9]*\)/\1/g'
, but it doesn't work.
EDIT: Okay, let's go further down the source. The numbers are extracted from this file:
487451"><img src="Manage/pic/20100901/Adidas running-429.JPG" alt="Adidas running-429" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
487450"><img src="Manage/pic/20100901/Adidas fs 1<A3><BA>1-060.JPG" alt="Adidas fs 1<A3><BA>1-060" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
The first line is perfectly normal and what most of the lines are. The second is "corrupted". I'd just like to extract the number at the beginning (using 's/\([0-9][0-9]*\).*/\1/g'
, but somehow the nonprintables get into the regex, which should stop at "
.
EDIT II: Here's a clarification: There are no brackets in the text file. These are character codes of nonprintable characters. The brackets are there because I copied the file from less
. Mac's Terminal, on the other hand, uses ??
to represent such characters. I bet xterm
on my Ubuntu would print that white oval with a question mark.
Upvotes: 2
Views: 5904
Reputation: 1
If the data always is like the sample, deleting from the less-than to the end of the line would work fine. sed -i "s/<.*$//" file
Upvotes: -2
Reputation: 4901
If you know the crap will always be inside brackets, why not delete that crap?
sed 's/<[^>]*>//g'
EDIT: Thanks, Mike that makes sense. In that case, how about:
sed 's/([0-9]+).*/\1/g'
Upvotes: 0
Reputation: 785146
Try this sed command:
sed 's/^\([0-9][0-9]*\).*$/\1/' file.txt
487451
487450
487449
487448
487447
487446
487445
484300
484299
484297
484296
484295
484294
484293
483496
483495
483494
483493
483492
483491
Upvotes: 0
Reputation: 753725
Classic job for either sed
's or Unix's tr
command.
sed 's/[^0-9]//g' $file
(Anything that is not a digit - or newline - is deleted.)
tr -cd '0-9\012' < $file > $file.1
Delete (-d
) the complement (-c
) of the digits and newline...
Upvotes: 7
Reputation: 3870
You missed the bit where you match the rest of the line.
sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g'
^^^^^^^
Upvotes: 2