onlyf

Reputation: 883

Extracting numeric pattern from file line

I have a file that has the following format:

 EDouble entry for scenario XX AAA 70337262003 Line 000000003350
 EDouble entry for scenario XX AAA 70337262003 Line 000000003347
 EDouble entry for scenario XX AAA 71375201001 Line 000000003353
 EDouble entry for scenario XX AAA 71375201001 Line 000000003351
 EDouble entry (different date/time) for scenario YY AAA 10722963407 Line   000000000447
 EDouble entry for scenario YY AAA 55173006602 Line 000000002868
 EDouble entry (different date/time) for scenario YY AAA 60404822801 Line 000000003285

What I want to do is strip away all the alphabetic characters and output a file that contains only the first number from each line:

70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801

I've thought of a couple of approaches, but I don't have a working solution yet. I could strip all alphabetic characters with:

tr -d '[:alpha:]'

but I would still need to process the file further to separate the first number from the second. sed could perhaps provide a simpler solution, since the second number always starts with 0. I tried

sed -n 's/.*\([1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9]\).*/\1/p'

to find the pattern and print only the match, but the above command doesn't output anything. Could someone help me, please? It doesn't have to be done with sed; I imagine awk with gsub, or grep, could do something similar?

Upvotes: 1

Views: 67

Answers (5)

Jahid

Reputation: 22428

With grep you can do this:

grep -o '[1-9][0-9]\{10\}' file

With sed:

sed -n 's/.*\([1-9][0-9]\{10\}\).*/\1/p' file

Targeting exactly 11 digits leaves a narrow margin for error, because the numbers starting with 0 are 12 digits long, so an 11-digit run inside one of them could match as well. A more robust solution that accounts for this would be:

sed -n 's/.*[[:blank:]]\([1-9][0-9]\{10\}\).*/\1/p' file

i.e., make sure a [[:blank:]] is matched immediately before the number.
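To see why the blank matters, here is a small sketch. The sample line is hypothetical: its trailing line number is large enough that a non-zero digit is followed by ten more digits, which does not happen in the data from the question.

```shell
# Hypothetical line: the 12-digit line number still starts with 0,
# but contains an 11-digit run that begins with a non-zero digit.
line=' EDouble entry for scenario XX AAA 70337262003 Line 012345678901'

# Without the blank, the greedy .* pushes the capture as far right as
# it can, so the 11 digits are taken from inside the line number.
unanchored=$(printf '%s\n' "$line" | sed -n 's/.*\([1-9][0-9]\{10\}\).*/\1/p')

# With the blank, the capture must start at the beginning of a field,
# and the line number field starts with 0, so it cannot match.
anchored=$(printf '%s\n' "$line" | sed -n 's/.*[[:blank:]]\([1-9][0-9]\{10\}\).*/\1/p')

printf 'unanchored: %s\nanchored:   %s\n' "$unanchored" "$anchored"
```

Here the unanchored command prints 12345678901, while the anchored one prints 70337262003.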

Upvotes: 1

Tyl

Reputation: 5252

If you prefer sed, use this:

sed -rn "s@.*([1-9][0-9]{10}).*@\1@p" file.txt
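As a side note, -r is the GNU spelling of the extended-regex switch. A sketch of the same command with -E, which both GNU sed and BSD sed accept as far as I know, run on one sample line from the question (the variable names are only for illustration):

```shell
# One line copied from the question's sample data.
sample=' EDouble entry for scenario YY AAA 55173006602 Line 000000002868'

# Same substitution as above, with -E instead of -r for extended regexes.
result=$(printf '%s\n' "$sample" | sed -En 's@.*([1-9][0-9]{10}).*@\1@p')

printf '%s\n' "$result"
```

This prints 55173006602 for the sample line.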

Upvotes: 2

Benjamin W.

Reputation: 52431

This one extracts a group of digits followed by a word boundary, but not followed by the end of the line:

$ grep -Po '\d+\b(?!$)' infile
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
  • -P enables Perl regular expressions
  • -o retains only the match
  • \d+\b greedily matches digits followed by a word boundary
  • (?!$) is a "negative look-ahead": if the next character is the end of the line, don't match

Upvotes: 1

Cyrus

Reputation: 88774

Print the third-to-last column:

awk '{print $(NF-2)}' file

Output:

70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
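This relies on every line having the same layout. If the file might contain other lines, a guarded variant is a possible sketch; the "unrelated header" line below is a hypothetical stray line, not something from the question:

```shell
# Two lines from the question plus one hypothetical stray line.
input=' EDouble entry for scenario XX AAA 70337262003 Line 000000003350
 EDouble entry (different date/time) for scenario YY AAA 10722963407 Line   000000000447
unrelated header'

# Print the third-to-last field only if the line has at least three
# fields and that field consists of exactly 11 digits.
ids=$(printf '%s\n' "$input" |
  awk 'NF >= 3 && length($(NF-2)) == 11 && $(NF-2) ~ /^[0-9]+$/ { print $(NF-2) }')

printf '%s\n' "$ids"
```

The NF >= 3 check comes first so that $(NF-2) is never evaluated on a line with fewer than three fields.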

Upvotes: 2

riteshtch

Reputation: 8769

I see that AAA is constant in all rows and always comes directly before the number.

Therefore you can use this:

$ grep -oP '(?<=AAA\s)\s*\d+' data
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
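One caveat: if more than one whitespace character ever separates AAA from the number, the extra whitespace ends up in the output, because \s* is part of the match. A sketch of an alternative using \K, which drops everything matched before it; this assumes a grep built with PCRE support, and the sample line with several spaces after AAA is hypothetical:

```shell
# Hypothetical line with four spaces between AAA and the number.
line=' EDouble entry for scenario YY AAA    10722963407 Line 000000000447'

# Look-behind version: the three spaces after the first one are kept.
with_lookbehind=$(printf '%s\n' "$line" | grep -oP '(?<=AAA\s)\s*\d+')

# \K version: AAA and all following whitespace are discarded.
with_keep=$(printf '%s\n' "$line" | grep -oP 'AAA\s+\K\d+')

printf '[%s]\n[%s]\n' "$with_lookbehind" "$with_keep"
```

The brackets in the output make the leading spaces of the first result visible; the second result is just the 11 digits.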

Upvotes: 1
