Bilal Naseer

Reputation: 156

Extract numbers using regex and Notepad++

I have the following data (in one line):

<span id="ctb_0" onclick="show_hide_box(this);"
class="hide_icon r txtfont ltr">open</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Rayyan Real Investment</font>,
<span class="ltr txtfont">+92-3212459990</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Bukhari Properties</font>,
<span class="ltr txtfont">+92-3218248858</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Exact Properties</font>,
<span class="ltr txtfont">+92-3312044421</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Exact Properties</font>,
<span class="ltr txtfont">+92-3312044421</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Adeel Corporation</font>,
<span class="ltr txtfont">+923008253132</span>
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Adeel Corporation</font>,
<span class="ltr txtfont">+92-3008253132</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Z.S Associates</font>,
<span class="ltr txtfont">+92-3452431417</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Keystone Properties</font>,
<span class="ltr txtfont">+92-3353509187/301..</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Adeel Corporation</font>,
<span class="ltr txtfont">+92-3008253132</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Adeel Corporation</font>,
<span class="ltr txtfont">+92-3008253132</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Safeway Real Estate Consultant</font>,
<span class="ltr txtfont">+92-3218282885/345..</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Abdul Sattar & Sons</font>,
<span class="ltr txtfont">+92-3332107802, +9..</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Bismillah Real Estate</font>,
<span class="ltr txtfont">+92-3213336525, 03..</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Aiman Estate & Properties</font>,
<span class="ltr txtfont">+92-3212537535</span>,
<div class="description clr ltr txtfont">…</div>,
<font class="txtfont ltr">Aiman Estate & Properties</font>,
<span class="ltr txtfont">+92-3212537535</span>,

Using a regex in Notepad++, I want to turn it into:

923008929845
923318874928
923008275080
923452113010
923002024486
923218286664
923218286664
923212804245
923002555091
923212804245
923008289996
923003579717
923003579717
923003772227
923007048836

I have tried the following in Notepad++, but it is neither clean nor quick. I end up removing the HTML manually, which keeps me from finishing my data scraping quickly.

Find what: [a-z]|[A-Z]|[,.()_=;"+<>/:-]

Replace with: (Spacebar)

And I still see lots of random characters.

Upvotes: 1

Views: 1879

Answers (3)

Haji Rahmatullah

Reputation: 430

Try this:

Find what: \s.*\s.*?(\d+)-(\d{10})|.+
Replace with: $1$2

Note: this is what I have learned about regex so far. I'm not good at regex, but the above pattern worked correctly, except that 2 spaces are left in between the digits.

Upvotes: 0

Dave Bennett

Reputation: 11216

I don't have Notepad++, but something like this will get you most of the way there. It matches everything up to the end of the first occurrence of the number pattern you are seeking, and replaces that entire match with the captured number groups and a line feed. A Replace All should apply it to every occurrence.
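For illustration, the approach described above might look like the following Python sketch. Python's `re` module stands in for the editor's find/replace here, and both the pattern and the two-entry sample are assumptions made for the example, not the pattern from this answer.

```python
import re

# Hypothetical sample: two entries from the question, joined on one line.
html = (
    '<font class="txtfont ltr">Rayyan Real Investment</font>, '
    '<span class="ltr txtfont">+92-3212459990</span>, '
    '<font class="txtfont ltr">Bukhari Properties</font>, '
    '<span class="ltr txtfont">+92-3218248858</span>,'
)

# Lazily consume everything up to and including the next "+NN-NNNNNNNNNN",
# capturing only the two digit groups.
pattern = re.compile(r'.*?\+(\d{2})-(\d{10})', re.DOTALL)

# Each match is replaced by the captured digits plus a line feed.
result = pattern.sub(r'\1\2\n', html)

# Whatever follows the last number is left over, so keep digit-only lines.
numbers = [line for line in result.splitlines() if line.isdigit()]
print(numbers)
```

The text after the last number is not matched and stays in the output, which is why this only gets "most of the way there" and a final cleanup step is still needed.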


Upvotes: -1

Toto

Reputation: 91518

How about:

Find what: ^.*?\+(\d\d)-(\d{10}).*?$
Replace with: $1$2\n

Explanation:

^           : beginning of line
  .*?       : 0 or more of any character (not greedy)
  \+        : +, needs to be escaped because it's a special character in regex
  (\d\d)    : 2 digits captured in group 1
  -         : dash
  (\d{10})  : 10 digits captured in group 2
  .*?       : 0 or more of any character (not greedy)
$           : end of line
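To check the pattern outside Notepad++, here is a minimal Python sketch that applies the same regex line by line. The function name `extract_numbers` and the sample lines are assumptions for illustration, and `re.MULTILINE` is used so that `^` and `$` anchor at each line.

```python
import re

# Same pattern as above; MULTILINE makes ^ and $ match at every line.
PATTERN = re.compile(r'^.*?\+(\d\d)-(\d{10}).*?$', re.MULTILINE)

def extract_numbers(text):
    """Return country code + number for each line containing +NN-NNNNNNNNNN."""
    return [cc + num for cc, num in PATTERN.findall(text)]

# Hypothetical sample built from lines in the question.
sample = "\n".join([
    '<font class="txtfont ltr">Keystone Properties</font>,',
    '<span class="ltr txtfont">+92-3353509187/301..</span>,',
    '<span class="ltr txtfont">+923008253132</span>',
    '<span class="ltr txtfont">+92-3452431417</span>,',
])

print(extract_numbers(sample))
```

In this sketch the line containing `+923008253132` is skipped because it has no dash after the country code. If such entries should be kept, making the dash optional (`-?`) is one possible adjustment.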

Upvotes: 3
