Gupta

Reputation: 314

Why does the code match only the first and last phone numbers rather than all of them?

I am trying to extract phone numbers with a regex from the file below, which contains many of them.

import re
import pandas as pd  

url = "https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python-Regular-Expressions/data.txt"
 # the file has multiple phone nos.

address = str(pd.read_fwf(url,header=None))
phoneno = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d") # phone nos

# finditer returns an iterator of match objects
matches = phoneno.finditer(address)

for match in matches:
    print(match)

I expected many matches, but it gives just 2:

<re.Match object; span=(122, 134), match='615-555-7164'>
<re.Match object; span=(437, 449), match='900-555-6426'>

Upvotes: 1

Views: 93

Answers (2)

Wiktor Stribiżew

Reputation: 626794

The issue is that str(df) returns the DataFrame's abbreviated display representation, which shows only the first and last few rows:

>>> address = str(pd.read_fwf(url,header=None))
>>> print(address)
                                           0
0                                Dave Martin
1                               615-555-7164
2         173 Main St., Springfield RI 55924
3                  [email protected]
4                             Charles Harris
..                                       ...
395                [email protected]
396                           Charles Miller
397                             900-555-6426
398  207 Washington St., Blackwater MA 24886
399             [email protected]

[400 rows x 1 columns]

This string contains only two phone numbers, which is exactly what your loop prints.
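The truncation can be reproduced without the network. The following is a minimal sketch, assuming a made-up 400-row DataFrame (the names, numbers and addresses are hypothetical stand-ins for the real file) and the default pandas display options, which are pinned explicitly here:

```python
import re
import pandas as pd

# Hypothetical stand-in for the real file: 400 rows, every 4th row a phone number.
rows = []
for i in range(100):
    rows += [f"Person {i}", f"555-555-{i:04d}", f"{i} Main St.", f"person{i}@example.com"]
df = pd.DataFrame(rows)

phoneno = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

# Pin the display options to the pandas defaults so the result is reproducible.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    truncated = str(df)      # abbreviated view: head rows, "..", tail rows

full = df.to_string()        # renders every row

n_truncated = len(phoneno.findall(truncated))
n_full = len(phoneno.findall(full))
print(n_truncated, n_full)   # the truncated string yields far fewer matches
```

Searching the abbreviated string can only find numbers that happen to sit in the displayed head and tail rows, while to_string() exposes all of them.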

You can get all of them by filtering the column directly:

data = pd.read_fwf(url,header=None)
matches = list(filter(phoneno.fullmatch, data[0]))
print(matches)
# => ['615-555-7164', '800-555-5669', '560-555-5153', '900-555-9340', '714-555-7405', '800-555-6771', '783-555-4799', '516-555-4615', '127-555-1867', '608-555-4938', '568-555-6051', '292-555-1875', '900-555-3205', '614-555-1166', '530-555-2676', '470-555-2750', '800-555-6089', '880-555-8319', '777-555-8378', '998-555-7385', '800-555-7100', '903-555-8277', '196-555-5674', '900-555-5118', '905-555-1630', '203-555-3475', '884-555-8444', '904-555-8559', '889-555-7393', '195-555-2405', '321-555-9053', '133-555-1711', '900-555-5428', '760-555-7147', '391-555-6621', '932-555-7724', '609-555-7908', '800-555-8810', '149-555-7657', '130-555-9709', '143-555-9295', '903-555-9878', '574-555-3194', '496-555-7533', '210-555-3757', '900-555-9598', '866-555-9844', '669-555-7159', '152-555-7417', '893-555-9832', '217-555-7123', '786-555-6544', '780-555-2574', '926-555-8735', '895-555-3539', '874-555-3949', '800-555-2420', '936-555-6340', '372-555-9809', '890-555-5618', '670-555-3005', '509-555-5997', '721-555-5632', '900-555-3567', '147-555-6830', '582-555-3426', '400-555-1706', '525-555-1793', '317-555-6700', '974-555-8301', '800-555-3216', '746-555-4094', '922-555-1773', '711-555-4427', '355-555-1872', '852-555-6521', '691-555-5773', '332-555-5441', '900-555-7755', '379-555-3685', '127-555-9682', '789-555-7032', '783-555-5135', '315-555-6507', '481-555-5835', '365-555-8287', '911-555-7535', '681-555-2460', '274-555-9800', '800-555-1372', '300-555-7821', '133-555-3889', '705-555-6863', '215-555-9449', '988-555-6112', '623-555-3006', '192-555-4977', '178-555-4899', '952-555-3089', '900-555-6426']

Each phone number is a separate item in the column. Hence, all you need is to keep the items that fully match your pattern.

You may also improve the regex a bit by declaring it as

phoneno = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

The fullmatch method returns a match object (which is truthy) only if the whole string matches the pattern; otherwise it returns None, so filter drops that item.
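As a small illustration of that difference, here is a sketch comparing fullmatch with search on a few made-up strings (the sample values are assumptions, not taken from the file):

```python
import re

phoneno = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

# Hypothetical sample items, similar in shape to rows of the column.
samples = ["615-555-7164", "Call 615-555-7164 now", "615.555.7164", "Dave Martin"]

for s in samples:
    # fullmatch needs the entire string to match; search accepts a match anywhere.
    print(repr(s), bool(phoneno.fullmatch(s)), bool(phoneno.search(s)))

# Keep only the items that consist of nothing but a phone number.
only_phones = [s for s in samples if phoneno.fullmatch(s)]
print(only_phones)
```

If a row could contain a phone number mixed with other text, search or finditer on each item would be the more suitable choice.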

Upvotes: 1

Tuan Bao

Reputation: 266

Here are two ways to read all the text from the URL and then find every match of the regex \d{3}[-.]\d{3}[-.]\d{4}:

1. Use pandas: read the URL as a single column, convert the whole DataFrame to a string with to_string(), then search that string with the regex.

#python 3x
import pandas
import re

url = "https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python-Regular-Expressions/data.txt"

#regex
phones=re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')

data = pandas.read_fwf( url, header=None )
DATA_col0_as_string=data.to_string( )

#result
matches=phones.finditer( DATA_col0_as_string )
for matchObject in matches:
    print( matchObject )

Output:

<re.Match object; span=(122, 134), match='615-555-7164'>
<re.Match object; span=(302, 314), match='800-555-5669'>  
...
<re.Match object; span=(17762, 17774), match='952-555-3089'>
<re.Match object; span=(17942, 17954), match='900-555-6426'>

2. Use urllib to download the whole text of the URL as one string, then search that string with the regex.

#python 3x
import urllib.request as uRequest
import re

url = "https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python-Regular-Expressions/data.txt"

#read all text of your url
addresses=uRequest.urlopen( url, timeout=2 ).read( ).decode( 'utf8' )

#regex
phones=re.compile( r'\d{3}[-.]\d{3}[-.]\d{4}' )

#result
matches =phones.finditer( addresses )
for matchObject in matches:
    print( matchObject )

Output:

<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
...
<re.Match object; span=(8648, 8660), match='952-555-3089'>
<re.Match object; span=(8741, 8753), match='900-555-6426'>

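Note: approach 1 is closest to what you were trying, but the spans from approach 2 are the ones that reflect where the phone numbers actually sit in the file, because to_string() adds an index column and padding to every row.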
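To see why the spans differ, here is a rough sketch using only the standard library. The padded string is a hand-built imitation of a table rendering (an index plus right-aligned values), not the exact to_string() output, and the three sample lines are assumptions:

```python
import re

phones = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

# Raw text, as it would arrive from the file.
raw = "Dave Martin\n615-555-7164\n173 Main St.\n"

# Imitation of a table rendering: row index, two spaces, right-aligned value.
lines = raw.splitlines()
width = max(len(line) for line in lines)
padded = "\n".join(f"{i}  {line:>{width}}" for i, line in enumerate(lines))

raw_span = phones.search(raw).span()
padded_span = phones.search(padded).span()

print(raw_span)      # offsets into the raw text
print(padded_span)   # offsets into the padded rendering, shifted by index and padding
```

The matched text is the same in both cases; only the offsets change, since a span is always relative to the string that was searched.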

Upvotes: 1
