Reputation: 314
I am trying to extract all the phone numbers from the file below, which contains many of them, using a regex:
import re
import pandas as pd
url = "https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python-Regular-Expressions/data.txt"
# the file contains many phone numbers
address = str(pd.read_fwf(url,header=None))
phoneno = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d") # pattern for phone numbers
# finditer returns an iterator of match objects
matches = phoneno.finditer(address)
for match in matches:
    print(match)
I expected many matches, but I get only two:
<re.Match object; span=(122, 134), match='615-555-7164'>
<re.Match object; span=(437, 449), match='900-555-6426'>
Upvotes: 1
Views: 93
Reputation: 626794
The issue is that str(df) returns the DataFrame's display representation, which is truncated to show only some of the rows:
>>> address = str(pd.read_fwf(url,header=None))
>>> print(address)
0
0 Dave Martin
1 615-555-7164
2 173 Main St., Springfield RI 55924
3 [email protected]
4 Charles Harris
.. ...
395 [email protected]
396 Charles Miller
397 900-555-6426
398 207 Washington St., Blackwater MA 24886
399 [email protected]
[400 rows x 1 columns]
This string contains only two phone numbers, which is exactly what you get.
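To see the truncation in isolation, here is a minimal sketch that does not need the network. The 100 phone numbers are made up for the illustration (they are not from the file), and the exact number of rows that str() keeps depends on your pandas display options, so the sketch only compares the two counts.

```python
import re
import pandas as pd

phoneno = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

# Hypothetical stand-in for the real file: 100 made-up numbers, one per row.
numbers = [f"555-010-{i:04d}" for i in range(100)]
df = pd.DataFrame({0: numbers})

truncated = str(df)     # display form: middle rows are replaced by ".."
full = df.to_string()   # renders every row

truncated_count = len(phoneno.findall(truncated))
full_count = len(phoneno.findall(full))
print(truncated_count, full_count)
```

With default display settings the first count should be much smaller than the second, which mirrors the two matches in the question.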
You can get all the phone numbers using
data = pd.read_fwf(url,header=None)
matches = list(filter(phoneno.fullmatch, data[0]))
>>> matches
# => ['615-555-7164', '800-555-5669', '560-555-5153', '900-555-9340', '714-555-7405', '800-555-6771', '783-555-4799', '516-555-4615', '127-555-1867', '608-555-4938', '568-555-6051', '292-555-1875', '900-555-3205', '614-555-1166', '530-555-2676', '470-555-2750', '800-555-6089', '880-555-8319', '777-555-8378', '998-555-7385', '800-555-7100', '903-555-8277', '196-555-5674', '900-555-5118', '905-555-1630', '203-555-3475', '884-555-8444', '904-555-8559', '889-555-7393', '195-555-2405', '321-555-9053', '133-555-1711', '900-555-5428', '760-555-7147', '391-555-6621', '932-555-7724', '609-555-7908', '800-555-8810', '149-555-7657', '130-555-9709', '143-555-9295', '903-555-9878', '574-555-3194', '496-555-7533', '210-555-3757', '900-555-9598', '866-555-9844', '669-555-7159', '152-555-7417', '893-555-9832', '217-555-7123', '786-555-6544', '780-555-2574', '926-555-8735', '895-555-3539', '874-555-3949', '800-555-2420', '936-555-6340', '372-555-9809', '890-555-5618', '670-555-3005', '509-555-5997', '721-555-5632', '900-555-3567', '147-555-6830', '582-555-3426', '400-555-1706', '525-555-1793', '317-555-6700', '974-555-8301', '800-555-3216', '746-555-4094', '922-555-1773', '711-555-4427', '355-555-1872', '852-555-6521', '691-555-5773', '332-555-5441', '900-555-7755', '379-555-3685', '127-555-9682', '789-555-7032', '783-555-5135', '315-555-6507', '481-555-5835', '365-555-8287', '911-555-7535', '681-555-2460', '274-555-9800', '800-555-1372', '300-555-7821', '133-555-3889', '705-555-6863', '215-555-9449', '988-555-6112', '623-555-3006', '192-555-4977', '178-555-4899', '952-555-3089', '900-555-6426']
All the phone numbers are separate items in the column, so all you need is to keep the items that fully match your pattern.
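As an alternative to filter, the same selection can probably be written with pandas string methods. This is only a sketch: it assumes pandas 1.1 or later (where Series.str.fullmatch is available), and the small frame below is a made-up stand-in for the real data.

```python
import pandas as pd

pattern = r"\d{3}[-.]\d{3}[-.]\d{4}"

# Small stand-in frame; the rows are illustrative, not the full file.
data = pd.DataFrame({0: [
    "Dave Martin",
    "615-555-7164",
    "173 Main St., Springfield RI 55924",
    "900.555.6426",
]})

# Boolean mask: True where the whole cell matches the pattern.
mask = data[0].str.fullmatch(pattern)
phones = data.loc[mask, 0].tolist()
print(phones)
```

If the column could contain missing values, the mask would need extra handling (for example the na argument of str.fullmatch); the sketch assumes every cell is a string.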
You can also shorten the regex by using quantifiers:
phoneno = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")
The .fullmatch method returns a match object only if the whole string matches the regex pattern, and None otherwise.
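A short sketch of the difference between fullmatch and search, using sample strings written for this illustration:

```python
import re

phoneno = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

# Sample rows made up for the comparison.
rows = ["615-555-7164", "173 Main St.", "call 615-555-7164 now", "615.555.7164"]

# fullmatch: the entire string must be a phone number.
full = [r for r in rows if phoneno.fullmatch(r)]
# search: a phone number anywhere in the string is enough.
partial = [r for r in rows if phoneno.search(r)]

print(full)     # ['615-555-7164', '615.555.7164']
print(partial)  # ['615-555-7164', 'call 615-555-7164 now', '615.555.7164']
```

Since each phone number sits in its own row here, fullmatch is the stricter and safer choice.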
Upvotes: 1
Reputation: 266
Here are two ways to read all the text from the URL and return every match object for the regex \d{3}[-.]\d{3}[-.]\d{4}
1. Use pandas: parse the URL as a single column, convert it to a string, then search for all phone numbers with the regex.
#python 3x
import pandas
import re
url = "https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python-Regular-Expressions/data.txt"
#regex
phones=re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
data = pandas.read_fwf( url, header=None )
DATA_col0_as_string=data.to_string( )
#result
matches=phones.finditer( DATA_col0_as_string )
for matchObject in matches:
    print( matchObject )
Output:
<re.Match object; span=(122, 134), match='615-555-7164'>
<re.Match object; span=(302, 314), match='800-555-5669'>
...
<re.Match object; span=(17762, 17774), match='952-555-3089'>
<re.Match object; span=(17942, 17954), match='900-555-6426'>
2. Use urllib: fetch all the text at the URL as a string, then search for all phone numbers with the regex.
#python 3x
import urllib.request as uRequest
import re
url = "https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python-Regular-Expressions/data.txt"
#read all the text at the url
addresses=uRequest.urlopen( url, timeout=2 ).read( ).decode( 'utf8' )
#regex
phones=re.compile( r'\d{3}[-.]\d{3}[-.]\d{4}' )
#result
matches =phones.finditer( addresses )
for matchObject in matches:
    print( matchObject )
Output:
<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
...
<re.Match object; span=(8648, 8660), match='952-555-3089'>
<re.Match object; span=(8741, 8753), match='900-555-6426'>
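Note: approach 1 is closest to what you were trying, but its spans are offsets into the DataFrame's string rendering (which includes the index and padding). Approach 2 gives the spans of the phone numbers in the original text.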
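A small offline sketch of why the spans differ. The sample text below just reuses the first three lines shown earlier in this thread; it is an illustration, not the full file.

```python
import re
import pandas as pd

phones = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

# Sample text: first three lines of the data as shown above.
raw_text = "Dave Martin\n615-555-7164\n173 Main St., Springfield RI 55924\n"
frame = pd.DataFrame({0: raw_text.splitlines()})

# Offsets into the original text.
raw_span = phones.search(raw_text).span()
# Offsets into the rendered table (header, index and padding shift them).
rendered_span = phones.search(frame.to_string()).span()

print(raw_span)
print(rendered_span)
```

Slicing raw_text with raw_span gives back the phone number, while rendered_span is only meaningful inside the rendered string.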
Upvotes: 1