DeathNet123

Reputation: 86

Using Python to access web data with a regular expression is not working

I am taking the Python for Everybody course on Coursera, and I just learned how to access a file on the web with Python.

What I am trying to do here is extract the email addresses from the lines that start with From:, but I am getting nothing.

There are emails in the lines starting with From:, because I have already done this with the file-handling method. It does not work when I try it on the file on the server, so I guess it has something to do with whitespace.

So anyway, guys, please help me, I am stuck.

import socket
import re
dic = dict()
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    mysock.connect(('data.pr4e.org', 80))
except:
    print("Can't find the server.\nCheck your internet Connection")
cmd = 'GET http://data.pr4e.org/mbox-short.txt HTTP/1.0\r\n\r\n'.encode()
try:
    mysock.send(cmd)
except:
    print("Connection Lost:\nCheck your Internet Connection")
while True:
    data = mysock.recv(512)
    if len(data)  < 1:
        break
    data = data.decode()
    data = data.rstrip()
    k = re.findall('^From:.(\S+@\S+)', data)
    if (len(k)) > 0:
        print(k)

This is the link from which you can download the file.

Upvotes: 2

Views: 165

Answers (2)

DeathNet123

Reputation: 86

Well, I found a better way to do what I was doing here. It can be done more easily and efficiently with the urllib.request library.

import urllib.request, urllib.parse, urllib.error
import re

fhand = urllib.request.urlopen('http://data.pr4e.org/mbox-short.txt')
for line in fhand:
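    # urlopen yields bytes, so decode each line before matching a str pattern
    k = re.findall(r'(?m)^From:\s*(\S+@\S+)', line.decode())
    # a single line holds at most one match, so test for a non-empty result
    if len(k) > 0:
        print(k)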
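
Part of why this is easier: iterating over the response gives complete lines, while recv(512) in the question returns arbitrary 512-byte pieces that can cut a line in half. If the socket approach is kept, the pieces probably need to be joined before matching. Below is a minimal sketch of that idea. The helper name extract_from_emails and the sample chunks are made up for illustration, and no network access is involved.

```python
import re


def extract_from_emails(chunks):
    """Join received byte chunks, then pull addresses from 'From:' lines.

    `chunks` is any iterable of bytes, e.g. the pieces returned by recv().
    This helper is hypothetical and not part of the original code.
    """
    # Join first: a chunk boundary can fall in the middle of a line.
    text = b"".join(chunks).decode()
    # (?m) lets ^ match at the start of every line in the joined text.
    return re.findall(r'(?m)^From:\s*(\S+@\S+)', text)


# Made-up pieces; the second address is split across two chunks.
sample_chunks = [
    b"From: alice@example.com\nSubject: hi\nFrom: bo",
    b"b@example.org\nSubject: bye\n",
]
emails = extract_from_emails(sample_chunks)
print(emails)  # ['alice@example.com', 'bob@example.org']
```

In the question's loop, the same effect could be had by appending each recv() result to a list and calling the regex once after the loop ends.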

Upvotes: -1

Wiktor Stribiżew

Reputation: 626952

You may get the emails using

k = re.findall(r'(?m)^From:\s*(\S+@\S+)', data)

See the regex demo.

Details

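  • (?m)^ - start of a line ((?m) enables multiline mode, so ^ matches after every newline, not only at the start of the string)
  • From: - a literal string
  • \s* - zero or more whitespace characters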
  • (\S+@\S+) - Capturing group 1 (the output of re.findall will only contain this value): one or more non-whitespace chars, @ and one or more non-whitespace chars.
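
The effect of (?m) can be checked offline. The following is a minimal sketch: the sample text is made up for illustration and only stands in for one decoded chunk of the response.

```python
import re

# Hypothetical sample standing in for one decoded chunk of the response.
data = (
    "X-Header: value\n"
    "From: alice@example.com\n"
    "Subject: hello\n"
    "From: bob@example.org\n"
)

pattern = r'^From:\s*(\S+@\S+)'

# Without multiline mode, ^ only matches at the very start of the string.
without_flag = re.findall(pattern, data)

# With (?m), ^ also matches right after every newline.
with_flag = re.findall('(?m)' + pattern, data)

print(without_flag)  # []
print(with_flag)     # ['alice@example.com', 'bob@example.org']
```

This would explain the empty output in the question: each chunk contains many lines, and a From: line is found only if it happens to sit at the very start of the chunk.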

Upvotes: 3
