Reputation: 1
So I have a program that checks proxies from a list, but it's very slow, so I added multiprocessing. My problem is that with multiprocessing the program only ever reads the first line of the text file, whereas without multiprocessing it works its way down the lines in the file. I think it has something to do with proxies = file.readline()
import requests
from lxml.html import fromstring
import time
import multiprocessing
from multiprocessing import Pool

#kind = input("socks4, socks5, http/https:\n")
kind = 'socks4'
checking = True
file = open("SOCKS4.txt", 'r')

def check():
    proxies = file.readline()
    proxys = {'http': kind + '://' + proxies, 'https': kind + '://' + proxies}
    url = ('http://checkip.dyndns.com/')
    try:
        response = requests.get(url, timeout = 2.5, proxies = proxys)
    except requests.exceptions.Timeout:
        print('Bad', proxies)
    except requests.exceptions.ConnectionError:
        print('Network problem', proxies)
    else:
        print('Good', proxies, 'Response time', response.elapsed)
        files = open('goods.txt', 'a+')
        files.write('\n' + proxies)

if __name__ == '__main__':
    while checking:
        p = multiprocessing.Process(target=check)
        p.start()
Upvotes: 0
Views: 1240
Reputation: 1567
I would suggest putting all the lines from the file into a queue and letting the child processes pick lines from the queue. It's similar to the solution from @salparadise, but you spawn new processes only once. Something like below:
def check(queue):
    for line in iter(queue.get, 'STOP'):
        line = line.strip()  # drop the trailing newline so it doesn't end up in the proxy URL
        proxys = {'http': kind + '://' + line, 'https': kind + '://' + line}
        url = ('http://checkip.dyndns.com/')
        try:
            response = requests.get(url, timeout = 2.5, proxies = proxys)
        except requests.exceptions.Timeout:
            print('Bad', line)
        except requests.exceptions.ConnectionError:
            print('Network problem', line)
        else:
            print('Good', line, 'Response time', response.elapsed)
            # "with" closes the filehandle when done.
            with open('goods.txt', 'a+') as files:
                files.write('\n' + line)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=check, args=(queue,))
    p.start()
    with open("SOCKS4.txt", 'r') as file_handle:
        for line in file_handle:
            queue.put(line)
    queue.put('STOP')  # sentinel so the worker's loop terminates
    p.join()
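If one worker process is not enough, the same queue can feed several of them; the only change is to start N workers and put one 'STOP' sentinel per worker. A minimal sketch of that variant (the worker count of 4 is an arbitrary choice):

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    # start several workers, all reading from the same queue
    workers = [multiprocessing.Process(target=check, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    with open("SOCKS4.txt", 'r') as file_handle:
        for line in file_handle:
            queue.put(line)
    for _ in workers:
        queue.put('STOP')  # one sentinel per worker so every loop terminates
    for w in workers:
        w.join()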
Upvotes: 0
Reputation: 5805
Since making HTTP requests is an I/O-bound operation, you should probably be using threads rather than multiprocessing; the latter is better suited to CPU-bound work, which is not what you are doing (see the thread-based sketch after the code below).
Since multiple processes are independent of each other (unless you use queues, shared memory, or files), each process gets its own file handle and reads the first line without any awareness of the others.
Change the function to take a line as an argument; that way each process handles one proxy entry:
def check(proxies):
    proxys = {'http': kind + '://' + proxies, 'https': kind + '://' + proxies}
    url = ('http://checkip.dyndns.com/')
    try:
        response = requests.get(url, timeout = 2.5, proxies = proxys)
    except requests.exceptions.Timeout:
        print('Bad', proxies)
    except requests.exceptions.ConnectionError:
        print('Network problem', proxies)
    else:
        print('Good', proxies, 'Response time', response.elapsed)
        # "with" closes the filehandle when done.
        with open('goods.txt', 'a+') as files:
            files.write('\n' + proxies)

if __name__ == '__main__':
    with open("SOCKS4.txt", 'r') as file_handle:  # "with" closes the filehandle when it is done
        # iterates through each line of the file
        for line in file_handle:
            p = multiprocessing.Process(target=check, args=(line,))  # feed each line to the function
            p.start()
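For the thread-based approach mentioned at the top of this answer, a minimal sketch using the standard-library concurrent.futures module could look like the following; it assumes the check function defined above, and the pool size of 20 is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    with open("SOCKS4.txt", 'r') as file_handle:
        lines = [l.strip() for l in file_handle if l.strip()]
    # Threads share one process, which is fine here because the work is
    # dominated by waiting on network I/O, not on the CPU.
    with ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(check, lines)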
Upvotes: 0
Reputation: 936
I think the problem may be that Python does not handle sharing a file object across multiple processes well. I simplified and changed your code and it seems to work better:
import multiprocessing

file = open("test.txt", 'r')

def check(proxies):
    print(proxies)

if __name__ == '__main__':
    while True:
        proxies = file.readline()
        p = multiprocessing.Process(target=check, args=(proxies,))
        p.start()
where test.txt is an example file I made:
test
asdf
1
2
3
4
This code seems to process all lines of the file correctly (though out of order):
$ python3.8 test.py | grep -v "^$"
test
1
asdf
2
4
3
You'll still need a way to stop the loop, which I don't do in this code.
In my version, I read the file serially in the main process, but still process the lines in multiple worker processes: each line is read in the parent and passed as an argument to the child. This may not be as fast as you want, but I'm not sure how to do it faster. It should still be pretty fast, because (once you integrate my changes) it does not wait for a response before starting the next request.
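For the stopping condition mentioned above: readline() returns an empty string once it reaches the end of the file, so one simple way to end the loop is:

if __name__ == '__main__':
    while True:
        proxies = file.readline()
        if not proxies:  # readline() returns '' at end of file
            break
        p = multiprocessing.Process(target=check, args=(proxies,))
        p.start()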
Upvotes: 1