Reputation: 23
Forgive me if this is asked and answered. If so, chalk it up to my being new to programming and not knowing enough to search properly.
I have a need to read in a file containing a series of several hundred phrases, such as names or email addresses, one per line, to be used as part of a compiled search term - pattern = re.search(name). The 'pattern' variable will be used to search another file of over 5 million lines to identify and extract select fields from relevant lines.
The name file being read in would look like this:
John\n
Bill\n
[email protected]\n
Sally\n
So far I have the below code which does not error out, but also does not process and close out. If I pass the names manually using slightly different code with a sys.argv[1], everything works fine. The code (which should be) in bold is the area I am having problems with - starting at "lines = open...."
import sys
import re
import csv
import os
searchdata = open("reallybigfile", "r")
Certfile = csv.writer(open('Certfile.csv', 'ab'), delimiter=',')
**lines = open("Filewithnames.txt", 'r')
while True:
    for line in lines:
        line.rstrip('\n')
    lines.seek(0)
    for nam in lines:
        pat = re.compile(nam)**
        for f in searchdata.readlines():
            if pat.search(f):
                fields = f.strip().split(',')
                Certfile.writerow([nam, fields[3], fields[4]])
lines.close()
The code at the bottom (starting "for f in searchdata.readlines():") locates, extracts and writes the fields fine. I have been unable to find a way to read in the Filewithnames.txt file and have it use each line. It either hangs, as with this code, or it reads all lines of the file to the last line and returns data only for the last line, e.g. 'Sally'.
Thanks in advance.
Upvotes: 2
Views: 1604
Reputation: 1828
"while True" is an infinite loop, and I can see no way to break out of it. That will definitely cause the program to run forever without throwing an error.
Remove the "while True" line and de-indent that loop's code, and see what happens.
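To illustrate, here is a minimal sketch showing that a plain "for" loop over a file already visits each line once and then stops on its own, so no outer "while True" is needed. The io.StringIO object and its contents are stand-ins I made up for Filewithnames.txt so the snippet runs by itself.

```python
import io

# Stand-in for open("Filewithnames.txt", "r"); the contents are invented for the demo.
lines = io.StringIO("John\nBill\nSally\n")

names = []
for line in lines:                   # the loop ends by itself at end of file
    names.append(line.rstrip("\n"))  # rstrip() returns a new string, so keep the result

print(names)
```

Note that rstrip() does not change the line in place; the stripped value has to be stored somewhere, as the list does here.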
EDIT:
I have resolved a few issues, as commented, but I will leave you to figure out the precise regex you need to accomplish your goal.
import sys
import re
import csv
import os
searchdata = open("c:\\dev\\in\\1.txt", "r")
# Certfile = csv.writer(open('c:\\dev\\Certfile.csv', 'ab'), delimiter=',')  # moved to later to ensure the file will be closed
lines = open("c:\\dev\\in\\2.txt", 'r')
pats = []  # A list of compiled patterns
for line in lines:
    line.rstrip()  # NOTE: rstrip() returns a new string; the result is discarded, so this loop changes nothing
lines.seek(0)
# Add additional conditioning/escaping of input here (each line still ends with its newline).
for nam in lines:
    pats.append(re.compile(nam))
with open('c:\\dev\\Certfile.csv', 'ab') as outfile:  # This line opens the file
    Certfile = csv.writer(outfile, delimiter=',')  # This line wraps the file in a CSV writer
    for f in searchdata.readlines():
        for pat in pats:  # A loop for processing all of the patterns
            if pat.search(f) is not None:
                fields = f.strip().split(',')
                Certfile.writerow([pat.pattern, fields[3], fields[4]])
lines.close()
searchdata.close()
First of all, make sure to close all the files, including your output file.
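As a small sketch of one way to guarantee that, a "with" block closes the file when the block ends, even if an error occurs inside it. The temporary file and its contents below are assumptions made only so the example runs on its own.

```python
import os
import tempfile

# Temporary file used only so the sketch is runnable; the path is generated.
path = os.path.join(tempfile.mkdtemp(), "names.txt")

with open(path, "w") as out:
    out.write("John\nBill\n")

with open(path, "r") as lines:
    names = [line.rstrip("\n") for line in lines]

# Both files are already closed here, without any explicit close() call.
print(names, out.closed, lines.closed)
```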
As stated before, the "while True" loop was causing your program to run forever.
You need a regex, or a set of regexes, to cover all of your possible "names." A set of regexes makes for simpler code, so that is what I have done here, with an inner loop that tries every pattern against each line. This may not be the most efficient approach.
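If the per-pattern loop turns out to be too slow on a file with millions of lines, one alternative (not what the code above does) is to join all the names into a single alternation and compile it once. This is only a sketch: the names and sample lines are invented, and it assumes each name should be matched literally.

```python
import re

names = ["John", "Bill", "Sally"]  # hypothetical sample names

# re.escape makes each name match literally; "|" means "any one of these".
combined = re.compile("|".join(re.escape(n) for n in names))

sample_lines = [
    "1,2,3,cert-A,John Smith",
    "1,2,3,cert-B,nobody here",
    "1,2,3,cert-C,Sally Jones",
]

matches = []
for f in sample_lines:
    m = combined.search(f)
    if m is not None:
        fields = f.strip().split(",")
        matches.append([m.group(0), fields[3], fields[4]])

print(matches)
```

One trade-off: search() reports only the first name found on a line, so a line containing several names produces one row rather than one row per name.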
I believe you need additional parsing of the input file to give you clean regular expressions. I have left some space for you to do that.
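One possible way to do that conditioning, as a sketch only: strip the newline and surrounding whitespace from each name, skip blank lines, and pass the result through re.escape so that characters such as "." in an email address are matched literally. Whether you want literal matching is an assumption on my part, and build_patterns is a hypothetical helper name, not something from your code.

```python
import io
import re

def build_patterns(name_file):
    """Return one compiled pattern per non-empty line of name_file."""
    pats = []
    for raw in name_file:
        name = raw.strip()   # drop the trailing newline and stray spaces
        if not name:
            continue         # ignore blank lines
        pats.append(re.compile(re.escape(name)))
    return pats

# In-memory stand-in for the names file; the address is a made-up example.
demo = io.StringIO("John\n\nsally.jones@example.com\n")
pats = build_patterns(demo)

line = "7,8,9,sally.jones@example.com,John\n"
print([p.search(line) is not None for p in pats])
```

Without the strip, each pattern would still end in a newline and would only match a name sitting at the very end of a line.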
Hope that helps!
Upvotes: 2