Reputation: 241
I have a file I'm trying to extract information from. The data is in a neat line-by-line format, with the fields on each line separated by commas.
I want to put it in a list, or do whatever else lets me extract the information at a specific index. The file is huge, with over 1,000,000,000 lines, and I have to extract the same index from every line in order to get the same piece of information. These are hashes I want from the files, so I was wondering how I'd find all the occurrences of hashes based on length.
import os
os.chdir('C:\HashFiles')
f = open('Part1.txt','r')
file_contents = f.readlines()
def linesA():
    for line in file_contents:
        lista = line.split(',')
print linesA()
This is all I have so far, and it just puts everything in a list which I can index from. But I want to output the data from those indexes to another file, and I am unable to because of the for statement. How can I get around this?
Wow, you guys are great. Now I have a problem: the file starts with information about the sponsor who provided the data. How do I bypass those lines and start from another line, since the lines I need begin about 100 lines down the file? At the moment I get an IndexError and am unable to figure out how to set a condition to handle it. I tried this condition but it didn't work: if line[:] != 15: continue
Most recent code to work with:
import csv
with open('c:/HashFiles/search_engine_primary.sql') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for i in xrange(47):
        inf.next()  # skip a line
    for line in inf:
        data = line.split(',')
        if str(line[0]) == 'GO':
            continue
        hash = data[15]
        outf.write(hash + '\n')
Upvotes: 1
Views: 2535
Reputation: 56634
You can process the file line-by-line, like so:
with open('c:/HashFiles/Part1.txt') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for line in inf:
        data = line.split(',')
        hash = data[4]
        outf.write(hash + '\n')
If you want to separate the hashes by length, maybe something like:
class HashStorage(object):
    def __init__(self, fname_fmt):
        self.fname_fmt = fname_fmt
        self.hashfile = {}

    def thefile(self, hash):
        # look up (or lazily open and cache) one output file per hash length
        hashlen = len(hash)
        try:
            return self.hashfile[hashlen]
        except KeyError:
            newfile = open(self.fname_fmt.format(hashlen), 'w')
            self.hashfile[hashlen] = newfile
            return newfile

    def write(self, hash):
        self.thefile(hash).write(hash + '\n')

    def __del__(self):
        for f in self.hashfile.itervalues():
            f.close()
        del self.hashfile

store = HashStorage('c:/HashFiles/hashes{}.txt')

with open('c:/HashFiles/Part1.txt') as inf:
    for line in inf:
        data = line.split(',')
        hash = data[4]
        store.write(hash)
Edit: is there any way to identify sponsor lines - for example, do they start with "#"? If so, you could filter like
with open('c:/HashFiles/Part1.txt') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for line in inf:
        if not line.startswith('#'):
            data = line.split(',')
            hash = data[4]
            outf.write(hash + '\n')
otherwise, if you have to skip N lines - this is nasty, because what if the number changes? - you can instead
with open('c:/HashFiles/Part1.txt') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for i in xrange(N):
        inf.next()  # skip a line
    for line in inf:
        data = line.split(',')
        hash = data[4]
        outf.write(hash + '\n')
Edit2:
with open('c:/HashFiles/search_engine_primary.sql') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for i in xrange(47):
        inf.next()  # skip a line
    for line in inf:
        data = line.split(',')
        if len(data) > 15:  # skip any line without enough data items
            hash = data[15]
            outf.write(hash + '\n')
Does this still give you errors??
Upvotes: 2
Reputation: 143047
You could try to process the file line-by-line
with open('Part1.txt') as inf:
    for line in inf:
        # do your processing
        # ... line.split(',') etc...
rather than using readlines()
which reads all of the data into memory at once.
Also, depending on what you are doing, a list comprehension could be helpful in creating your desired output list from the file you are reading.
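For instance, a minimal sketch (the sample lines here are made up, standing in for your actual file contents) that collects the field at index 4 from every line in one pass:

```python
# Hypothetical sample lines; in your case these would come from the file.
lines = [
    "id1,x,y,z,5f4dcc3b5aa765d61d8327deb882cf99",
    "id2,x,y,z,098f6bcd4621d373cade4e832627b4f6",
]

# List comprehension: split each line on commas and keep only index 4.
hashes = [line.split(',')[4] for line in lines]
```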
NOTE: The benefit of using with
to open the file is that it will automatically close it for you when you are done, or an exception is encountered.
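Roughly speaking, the with form is shorthand for a try/finally block. A sketch of the equivalence (using a throwaway temp file so it's self-contained, rather than your actual Part1.txt):

```python
import os
import tempfile

# Create a small stand-in file so this sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'Part1.txt')
with open(path, 'w') as tmp:
    tmp.write('a,b,c\n')

# `with open(path) as f: ...` behaves roughly like this:
f = open(path)
try:
    first = f.readline()
finally:
    f.close()  # runs even if the body raises an exception
```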
UPDATE:
To skip the first N
lines of your input file you can change your code to this:
N = 100
with open('Part1.txt') as inf:
    for i, line in enumerate(inf, 1):
        if i <= N:       # still within the first N lines
            continue     # skip the processing
        print line       # process the line
I am using enumerate() to automatically generate line numbers. I start this counter at 1 (default is 0 if not specified).
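A quick illustration of what enumerate with a start value of 1 yields:

```python
# enumerate pairs each item with a counter, starting here at 1 instead of 0.
pairs = list(enumerate(['a', 'b', 'c'], 1))
# pairs is [(1, 'a'), (2, 'b'), (3, 'c')]
```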
Upvotes: 4
Reputation: 9704
import csv
import os

with open(os.path.join('C:\HashFiles', 'Part1.txt'), 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row
Upvotes: 1