Ferdinand

Reputation: 241

Parsing/extracting data from a text file; unable to make it work

I have a file I'm trying to extract information from. The file is in a neat line-by-line format, with the fields separated by commas.

I want to put it in a list, or do whatever I can, to extract the information at a specific index. The file is huge, with over 1000000000 lines, and I have to extract the same index from every line in order to get the same piece of information. These are HASHES I want from the files, so I was also wondering how I'd find all the occurrences of hashes based on length.

import os

os.chdir('C:\HashFiles')

f = open('Part1.txt', 'r')
file_contents = f.readlines()

def linesA():
    for line in file_contents:
        lista = line.split(',')

print linesA()

This is all I have so far, and it just puts everything in a list I can index into. But I want to output the data at those indexes to another file, and I can't because of the for statement. How can I get around this?

Wow, you guys are great. Now I have a problem: the file starts with information about the sponsor who provided the data. How do I bypass those lines and start from another line, since the lines I need begin about 100 lines down the file? At the moment I get an index error and can't figure out how to set a condition to counter it. I tried this condition, but it didn't work: if line[:] != 15: continue

Most recent code to work with:

import csv

with open('c:/HashFiles/search_engine_primary.sql') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for i in xrange(47):
        inf.next()       # skip a line

    for line in inf:
        data = line.split(',')
        if line.startswith('GO'):   # skip the 'GO' separator lines
            continue
        hash = data[15]
        outf.write(hash + '\n')

Upvotes: 1

Views: 2535

Answers (3)

Hugh Bothwell

Reputation: 56634

You can process the file line-by-line, like so:

with open('c:/HashFiles/Part1.txt') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for line in inf:
        data = line.split(',')
        hash = data[4]
        outf.write(hash + '\n')

If you want to separate the hashes by length, maybe something like:

class HashStorage(object):
    def __init__(self, fname_fmt):
        self.fname_fmt = fname_fmt
        self.hashfile = {}

    def thefile(self, hash):
        hashlen = len(hash)
        try:
            return self.hashfile[hashlen]
        except KeyError:
            newfile = open(self.fname_fmt.format(hashlen), 'w')
            self.hashfile[hashlen] = newfile
            return newfile

    def write(self, hash):
        self.thefile(hash).write(hash + '\n')

    def __del__(self):
        for f in self.hashfile.itervalues():
            f.close()
        del self.hashfile

store = HashStorage('c:/HashFiles/hashes{}.txt')

with open('c:/HashFiles/Part1.txt') as inf:
    for line in inf:
        data = line.split(',')
        hash = data[4]
        store.write(hash)
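One caveat: relying on __del__ to close the files is fragile, since Python doesn't guarantee when (or, during interpreter shutdown, whether) it runs. A sketch of the same idea as a context manager, so the files are closed deterministically; the class name and filename pattern here are illustrative:

```python
class LengthBucketedStorage(object):
    """Route each hash to a file named after the hash's length."""
    def __init__(self, fname_fmt):
        self.fname_fmt = fname_fmt
        self.hashfile = {}

    def write(self, hash):
        hashlen = len(hash)
        if hashlen not in self.hashfile:
            self.hashfile[hashlen] = open(self.fname_fmt.format(hashlen), 'w')
        self.hashfile[hashlen].write(hash + '\n')

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs even if the with-block raises, so no file is left open
        for f in self.hashfile.values():
            f.close()
        return False

with LengthBucketedStorage('hashes{}.txt') as store:
    for h in ['abc', 'defgh']:
        store.write(h)
# hashes3.txt now holds 'abc', hashes5.txt holds 'defgh'
```

The with-statement guarantees __exit__ runs, which is a stronger promise than __del__ gives you.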

Edit: is there any way to identify sponsor lines - for example, do they start with "#"? You could filter like

with open('c:/HashFiles/Part1.txt') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for line in inf:
        if not line.startswith('#'):
            data = line.split(',')
            hash = data[4]
            outf.write(hash + '\n')

otherwise, if you have to skip N lines - this is nasty, because what if the number changes? - you can instead

with open('c:/HashFiles/Part1.txt') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for i in xrange(N):
        inf.next()       # skip a line

    for line in inf:
        data = line.split(',')
        hash = data[4]
        outf.write(hash + '\n')
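As an aside, itertools.islice can do the same N-line skip without the explicit loop. A sketch, where the function name, column index, and sample data are purely illustrative:

```python
from itertools import islice

def extract_column(lines, n_skip, index):
    # islice(lines, n_skip, None) yields everything after the first n_skip items
    for line in islice(lines, n_skip, None):
        data = line.split(',')
        if len(data) > index:       # guard against short lines
            yield data[index]

# works on any iterable of lines, including an open file object
sample = ['header 1\n', 'header 2\n', 'a,b,c\n', 'd,e,f\n']
print(list(extract_column(sample, 2, 1)))  # prints ['b', 'e']
```

Because islice consumes from the same iterator as the later loop, the skipped lines are never buffered in memory.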

Edit2:

with open('c:/HashFiles/search_engine_primary.sql') as inf, open('c:/HashFiles/hashes.txt','w') as outf:
    for i in xrange(47):
        inf.next()       # skip a line

    for line in inf:
        data = line.split(',')
        if len(data) > 15:      # skip any line without enough data items
            hash = data[15]
            outf.write(hash + '\n')

Does this still give you errors??

Upvotes: 2

Levon

Reputation: 143047

You could try to process the file line-by-line:

with open('Part1.txt') as inf:
    for line in inf:
        # do your processing
        # ... line.split(',') etc...

rather than using readlines() which reads all of the data into memory at once.

Also, depending on what you are doing, a list comprehension could be helpful in creating your desired output list from the file you are reading.
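For example, the whole column of hashes could be built in one expression; the column index 1 and the sample lines below are just stand-ins for the real file:

```python
lines = ['a,h1,x\n', 'b,h2,y\n']   # stand-ins for lines read from Part1.txt
hashes = [line.split(',')[1] for line in lines]
print(hashes)  # prints ['h1', 'h2']
```

With a real file, `lines` would simply be the open file object from the with-statement above it, since iterating a file yields its lines.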

NOTE: The benefit of using with to open the file is that it will automatically close it for you when you are done, or an exception is encountered.
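Under the hood, the with-statement behaves roughly like a try/finally. A self-contained sketch using a throwaway temp file:

```python
import os
import tempfile

# create a small file to read, so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('a,b,c\n')

inf = open(path)
try:
    fields = next(inf).split(',')   # read one line and split it
finally:
    inf.close()                     # runs even if the body raises

print(inf.closed)   # prints True
```

The with form is shorter and impossible to forget the close() on, which matters when you open many files.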

UPDATE:

To skip the first N lines of your input file you can change your code to this:

N = 100

with open('Part1.txt') as inf:
    for i, line in enumerate(inf, 1):
        if i <= N:       # line number is within the first N lines
            continue     # skip the processing
        print line       # process the line

I am using enumerate() to automatically generate line numbers. I start this counter at 1 (default is 0 if not specified).

Upvotes: 4

Diego Navarro

Reputation: 9704

import csv
import os

with open(os.path.join('C:\HashFiles','Part1.txt'), 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row
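One advantage of csv.reader over a plain line.split(',') is that it handles quoted fields that themselves contain commas. A small illustration (the sample row is made up):

```python
import csv

# csv.reader accepts any iterable of strings, not just a file object
row = next(csv.reader(['one,"two, with comma",three']))
print(row)  # prints ['one', 'two, with comma', 'three']
```

A naive split(',') on that same line would wrongly produce four fields.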

Upvotes: 1
