absolutely

Reputation: 21

Problems looking for csv values in txt file using Python

I'm new to Stack Overflow and relatively new to Python. I have tried searching the site for an answer to this question, but haven't found one related to matching values between csv and txt files.

I'm writing a simple Python script that reads a row from a large csv file (~600k lines), grabs a value from that row, assigns it to a variable, then uses the variable to try to find a matching value in a large txt file (~1.8MM lines). It's not working and I'm not sure why.

Here's a snippet from the source.csv file:

DocNo,Title,DOI
1,"Title One",10.1080/02724634.2016.1269539
2,"Title Two",10.1002/2015ja021888
3,"Title Three",10.1016/j.palaeo.2016.09.019

Here's a snippet from the lookup.txt file (note that it's separated by \t):

DOI 10.1016/j.palaeo.2016.09.019    M   First
DOI 10.1016/j.radmeas.2015.12.002   M   First
DOI 10.1097/SCS.0000000000002859    M   First

Here's the offending code:

import csv

with open('source.csv', newline='', encoding = "ISO-8859-1") as f, open('lookup.txt', 'r') as i:
    reader = csv.reader(f, dialect='excel')

    counter = 0

    for line in i:
        for row in reader:
            doi = row[2]
            doi = str(doi) # I think this might actually be redundant...

            if doi in line:
                # This will eventually do more interesting things, but right now it's just a test
                print(doi)
                break
            else:
                # This will be removed--is also just a test (so I can watch progress)
                print(counter)
                counter += 1

Currently, when it runs, it just counts the lines, even though there's a matching doi in each file.

The maddening thing is that when I give doi a hard-coded value, it executes as it should. This makes me think that either the slashes in doi are breaking things somehow, or I need to convert the data type of the doi variable.

For example, this works:

doi = "10.1016/j.palaeo.2016.09.019" 

for line in i:
    if doi in line:
        print(doi)
        break
    else:
        print(counter)
        counter += 1

Thanks in advance for your help, I'm at my wit's end!

Upvotes: 2

Views: 83

Answers (2)

Vince

Reputation: 645

Your problem is that the inner loop does not start over from the beginning on each pass of the outer loop. A file object (and a csv.reader wrapped around one) is an iterator, so it keeps going from where it was when you called break the last time. As soon as one pass finds no match, the inner iterator is consumed completely, and every later pass over it does nothing (empty loop).

You probably want to keep the lookup lines in a list, as a first step. Turning it into a lookup dict by parsing the row would likely be the next step.
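
A minimal sketch of that second step, assuming the DOI is always the second tab-separated field of each lookup line, as in the sample. The names load_lookup and find_matches are made up for illustration, and the in-memory strings stand in for the real files.

```python
import csv
import io

def load_lookup(lines):
    """Map each DOI (second tab-separated field) to its full list of fields."""
    lookup = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:  # skip blank or malformed lines
            lookup[fields[1]] = fields
    return lookup

def find_matches(csv_lines, lookup):
    """Return the values of the CSV's DOI column that appear in the lookup."""
    reader = csv.DictReader(csv_lines)
    return [row["DOI"] for row in reader if row["DOI"] in lookup]

# In-memory stand-ins for lookup.txt and source.csv (sample rows from the question).
lookup_txt = io.StringIO(
    "DOI\t10.1016/j.palaeo.2016.09.019\tM\tFirst\n"
    "DOI\t10.1016/j.radmeas.2015.12.002\tM\tFirst\n"
)
source_csv = io.StringIO(
    'DocNo,Title,DOI\n'
    '1,"Title One",10.1080/02724634.2016.1269539\n'
    '3,"Title Three",10.1016/j.palaeo.2016.09.019\n'
)

lookup = load_lookup(lookup_txt)
matches = find_matches(source_csv, lookup)
print(matches)
```

With the real files you would pass the open file objects instead of the StringIO stand-ins. Each file is then read exactly once, and the dict membership test replaces the inner loop entirely.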

Here is a demonstration of what happens:

!cat 1.txt
row1
row2
row3

!cat 2.txt
row A
row B
row C

with open('1.txt', 'r') as i, open('2.txt', 'r') as j:
    for irow in i:
        print("irow", irow.strip())
        # j is only read once: after the first pass it is exhausted
        for jrow in j:
            print("jrow", jrow.strip())

irow row1
jrow row A
jrow row B
jrow row C
irow row2
irow row3
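
For contrast, here is a sketch of the same demonstration with the inner file read into a list first, so the inner loop restarts on every pass. The helper name pair_rows is hypothetical, and io.StringIO objects stand in for 1.txt and 2.txt.

```python
import io

def pair_rows(outer, inner):
    """Pair every outer row with every inner row."""
    inner_rows = [row.strip() for row in inner]  # read the inner file once
    pairs = []
    for orow in outer:
        for irow in inner_rows:  # a list starts over on every pass
            pairs.append((orow.strip(), irow))
    return pairs

i = io.StringIO("row1\nrow2\nrow3\n")
j = io.StringIO("row A\nrow B\nrow C\n")
pairs = pair_rows(i, j)
print(len(pairs))
```

This yields all nine combinations instead of only the first three. It is still a nested loop, though, so for files of the size in the question a dict or set lookup is likely the better fit.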

Upvotes: 1

Ajax1234

Reputation: 71451

You can try this:

import csv
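
# Collect the DOI (second whitespace-separated field) of each lookup line in a set
with open('data2.txt') as lookup:
    data1 = {line.split()[1] for line in lookup if line.strip()}

# Keep the last column of every csv row whose DOI is in the lookup set
with open('data1.csv', newline='') as source:
    file_data = [row[-1] for row in csv.reader(source) if row and row[-1] in data1]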

Output from sample files provided:

['10.1016/j.palaeo.2016.09.019']

Upvotes: 0
