EmJ

Reputation: 4608

How to efficiently read the next line in a file

I have a text file as follows.

LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network
   centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC
   KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION;
   INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning;
   network centrality analysis

I want to get each complete GJ entry, including its indented continuation lines, i.e. my final output should be as follows.

[[asthma, susceptible genes, natural language processing analysis, network centrality analysis], [natural language processing, network analysis], [data mining, text mining, learning analytics, deep learning, network centrality analysis]]

I am using the following Python program.

with open(input_file, encoding="utf8") as fo:
    for line in fo:

        if line[:2].isupper():

            if line[:2] == 'GJ':
                temp_line = line[2:].strip()

                next_line = next(fo)

                if next_line[:2].isupper():
                    keywords = temp_line.split(';')
                else:
                    mykeywords = temp_line + ' ' + next_line.strip()
                    keywords = mykeywords.split(';')
                print(keywords)

However, there is an issue in the way I look ahead at the next line. As a result, my program does not output the third GJ entry (i.e. [data mining, text mining, learning analytics, deep learning, network centrality analysis]).

I am happy to provide more details if needed.

Upvotes: 0

Views: 87

Answers (2)

Kenny Ostrom

Reputation: 5871

Here's what you are trying to do; you probably could have gotten there with a little debugging.

with open(input_file, encoding="utf8") as fo:
    for line in fo:
        if line[:2].isupper():
            if line[:2] == 'GJ':
                temp_line = line[2:].strip()
                next_line = next(fo)
                temp_line += next_line.strip()
                print(temp_line.split(';'))

The problem here is that calling next(fo) yourself, instead of letting the for loop do its job, means you have to handle all of the for loop's job. So whatever you read into next_line will NOT be processed on the next loop. You will completely miss some lines of the file.
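
A minimal sketch of that effect, using an in-memory iterator of made-up lines in place of the real file (the names `lines`, `seen_by_loop` and `skipped` are illustrative only):

```python
# Illustrative only: an iterator over a list stands in for the open file.
lines = iter(["GJ a; b\n", "GJ c; d\n", "   e\n", "LA English\n"])

seen_by_loop = []
for line in lines:
    seen_by_loop.append(line.strip())
    if line[:2] == "GJ":
        # This advances the same iterator the for loop is using, so the
        # loop never sees the line consumed here. The None default avoids
        # StopIteration if the GJ line happens to be the last one.
        skipped = next(lines, None)

print(seen_by_loop)
```

The second GJ line is consumed by `next()` and never reaches the loop body, which is the same way the third GJ entry goes missing in the question.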

Instead, you always want to let the for loop handle its job.
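
One way to do that, as a rough sketch rather than the only option, is to remember the GJ record currently being assembled and flush it when the next tagged line arrives. The function names `split_keywords` and `gj_keywords` and the sample text are assumptions made for this illustration:

```python
import io

def split_keywords(text):
    # Split on ';' and drop surrounding whitespace and empty pieces.
    return [k.strip() for k in text.split(';') if k.strip()]

def gj_keywords(fo):
    """Collect the keyword list of every GJ record using only the for loop."""
    results = []
    current = None  # text of the GJ record being assembled, or None
    for line in fo:
        if line[:2].isupper():
            # A new tagged record starts, so any GJ record in progress is complete.
            if current is not None:
                results.append(split_keywords(current))
            current = line[2:].strip() if line[:2] == 'GJ' else None
        elif current is not None and line.strip():
            # Indented continuation of the GJ record in progress.
            current += ' ' + line.strip()
    if current is not None:
        results.append(split_keywords(current))  # flush the final record
    return results

# Made-up sample in the same layout as the question's file.
sample = io.StringIO(
    "LA English\n"
    "GJ asthma; susceptible genes; network\n"
    "   centrality analysis\n"
    "ID FISH-OIL; OBESITY\n"
    "GJ natural language processing; network analysis\n"
    "GJ data mining; deep learning;\n"
    "   network centrality analysis\n"
)
print(gj_keywords(sample))
```

Because no line is read outside the loop, every line is examined exactly once, and a GJ entry that directly follows another GJ entry is not skipped.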

But what you have here is two different methods of breaking a file up. It's easier to write a record parser which finds full records, and let it read lines from the file as needed. Here is an adaptation of my other answer linked in comments:

def is_new_record(line):
    # A record starts with a two-letter uppercase tag such as 'GJ' or 'ID'.
    return line[:2].isupper()

def helper(text):
    data = []
    for line in text:
        if is_new_record(line):
            if data:
                # Join with a space so words at a line break stay separate.
                yield ' '.join(data)
            data = [line.strip()]
        else:
            data.append(line.strip())
    if data:
        yield ' '.join(data)

# the helper is a generator that yields each multiline record as one line
input_file = 'data.txt'
with open(input_file) as f:
    for record in helper(f):
        print(record)

Output:

LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION; INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning; network centrality analysis
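
The records printed above still carry their tag and are not yet split into keywords. A minimal sketch of that last step, assuming each record arrives as a single joined string; `gj_keyword_lists` and `sample_records` are hypothetical names introduced here:

```python
def gj_keyword_lists(records):
    """Return one keyword list per record whose two-letter tag is GJ."""
    result = []
    for record in records:
        if record[:2] == 'GJ':
            # Drop the tag, split on ';', and trim whitespace around each keyword.
            result.append([k.strip() for k in record[2:].split(';') if k.strip()])
    return result

# Hypothetical joined records, shaped like the generator's output.
sample_records = [
    'LA English',
    'GJ natural language processing; network analysis',
    'GJ data mining; text mining; learning analytics; deep learning; network centrality analysis',
]
print(gj_keyword_lists(sample_records))
```

Since the function accepts any iterable of strings, it could presumably be fed the generator directly, e.g. `gj_keyword_lists(helper(f))`.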

Upvotes: 2

Philip Tzou

Reputation: 6438

Let's try splitting the problem. There are two main logical steps in your code:

  1. Take each non-indented row together with the indented rows that follow it, and join them into a single "line".
  2. Keep only the lines that start with "GJ".

Here is the code:

def iter_lines(fo):
    cur_line = []
    for row in fo:
        if not row.startswith(' ') and cur_line:
            yield ' '.join(cur_line)
            cur_line = []  # reset the cache
        cur_line.append(row.strip())
    # yield the last line
    if cur_line:
        yield ' '.join(cur_line)


with open(input_file, encoding="utf8") as fo:
    for line in iter_lines(fo):
        if line.startswith('GJ'):
            keywords = [k.strip() for k in line[2:].split(';')]
            print(keywords)

Upvotes: 1
