How to efficiently read the next line in a file

Question

I have a text file as follows.

LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network
   centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC
   KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION;
   INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning;
   network centrality analysis

I want to get the full content of each GJ entry, including any indented continuation lines, i.e. my final output should be as follows.

[[asthma, susceptible genes, natural language processing analysis, network centrality analysis], [natural language processing, network analysis], [data mining, text mining, learning analytics, deep learning, network centrality analysis]]

I am using the following Python program.

with open(input_file, encoding="utf8") as fo:
    for line in fo:

        if line[:2].isupper():

            if line[:2] == 'GJ':
                temp_line = line[2:].strip()

                next_line = next(fo)

                if next_line[:2].isupper():
                    keywords = temp_line.split(';')
                else:
                    mykeywords = temp_line + ' ' + next_line.strip()
                    keywords = mykeywords.split(';')
                print(keywords)

However, there is an issue in the way I look ahead to the next line: calling next(fo) consumes that line, so the for loop never sees it. As a result, my program skips the third GJ entry (i.e. [data mining, text mining, learning analytics, deep learning, network centrality analysis]) and it never appears as an output list.

I am happy to provide more details if needed.

Philip Tzou · Accepted Answer

Let's try splitting the problem. There are two main logical steps in your code:

1. Take each non-indented row together with the indented rows that follow it, and join them into a single "line".
2. Keep only the joined lines that start with "GJ".

Here is the code:

def iter_lines(fo):
    cur_line = []
    for row in fo:
        if not row.startswith(' ') and cur_line:
            yield ' '.join(cur_line)
            cur_line = []  # reset the cache
        cur_line.append(row.strip())
    # yield the last line
    if cur_line:
        yield ' '.join(cur_line)


with open(input_file, encoding="utf8") as fo:
    for line in iter_lines(fo):
        if line.startswith('GJ'):
            keywords = [k.strip() for k in line[2:].split(';')]
            print(keywords)
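
To check the approach without a file on disk, here is a minimal sketch that feeds the sample data from the question through the same generator using io.StringIO. The names SAMPLE and extract_gj are made up for this illustration and are not part of the original code; iter_lines is repeated so the snippet runs on its own.

```python
import io

# Sample data copied from the question (SAMPLE is a hypothetical name).
SAMPLE = """\
LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network
   centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC
   KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION;
   INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning;
   network centrality analysis
"""


def iter_lines(fo):
    """Yield each non-indented row joined with its indented continuation rows."""
    cur_line = []
    for row in fo:
        # A non-indented row starts a new logical line, so flush the previous one.
        if not row.startswith(' ') and cur_line:
            yield ' '.join(cur_line)
            cur_line = []
        cur_line.append(row.strip())
    # Flush the last logical line.
    if cur_line:
        yield ' '.join(cur_line)


def extract_gj(fo):
    """Collect the keyword list of every GJ entry (hypothetical helper)."""
    result = []
    for line in iter_lines(fo):
        if line.startswith('GJ'):
            result.append([k.strip() for k in line[2:].split(';')])
    return result


keywords = extract_gj(io.StringIO(SAMPLE))
print(keywords)
```

Because the generator never calls next() on the file itself, no row is consumed out of turn, so consecutive GJ entries are all kept.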
