Reputation: 4608
I have a text file as follows.
LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network
   centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC
   KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION;
   INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning;
   network centrality analysis
I want to get every complete GJ entry, including any continuation lines, i.e. my final output should be as follows.
[[asthma, susceptible genes, natural language processing analysis, network centrality analysis], [natural language processing, network analysis], [data mining, text mining, learning analytics, deep learning, network centrality analysis]]
I am using the following Python program.
with open(input_file, encoding="utf8") as fo:
    for line in fo:
        if line[:2].isupper():
            if line[:2] == 'GJ':
                temp_line = line[2:].strip()
                next_line = next(fo)
                if next_line[:2].isupper():
                    keywords = temp_line.split(';')
                else:
                    mykeywords = temp_keywords + ' ' + next_line.strip()
                    keywords = mykeywords.split(';')
                print(keywords)
However, there is an issue in the way I look ahead at the next line. As a result, my program does not output the third GJ entry (i.e. [data mining, text mining, learning analytics, deep learning, network centrality analysis]).
I am happy to provide more details if needed.
Upvotes: 0
Views: 87
Reputation: 5871
Here's what you are trying to do; you probably could have gotten there with a little debugging.
temp_keywords = ''
mykeywords = ''
with open(input_file, encoding="utf8") as fo:
    for line in fo:
        if line[:2].isupper():
            if line[:2] == 'GJ':
                temp_line = line[2:].strip()
                next_line = next(fo)
                temp_line += next_line.strip()
                print(temp_line.split(';'))
The problem here is that calling next(fo) yourself, instead of letting the for loop do its job, means you have to handle all of the for loop's job. So whatever you read into next_line will NOT be processed on the next loop. You will completely miss some lines of the file.
Instead, you always want to let the for loop handle its job.
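The skipping effect can be seen in a small sketch. The list of strings below is a hypothetical stand-in for the lines of a file, not the actual data; the point is only that `next()` and the `for` loop pull from the same iterator.

```python
# Hypothetical stand-in for the lines of a file.
lines = iter(["GJ a", "GJ b", "GJ c", "ID d"])

seen_by_loop = []
for line in lines:
    seen_by_loop.append(line)
    if line.startswith("GJ"):
        # Peeking with next() advances the same iterator the for loop uses,
        # so the peeked line never becomes `line` in a later iteration.
        peeked = next(lines, None)

print(seen_by_loop)  # ['GJ a', 'GJ c']
```

The loop body never sees "GJ b", which is the same way the third GJ entry goes missing in the question.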
But what you have here is two different methods of breaking a file up. It's easier to write a record parser which finds full records, and let it read lines from the file as needed. Here is an adaptation of my other answer linked in comments:
def is_new_record(line):
    # a new record starts with a two-letter uppercase tag such as 'GJ'
    return line[:2].isupper()

def helper(text):
    # the helper is a generator for multiline records, as one line
    data = []
    for line in text:
        if is_new_record(line):
            if data:
                yield ' '.join(data)
            data = [line.strip()]
        else:
            data.append(line.strip())
    if data:
        yield ' '.join(data)

input_file = 'data.txt'
with open(input_file) as f:
    for record in helper(f):
        print(record)

Output:
LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION; INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning; network centrality analysis
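Note that the `line[:2].isupper()` test only works if continuation lines are indented in the file. An unindented continuation such as `KNOWLEDGE; INFORMATION-RETRIEVAL;` also starts with two uppercase letters and would be treated as a new record. A stricter check might look like the sketch below; `is_new_record_strict` is a hypothetical name, and the tag shape (two uppercase characters followed by a space) is an assumption about the file format rather than something stated in the question.

```python
import re

# Assumption: a field tag is an uppercase letter, then an uppercase letter or
# digit, then a space, as in "GJ " or "ID ".
TAG_RE = re.compile(r"[A-Z][A-Z0-9] ")

def is_new_record_strict(line):
    """Return True only when the line starts with a two-character tag and a space."""
    return TAG_RE.match(line) is not None

print(is_new_record_strict("GJ data mining; text mining"))        # True
print(is_new_record_strict("KNOWLEDGE; INFORMATION-RETRIEVAL;"))  # False
print("KNOWLEDGE; INFORMATION-RETRIEVAL;"[:2].isupper())          # True
```

A bare tag with nothing after it would not match this pattern, so the regex would need adjusting if the file contains such lines.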
Upvotes: 2
Reputation: 6438
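Let's try splitting the problem. There are two main logical steps in your code: joining each entry's continuation lines into a single line, and extracting the keywords from the GJ lines.
Here is the code: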
def iter_lines(fo):
    cur_line = []
    for row in fo:
        if not row.startswith(' ') and cur_line:
            yield ' '.join(cur_line)
            cur_line = []  # reset the cache
        cur_line.append(row.strip())
    # yield the last line
    if cur_line:
        yield ' '.join(cur_line)

with open(input_file, encoding="utf8") as fo:
    for line in iter_lines(fo):
        if line.startswith('GJ'):
            keywords = [k.strip() for k in line[2:].split(';')]
            print(keywords)
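The question asks for a single nested list rather than one printed list per entry. A minimal sketch of collecting the result that way follows. `collect_tag` is a hypothetical helper, the sample text is shortened, and the leading spaces on continuation lines are an assumption about the real file.

```python
import io

def collect_tag(lines, tag="GJ"):
    """Return one keyword list per `tag` entry, folding in indented continuation lines."""
    entries = []
    in_tag = False  # True while the most recent tagged line was a `tag` line
    for row in lines:
        if row.startswith(" "):
            # Continuation line: append it to the entry above, if we are tracking one.
            if in_tag:
                entries[-1] += " " + row.strip()
        else:
            in_tag = row.startswith(tag)
            if in_tag:
                entries.append(row[len(tag):].strip())
    # Split each joined entry on ';' and drop empty pieces.
    return [[k.strip() for k in e.split(";") if k.strip()] for e in entries]

# Shortened sample data; the three-space indent on continuation lines is assumed.
sample = io.StringIO(
    "LA English\n"
    "GJ asthma; susceptible genes; network\n"
    "   centrality analysis\n"
    "ID OBESITY\n"
    "GJ natural language processing; network analysis\n"
    "GJ data mining; deep learning;\n"
    "   network centrality analysis\n"
)
result = collect_tag(sample)
print(result)
```

Because the function accepts any iterable of lines, passing an open file object instead of the `io.StringIO` sample should work the same way.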
Upvotes: 1