Nour

Reputation: 75

Creating a table which has sentences from a paragraph each on a row with Python

I have an abstract which I've split into sentences in Python. I want to write the results to two tables. The first has the following columns: abstract ID (the file number that I extracted from my document), sentence ID (automatically generated), and the sentence itself, with one sentence per row. I would like a table that looks like this:

abstractID  SentenceID   Sentence
a9001755    0000001      Myxococcus xanthus development is regulated by (1st sentence)
a9001755    0000002      The C signal appears to be the polypeptide product (2nd sentence)

and another table, NSFClasses, with the columns abstractID and nsfOrg. How can I write the sentences (one per row) to the table and assign a sentenceID as shown above?

This is my code:

import glob;
import re;
import json
org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
    fileA = open(name,'r');
    for line in fileA:
         if line.find(fileNo)!= -1:
             file = line[14:]
         if line.find(org) != -1:
             nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name,'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n','')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "

Upvotes: 3

Views: 480

Answers (1)

Two-Bit Alchemist

Reputation: 18467

As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough of to refactor, but that feels obviously wrong. It's marked below.

import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
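            absID = line[14:].strip()  # drop the trailing newline
            print(absID)
        if line.find('NSF Org') != -1:
            print(line[14:].split())
    # End see comments
    fh.close()
    # Collapse newlines and repeated whitespace into single spaces
    concat_abstract = ' '.join(abstract.split())
    # Drop empty pieces, such as the one after the final period
    sentences = [s.strip() for s in concat_abstract.split('.') if s.strip()]
    # Number from 1 to match the expected output
    for s_id, sentence in enumerate(sentences, 1):
        # Adjust numeric width arguments to prettify table
        print('{} {} {}'.format(absID.ljust(15), '{:07d}'.format(s_id).ljust(15), sentence))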

In the marked section, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not, because the loop overwrites your variables every time a matching line occurs), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say exactly how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it, if this is in its header) rather than looping over it.
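To make the "search the whole file as one string" idea concrete, here is a minimal sketch using the re module. I'm guessing at the header layout: SAMPLE and the helper first_field are hypothetical and only assume lines of the form "Label : value", which is what the line[14:] slicing hints at. Adjust the pattern to whatever the real files look like.

```python
import re

# Hypothetical header layout (an assumption); the real award files may differ.
SAMPLE = (
    "Title       : Example award\n"
    "File        : a9001755\n"
    "NSF Org     : MCB\n"
    "Abstract    :\n"
    "    First sentence. Second sentence.\n"
)


def first_field(text, label):
    """Return the value on the first 'label : value' line, or None if absent."""
    pattern = r'^' + re.escape(label) + r'[ \t]*:[ \t]*(.*)$'
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else None


abs_id = first_field(SAMPLE, 'File')
nsf_org = first_field(SAMPLE, 'NSF Org')
print('{} {}'.format(abs_id, nsf_org))
```

Unlike the loop, this takes the first match rather than the last, and it returns None when the label is missing instead of leaving a variable undefined.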

Also, notice how I condensed your code. You store a lot of things in variables that you never use, and you collect a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which, for the reasons just described, is best avoided. In the smaller version, you can see how I substituted string literals in places where they are used only once, and called print immediately instead of storing the results for later. The result is usually more concise and easier to understand.
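My code only prints the rows. If by "tables" you mean database tables, the following is a rough sketch with the standard-library sqlite3 module, not a definitive implementation. NSFClasses and its columns come from your question; the table name Sentences, the helper store_abstract, the in-memory database, and the sample values are assumptions of mine for illustration.

```python
import sqlite3


def store_abstract(conn, abs_id, nsf_org, abstract):
    """Insert one NSFClasses row and one Sentences row per sentence."""
    text = ' '.join(abstract.split())  # collapse newlines and extra spaces
    # Naive split on '.', as in the code above; empty pieces are dropped.
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    conn.execute(
        'INSERT INTO NSFClasses (abstractID, nsfOrg) VALUES (?, ?)',
        (abs_id, nsf_org))
    conn.executemany(
        'INSERT INTO Sentences (abstractID, sentenceID, sentence) VALUES (?, ?, ?)',
        [(abs_id, '{:07d}'.format(i), s) for i, s in enumerate(sentences, 1)])
    conn.commit()


# In-memory database for illustration; pass a file path to persist the data.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Sentences (abstractID TEXT, sentenceID TEXT, sentence TEXT)')
conn.execute('CREATE TABLE NSFClasses (abstractID TEXT, nsfOrg TEXT)')

store_abstract(conn, 'a9001755', 'MCB', 'First sentence.\nSecond sentence.')

rows = conn.execute(
    'SELECT abstractID, sentenceID, sentence FROM Sentences ORDER BY sentenceID'
).fetchall()
for row in rows:
    print('{} {} {}'.format(*row))
```

The sentence ID is stored as zero-padded text so it displays as in your example. Numbering restarts at 1 for each abstract, which is my reading of "automatically generated".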

Upvotes: 1
