chaza68

Reputation: 113

Python string text file with \n separating words will not split

I was given a long .txt file that, when read, returns one long string: a large corpus of words separated by \n, as shown:

\na+\nabound\nabounds\nabundance\nabundant\naccessable\naccessible\nacclaim\nacclaimed\nacclamation\naccolade\naccolades\naccommodative\naccomodative\naccomplish\naccomplished\naccomplishment...\nworld-famous\nworth\nworth-while\nworthiness\nworthwhile\nworthy\nwow\nwowed\nwowing\nwows\nyay\nyouthful\nzeal\nzenith\nzest\nzippy\n

I need to split this string into a list of these words, but none of the commands I usually use for .csv files is working. I have tried strip(), replace(), split(), and splitlines(), and nothing will break this into a list of these words. I would be grateful for any assistance.

punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '[',']','@']
punctuation_chars2=["'", '"', ",", ".", "!",":",";",'#','[',']','@','\n']
# list of positive words to use
positive_words = []
wrd_list = []
new_list = []
with open("positive_words.txt", 'r', encoding="utf-16") as pos_f:
    for lin in pos_f:
        if lin[0] != ';' and lin[0] != '\n':
            positive_words.append(lin.strip())

    pos_wrds = positive_words[0]
    pos_wrds.strip()
    print(pos_wrds)
    for p in punctuation_chars:
        pos_wrds = pos_wrds.replace(p,"")
    print(pos_wrds)


wrd_list = pos_wrds.splitlines()
new_list = wrd_list[-1].splitlines

I would like to see a Python list with each word as a separate element: list = [a+, abound, abounds, abundance, abundant...]

Upvotes: 1

Views: 2239

Answers (2)

Rich Lysakowski PhD

Reputation: 3093

str.splitlines() works on the lines of a text file.

A text file is an ordered sequence of lines, and each line is a string terminated by "\n". In your file, though, the \n between words appears to be two literal characters (a backslash followed by n) rather than a real newline, so the whole corpus sits on a single line. That is why positive_words.append(lin.split('\\n')) works: in the Python source you must escape the backslash so that the separator is treated as a literal backslash plus n and not as the newline character "\n".
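To see the difference between the two cases, here is a minimal sketch. The sample strings are made up for illustration and are not the contents of the real file; printing repr() of what you read from your own file is a quick way to tell which case applies.

```python
# Hypothetical sample data, not the real positive_words.txt contents.
real_newlines = "a+\nabound\nabounds\n"       # contains actual newline characters
literal_backslash_n = r"a+\nabound\nabounds"  # contains a backslash followed by 'n'

# splitlines() only splits on real line boundaries.
split_real = real_newlines.splitlines()
unsplit_literal = literal_backslash_n.splitlines()

# Splitting on the escaped two-character sequence handles the literal case.
split_literal = literal_backslash_n.split("\\n")

print(split_real)       # ['a+', 'abound', 'abounds']
print(unsplit_literal)  # ['a+\\nabound\\nabounds']  (still one element)
print(split_literal)    # ['a+', 'abound', 'abounds']
```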

'''
print('\na+\nabound\nabounds\nabundance\nabundant\naccessable\naccessible\nacclaim\nacclaimed\nacclamation\naccolade\naccolades\naccommodative\naccomodative\naccomplish\naccomplished\naccomplishment...\nworld-famous\nworth\nworth-while\nworthiness\nworthwhile\nworthy\nwow\nwowed\nwowing\nwows\nyay\nyouthful\nzeal\nzenith\nzest\nzippy\n')
'''

# punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '[',']','@']
# punctuation_chars2=["'", '"', ",", ".", "!",":",";",'#','[',']','@','\n']

# list of positive words to use
positive_words = []
wrd_list = []
new_list = []
with open("positive_words.txt", 'r', encoding="utf-8") as pos_f:
    for lin in pos_f:
        positive_words.append(lin.split('\\n'))

    pos_wrds = positive_words[0]

print(pos_wrds)

#    for p in punctuation_chars:
#        pos_wrds = pos_wrds.replace(p,"----")
#    print(pos_wrds)

# wrd_list = pos_wrds.splitlines(0)
# new_list = wrd_list[-1].splitlines()

Your last 6 lines (commented out above) need to be modified: in this version pos_wrds is a list, so calling string methods such as replace() or splitlines() on it raises an AttributeError.

You need to test for punctuation and non-alphanumeric characters explicitly, because your file has punctuation in one element "accomplishment..." and "a+" in another.

Test each item in the pos_wrds list separately as a string. Also, your punctuation list includes "\n", which is a control (whitespace) character rather than punctuation, and "@", which is a special symbol (although Python's string.punctuation does include it).

If you really need to test for punctuation, use Python's string module and check characters against the string.punctuation character set.

See "Best way to strip punctuation from a string in Python" for more information on the string module. It is awesomely powerful!
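As a rough sketch of what that could look like, assuming you want to clean each word after splitting: the word list below is a made-up sample, and which of the two approaches fits depends on whether hyphenated words like "world-famous" should keep their hyphen.

```python
import string

# Hypothetical sample of already-split words.
words = ["a+", "abound", "accomplishment...", "world-famous"]

# Option 1: strip punctuation only from the ends of each word.
trimmed = [w.strip(string.punctuation) for w in words]

# Option 2: remove every punctuation character, including internal ones.
table = str.maketrans("", "", string.punctuation)
stripped = [w.translate(table) for w in words]

print(trimmed)   # ['a', 'abound', 'accomplishment', 'world-famous']
print(stripped)  # ['a', 'abound', 'accomplishment', 'worldfamous']
```

Note that both options turn "a+" into "a", so skip this step if the punctuation is part of the word you want to match.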

Upvotes: 0

Prashanti

Reputation: 173

splitlines works pretty well:

In [1]: text = "\na+\nabound\nabounds\nabundance\nabundant\naccessable\naccessible\nacclaim\nacclaimed\nacclamation\naccolade\naccolades\naccommodative\naccomodative\naccomplish\naccomplished\naccomplishment...\nworld-famous\nworth\nworth-while\nworthiness\nworthwhile\nworthy\nwow\nwowed\nwowing\nwows\nyay\nyouthful\nzeal\nzenith\nzest\nzippy\n"

In [2]: text.splitlines()
Out[2]: 
['',
 'a+',
 'abound',
 'abounds',
 'abundance',
 'abundant',
 'accessable',
 'accessible',
 'acclaim',
 'acclaimed',
 'acclamation',
 'accolade',
 'accolades',
 'accommodative',
 'accomodative',
 'accomplish',
 'accomplished',
 'accomplishment...',
 'world-famous',
 'worth',
 'worth-while',
 'worthiness',
 'worthwhile',
 'worthy',
 'wow',
 'wowed',
 'wowing',
 'wows',
 'yay',
 'youthful',
 'zeal',
 'zenith',
 'zest',
 'zippy']
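Applied to a file, the same idea might look like the sketch below. It assumes the file really contains newline characters; load_words is a hypothetical helper name, and the demo writes its own small temporary file rather than using your positive_words.txt. It also drops blank entries such as the leading '' in the output above. Pass the encoding your file actually uses (the question opens it as utf-16).

```python
import os
import tempfile

def load_words(path, encoding="utf-8"):
    """Read a whole text file and return its non-blank lines as a list."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    # splitlines() breaks on real line boundaries; skip empty/whitespace-only lines.
    return [line.strip() for line in text.splitlines() if line.strip()]

# Demo with a small temporary file (sample contents, not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "positive_words.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\na+\nabound\nabounds\n\nzippy\n")
    words = load_words(path)

print(words)  # ['a+', 'abound', 'abounds', 'zippy']
```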

Upvotes: 2
