Reputation: 1

Data structure question for project in Python

For my upcoming project, I am supposed to take a text file that has been scrambled and unscramble it into a specific format.

Each line in the scrambled file contains a line of the original text, its line number, and a three-letter code that identifies the work. These items are separated by the | character. For example,

it ran away when it saw mine coming!"|164|ALC
cried to the man who trundled the barrow; "bring up alongside and help|27|TRI
"Of course he's stuffed," replied Dorothy, who was still angry.|46|WOO

My task is to write a program that reads each line in the text file, separates and unscrambles the lines, and collects some basic data about each work. For each work, I have to determine

its longest line (and the corresponding line number),
its shortest line (and corresponding line number), and
the average length of the lines in the entire work.

The summaries should be sorted by three-letter code and should be formatted as follows:

ALC

Longest Line (107): "No, I didn’t," said Alice: "I don’t think it’s at all a pity. I said
Shortest Line (148): to."
Average Length: 59

WOO

Longest Line (66): of my way. Whenever I’ve met a man I’ve been awfully scared; but I just
Shortest Line (71): go."
Average Length: 58

Then I have to create a second file that contains the three-letter code for each work followed by its text. All lines must be included, in order, without line numbers or three-letter codes. Works should be separated by a line of five dashes. The result should look like the following:

ALC

A large rose-tree stood near the entrance of the garden: the roses growing on it were white, but there were three gardeners at it, busily painting them red. Alice thought this a very curious thing, and she went nearer to watch them, and just as she came up to them she heard one of them say, "Look out now, Five! Don’t go splashing paint over me like that!" "I couldn’t help it," said Five, in a sulky tone; "Seven jogged my elbow." On which Seven looked up and said, "That’s right, Five! Always lay the blame on others!"

-----

TRI

SQUIRE TRELAWNEY, Dr. Livesey, and the rest of these gentlemen having asked me to write down the whole particulars about Treasure Island, from the beginning to the end, keeping nothing back but the bearings of the island, and that only because there is still treasure not yet lifted, I take up my pen in the year of grace 17__ and go back to the time when my father kept the Admiral Benbow inn and the brown old seaman with the sabre cut first took up his lodging under our roof. I remember him as if it were yesterday, as he came plodding to the inn door, his sea-chest following behind him in a hand-barrow--a tall, strong, heavy, nut-brown man, his tarry pigtail falling over the

-----

WOO

All this time Dorothy and her companions had been walking through the thick woods. The road was still paved with yellow brick, but these were much covered by dried branches and dead leaves from the trees, and the walking was not at all good. There were few birds in this part of the forest, for birds love the open country where there is plenty of sunshine. But now and then there came a deep growl from some wild animal hidden among the trees. These sounds made the little girl’s heart beat fast, for she did not know what made them; but Toto knew, and he walked close to Dorothy’s side, and did not even bark in return.

My question is: which list tools or other data structures in Python would be best suited to this project, where I have to move lines of text around and put them back in order? I would greatly appreciate some advice or help with the code.

@Gesuchter

The code you posted works, except when the program tries to find the data for the summaries file it only brings back:

TTL
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0

WOO
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0

ALG
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0

Can you see anything that would cause the average and the shortest line (or even the longest line) to print incorrectly?

Here is a download link to the starting text file. https://drive.google.com/file/d/1Dwnk0ziqovEEuaC7r7YzZdkI5_bh7wvG/view?usp=sharing

EDIT*****

It is working properly now, but is there a way to change the code so it outputs the line number where the longest and shortest lines are found, instead of the character count?

TTL
Longest Line (82): *** END OF THE PROJECT GUTENBERG EBOOK TWENTY THOUSAND LEAGUES UNDER THE SEAS  *** 
Shortest Line (1): N 
Average Length: 58

WOO
Longest Line (74): Section 5. General Information About Project Gutenberg-tm electronic works 
Shortest Line (3): it. 
Average Length: 58

ALG
Longest Line (76): 2. Alice through Q.’s 3d (_by railwayv) to 4th (_Tweedledum and Tweedledee_) 
Shortest Line (1): 1 
Average Length: 54

Above, next to longest line it has (76) because it's the character length in the sentence, but is there a way to have it be the line number instead?

EDIT****

It looks like my summary and unscrambled files are not coming out in alphabetical order. Is there a way to make them come out alphabetically instead?

Upvotes: 0

Answers (2)

Gesuchter

Reputation: 587

If you aren't allowed to use any (3rd party) imports, this approach might help you:

First of all, we need to parse your scrambled file, where

Each line [...] contains the line in the text file, a line number, and a three-letter code that identifies the work. Each of these items is separated by the | character.

INPUT_FILE = "text.txt"
SUMMARIES_FILE = "summaries.txt"
UNSCRAMBLED_FILE = "unscrambled.txt"

books = {}

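with open(INPUT_FILE, "r") as f:
    for l in f:
        l = l.strip()
        if not l:
            continue  # skip empty lines
        # split from the right, so a "|" inside the text itself cannot break the unpacking
        text, line, book = l.rsplit("|", 2)

        # collect (line number, text) tuples per book
        books.setdefault(book, []).append((line, text))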

The dictionary books will now look like this:

{
 'ALC': [('164', 'it ran away when it saw mine coming!"')],
 'TRI': [('27', 'cried to the man who trundled the barrow; "bring up alongside and help')],
 'WOO': [('46', '"Of course he\'s stuffed," replied Dorothy, who was still angry.')]
}
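
As a small variation on the parsing step, here is a sketch only: the helper name parse_line and the variable books_by_code are hypothetical and not part of the code above. It converts the line number to int right away and uses collections.defaultdict from the standard library, assuming every non-empty line ends in |<number>|<code>.

```python
from collections import defaultdict


def parse_line(raw):
    """Split 'text|line_number|code' into (text, line_number, code).

    Hypothetical helper; assumes the last two '|'-separated fields are
    the line number and the three-letter code.
    """
    text, line, book = raw.rstrip("\n").rsplit("|", 2)
    return text, int(line), book


# The three sample lines from the question.
sample = [
    'it ran away when it saw mine coming!"|164|ALC\n',
    'cried to the man who trundled the barrow; "bring up alongside and help|27|TRI\n',
    '"Of course he\'s stuffed," replied Dorothy, who was still angry.|46|WOO\n',
]

books_by_code = defaultdict(list)  # code -> list of (line_number, text)
for raw in sample:
    text, line, book = parse_line(raw)
    books_by_code[book].append((line, text))

print(books_by_code["ALC"])
```

With integer line numbers, a plain sorted(books_by_code[code]) already orders the lines of a work, because tuples compare by their first element.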

Now we can proceed to the processing of each line (please notice the comments in the code):

with open(SUMMARIES_FILE, "w") as summaries_file, open(UNSCRAMBLED_FILE, "w") as unscrambled_file:
    
    summary = ""
    unscrambled = ""
    
    # iterate over all books, sorted alphabetically by their three-letter code
    for book, texts in sorted(books.items()):
    
        # sort the lines by line number
        texts = sorted(texts, key=lambda k: int(k[0]))

        unscrambled += f"{book}\n"
        
        total_len = 0
        longest = shortest = None
        
        # iterate over all (sorted) lines of the book
        for line, text in texts:
            
            # keep each line on its own row in the unscrambled file
            unscrambled += f"{text}\n"
            
            length = len(text)

            # keep track of the total length of all lines (we need that to calculate the average)
            total_len += length
            
            # check whether the current line is the longest one yet
            if longest is None or length > len(longest[1]):
                longest = (line, text)
           
            # check whether the current line is the shortest one yet
            if shortest is None or length < len(shortest[1]):
                shortest = (line, text)

        unscrambled += "-----\n"

        # calculate the average length of lines
        average_len = total_len // len(texts)

        # the number in parentheses is the line number, not the character count
        summary += f"{book}\n" \
                   f"Longest Line ({longest[0]}): {longest[1]}\n" \
                   f"Shortest Line ({shortest[0]}): {shortest[1]}\n" \
                   f"Average Length: {average_len}\n\n"

    # write results to the appropriate files
    summaries_file.write(summary)
    unscrambled_file.write(unscrambled)

summaries.txt will contain:

ALC
Longest Line (164): it ran away when it saw mine coming!"
Shortest Line (164): it ran away when it saw mine coming!"
Average Length: 37

TRI
Longest Line (27): cried to the man who trundled the barrow; "bring up alongside and help
Shortest Line (27): cried to the man who trundled the barrow; "bring up alongside and help
Average Length: 70

WOO
Longest Line (46): "Of course he's stuffed," replied Dorothy, who was still angry.
Shortest Line (46): "Of course he's stuffed," replied Dorothy, who was still angry.
Average Length: 63

unscrambled.txt will contain:

ALC
it ran away when it saw mine coming!"
-----
TRI
cried to the man who trundled the barrow; "bring up alongside and help
-----
WOO
"Of course he's stuffed," replied Dorothy, who was still angry.
-----

However, this solution might not be as efficient as using pandas.
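
Regarding the follow-up questions (line number instead of character count, alphabetical order), the per-book statistics could also be sketched with the built-ins max, min and sorted. The helper name summarize and the demo_books data are made up for illustration; the tuples have the same (line number, text) shape as in the books dictionary.

```python
def summarize(texts):
    """Return (longest, shortest, average_length) for one work.

    `texts` is a list of (line_number, text) tuples; `summarize` is a
    hypothetical helper name used only for this sketch.
    """
    longest = max(texts, key=lambda pair: len(pair[1]))
    shortest = min(texts, key=lambda pair: len(pair[1]))
    average = sum(len(text) for _, text in texts) // len(texts)
    return longest, shortest, average


# Made-up sample data in the same shape as the `books` dictionary.
demo_books = {
    "WOO": [("2", "go."), ("1", "a longer line of text")],
    "ALC": [("7", "to."), ("3", "the longest line in this tiny sample")],
}

for code in sorted(demo_books):  # alphabetical by three-letter code
    longest, shortest, average = summarize(demo_books[code])
    print(code)
    # longest[0] / shortest[0] are the line numbers, not the lengths
    print(f"Longest Line ({longest[0]}): {longest[1]}")
    print(f"Shortest Line ({shortest[0]}): {shortest[1]}")
    print(f"Average Length: {average}\n")
```

If several lines share the same length, max and min return the first one they encounter.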

Upvotes: 0

RJ Adriaansen

Reputation: 9649

I suggest using pandas for this. You can load your data as a dataframe with read_csv once you've added a newline character at the right positions, which can be done with regex. Passing quoting=csv.QUOTE_NONE keeps the double quotes in the text instead of letting the parser interpret them as CSV quoting characters:

import csv
import io
import re

import pandas as pd

data = '''it ran away when it saw mine coming!"|164|ALC cried to the man who trundled the barrow; "bring up alongside and help|27|TRI "Of course he's stuffed," replied Dorothy, who was still angry.|46|WOO'''
data = re.sub(r'(?<=\|[A-Z]{3})\s', '\n', data) # replace whitespace after a "|" plus three capital letters with a newline character
df = pd.read_csv(io.StringIO(data), sep='|', names=['text', 'line', 'book'], quoting=csv.QUOTE_NONE)

This will output the following dataframe:

	text	line	book
0	it ran away when it saw mine coming!"	164	ALC
1	cried to the man who trundled the barrow; "bring up alongside and help	27	TRI
2	"Of course he's stuffed," replied Dorothy, who was still angry.	46	WOO

Now you can process the data as you like. For example by getting the number of characters in the lines and printing the desired statistics (here over the whole dataframe, not per book):

df['length'] = df['text'].str.len()
print('longest string:', df[df['length']==df['length'].max()])
print('shortest string:', df[df['length']==df['length'].min()])
print('average string length:', df['length'].mean())

Or get the full texts of the books by sorting by line number, grouping the data by book and joining the lines per book:

full_texts = df.sort_values(['line']).groupby('book', as_index = False).agg({'text': ' '.join})
print('\n\n-----\n\n'.join(full_texts['book']+' '+full_texts['text']))

Result:

ALC it ran away when it saw mine coming!"

-----

TRI cried to the man who trundled the barrow; "bring up alongside and help

-----

WOO "Of course he's stuffed," replied Dorothy, who was still angry.
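
Since the task asks for the statistics per work rather than over the whole file, a possible sketch uses groupby together with idxmax and idxmin. The four sample rows are a small illustrative input, and quoting=csv.QUOTE_NONE is used here so that double quotes inside the text are kept literally.

```python
import csv
import io

import pandas as pd

# Small illustrative input, one record per line.
data = (
    'it ran away when it saw mine coming!"|164|ALC\n'
    "to.|148|ALC\n"
    '"Of course he\'s stuffed," replied Dorothy, who was still angry.|46|WOO\n'
    "go.|71|WOO\n"
)
df = pd.read_csv(
    io.StringIO(data),
    sep="|",
    names=["text", "line", "book"],
    quoting=csv.QUOTE_NONE,  # keep double quotes in the text as-is
)
df["length"] = df["text"].str.len()

grouped = df.groupby("book")["length"]
# idxmax/idxmin give the row label of the longest/shortest line per book
longest = df.loc[grouped.idxmax(), ["book", "line", "text"]]
shortest = df.loc[grouped.idxmin(), ["book", "line", "text"]]
average = grouped.mean()

print(longest)
print(shortest)
print(average)
```

Because groupby sorts its keys by default, the results come out ordered alphabetically by three-letter code.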

Upvotes: 2
