David Frost

Reputation: 71

Google Colab Paragraph Removal

I need help removing paragraphs from this text file (https://www.gutenberg.org/files/768/768.txt) in Google Colab. The text should start after “[email protected]” and end before “END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS” so that I get an accurate total word count. The code I have so far is listed below.

# download and install pyspark in Colab
!pip install -q pyspark

# download Wuthering Heights, by Emily Bronte
!wget -q https://www.gutenberg.org/files/768/768.txt

import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('/content/768.txt')
fileName = os.path.join(baseDir, inputPath)
with open('/content/768.txt','r') as f:
    print(f.read())

Upvotes: 2

Views: 364

Answers (2)

Lewis Morris

Reputation: 2124

Just slice the string at the points where you find the text you are looking for.

!wget -q https://www.gutenberg.org/files/768/768.txt
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('768.txt')
fileName = os.path.join(baseDir, inputPath)
with open('768.txt','r') as f:
    text = f.read()
    
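# Get the start location: the index just past the start marker
start_loc = text.find("[email protected]") + len("[email protected]")
# Get the end location, relative to start_loc
end_loc = text[start_loc:].find("***")
# Slice the text between the two indexes; replace newlines with spaces
# (not empty strings) so words on adjacent lines are not joined together
text[start_loc:start_loc+end_loc].replace("\n", " ")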
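
To get the word count the question asks for, the same slicing idea could be wrapped in a small helper. This is only a minimal sketch: `extract_between`, the `START`/`END` markers, and the sample string are hypothetical stand-ins for illustration, not the real file contents.

```python
def extract_between(text, start_marker, end_marker):
    """Return the part of text after start_marker and before end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    start += len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found")
    return text[start:end]


# Hypothetical sample standing in for the downloaded file
sample = "header\nSTART\nI have just returned\nfrom a visit.\nEND footer"
body = extract_between(sample, "START", "END")

# str.split() with no arguments splits on any whitespace, including newlines
word_count = len(body.split())
print(word_count)
```

With the real file, pass the two markers from the question instead. Because `split()` already treats line breaks as separators, no `replace` call is needed before counting.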

Upvotes: 1

pacuna

Reputation: 2089

You can use a regex to extract the text between two strings:

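import re

with open('768.txt','r') as f:
    text = f.read()

start = "[email protected]"
end = "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS"

# re.escape matches the markers literally; re.DOTALL lets "." match newlines
# (an inline (?s) in the middle of a pattern is an error in Python 3.11+)
m = re.search(re.escape(start) + r'(.*)' + re.escape(end), text, re.DOTALL)
print(m.group(1))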

Upvotes: 0
