Reputation: 71
I need help removing paragraphs from this text file (https://www.gutenberg.org/files/768/768.txt) in Google Colab. I need the text to start after “[email protected]” and end before “END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS” so that I get an accurate total word count. Below is the code I have so far.
# download and installing pyspark in colab
!pip install -q pyspark
# download Wuthering Heights, by Emily Bronte
!wget -q https://www.gutenberg.org/files/768/768.txt
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('/content/768.txt')
fileName = os.path.join(baseDir, inputPath)
with open('/content/768.txt','r') as f:
    print(f.read())
Upvotes: 2
Views: 364
Reputation: 2124
Just slice the string at the points where you find the text you are looking for.
!wget -q https://www.gutenberg.org/files/768/768.txt
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('768.txt')
fileName = os.path.join(baseDir, inputPath)
with open('768.txt','r') as f:
    text = f.read()
#GET START LOC
start_loc = text.find("[email protected]") + len("[email protected]")
#GET END LOC
end_loc = text[start_loc:].find("***")
#SLICE THE TEXT STRING AT THE INDEXES (NEWLINES BECOME SPACES SO WORDS DON'T MERGE)
text[start_loc:start_loc+end_loc].replace("\n"," ")
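Since the goal is a word count, here is a minimal sketch of the same slicing idea wrapped in a function, followed by a count. The function name `extract_between`, the sample string, and the markers `START-MARKER` and `*** END` are made up for illustration; swap in the real file contents and markers.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    start += len(start_marker)           # skip past the marker itself
    end = text.find(end_marker, start)   # search only after the start marker
    if end == -1:
        raise ValueError("end marker not found")
    return text[start:end]

# Hypothetical stand-in for the downloaded file contents
sample = (
    "header junk\nSTART-MARKER\n"
    "I have just returned\nfrom a visit.\n"
    "*** END OF BOOK ***\nlicense text"
)

body = extract_between(sample, "START-MARKER", "*** END")
word_count = len(body.split())   # split() treats newlines and spaces alike
print(word_count)
```

Using `split()` with no arguments means there is no need to strip newlines first, and checking for `-1` avoids silently slicing from the wrong place when a marker is missing.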
Upvotes: 1
Reputation: 2089
You can use a regex to extract the text between two strings:
import re
text = open('768.txt','r').read()
start = "[email protected]"
end = "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS"
m = re.search(re.escape(start) + r'(.*)' + re.escape(end), text, flags=re.DOTALL)
print(m.group(1))
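One thing to watch with the regex approach: markers containing characters such as `[`, `]`, or `.` are treated as regex syntax unless escaped, and `.` does not match newlines unless `re.DOTALL` is set. A small sketch of a helper that handles both, with a hypothetical function name (`extract_with_regex`) and made-up sample text and markers:

```python
import re

def extract_with_regex(text, start_marker, end_marker):
    """Return the text between the markers, or None if there is no match."""
    # re.escape makes the markers match literally, brackets and dots included
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    # re.DOTALL lets '.' match newlines so the capture can span many lines
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None

# Hypothetical stand-in for the downloaded file contents
sample = "intro [contact] line\nChapter I\nSome text.\nEND OF THE BOOK\ntrailer"

body = extract_with_regex(sample, "[contact]", "END OF THE BOOK")
print(len(body.split()))
```

The non-greedy `(.*?)` stops at the first occurrence of the end marker; with a greedy `(.*)` it would run to the last occurrence instead, which only matters if the end marker appears more than once.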
Upvotes: 0