Reputation: 441
I am using Python 3.7 and have a test.txt file that looks like this:
<P align="left">
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$&nbsp;and
$&nbsp;per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
“ ”.
</FONT>
I need to extract everything from "be between" (row 4) up to and including "per share" (row 7). Here is the code I run:
price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
print(price)
This prints:
['of our common stock is expected to be between']
I first locate "be between" and append that line, but everything that comes next is cut off because it sits on the following lines.
My desired output would be:
['of our common stock is expected to be between $ and $ per share']
How can I do it? Thank you very much in advance.
Upvotes: 1
Views: 122
Reputation: 485
This will also work:
import re

price = []
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace('&nbsp;', ''))  # strip the newline and any &nbsp; entities
text_file = " ".join(price)
be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)
Output:
"be between $and $per share"
Upvotes: 0
Reputation: 1
Here is another simple solution:
It collects all lines into one long string, finds the starting index of 'be between' and the ending index of 'per share', and then takes the corresponding substring.
from re import search

price = []
with open("test.txt", 'r') as f:
    one_line_txt = ''.join(f.readlines()).replace('\n', ' ').replace('&nbsp;', '')
start_index = search('be between', one_line_txt).span()[0]
end_index = search('per share', one_line_txt).span()[1]
# list.append() returns None, so append first and print the list afterwards
price.append(one_line_txt[start_index:end_index])
print(price)
Outputs:
['be between $and $per share']
Upvotes: 0
Reputation: 31
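A dirty way of doing it, relying on hard-coded line numbers:
price = []
with open("test.txt", 'r') as f:
    for i, line in enumerate(f):
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
        if i > 3 and i <= 6:  # hard-coded: the three lines after "be between"
            price.append(line.rstrip().replace('&nbsp;', ''))
# join the pieces and drop everything from the first full stop onwards
print([" ".join(price).split('.')[0]])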
Upvotes: 0
Reputation: 4618
This also works:
import re

with open('test.txt', 'r') as f:
    txt = f.read()
start = re.search('\n(.*?)be between\n', txt)
end = re.search('per(.*?)share', txt, re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace('&nbsp;', '').replace('\n', '').replace('and', ' and ')
print(['{} {} {}'.format(start.group().replace('\n', ''), output, end.group().replace('\n', ' '))])
output:
['of our common stock is expected to be between $ and $ per share']
Upvotes: 0
Reputation: 92854
The right way, using html.unescape and re.search:
import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())
    m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
    if m:
        price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))
print(price_texts)
The output:
[' of our common stock is expected to be between $ and $ per share']
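For anyone who wants to try the unescape idea without a file on disk, here is a minimal sketch that works on an in-memory string. The SAMPLE text and the function name extract_price_phrase are made up for illustration, and I am assuming the file really contains literal &nbsp; entities. It relies on html.unescape turning &nbsp; into U+00A0, which str.split() treats as whitespace, and on a non-greedy match so it stops at the first "per share".

```python
import re
from html import unescape

# Hypothetical sample mirroring test.txt; the &nbsp; entities are an assumption.
SAMPLE = (
    '<P align="left">\n'
    '<FONT size="2">Prior to this offering, there has been no public\n'
    'market for our common stock. The initial public offering price\n'
    'of our common stock is expected to be between\n'
    '$&nbsp;and\n'
    '$&nbsp;per\n'
    'share. We intend to list our common stock on the Nasdaq National\n'
    'Market under the symbol\n'
    '</FONT>\n'
)


def extract_price_phrase(raw):
    """Return the text from 'be between' through 'per share', or None."""
    # unescape() turns &nbsp; into U+00A0; split()/join() then collapses
    # every run of whitespace (newlines and U+00A0 included) into one space.
    flat = " ".join(unescape(raw).split())
    # Non-greedy, so the match ends at the first 'per share' after the start.
    m = re.search(r"be between.*?per share", flat)
    return m.group(0) if m else None


print([extract_price_phrase(SAMPLE)])
```

Returning None when either phrase is missing avoids the AttributeError you would get from calling .span() or .group() on a failed search.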
Upvotes: 2
Reputation: 44828
You need to decide when to append a line to price:
price = []
is_capturing = False
is_inside_per_share = False
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line and "per share" in line:
            # both markers on the same line: take the line as is
            price.append(line)
            is_capturing = False
        elif "be between" in line:
            is_capturing = True
        elif "per share" in line:
            # CAUTION: possible off-by-one error
            price.append(line[:line.find('per share') + len('per share')].rstrip().replace('&nbsp;', ''))
            is_capturing = False
            is_inside_per_share = False
        elif line.strip().endswith("per"):
            # "per" ends this line, so "share" may start the next one
            is_inside_per_share = True
        elif line.strip().startswith("share") and is_inside_per_share:
            # CAUTION: possible off-by-one error
            price.append(line[:line.find('share') + len('share')].rstrip().replace('&nbsp;', ''))
            is_inside_per_share = False
            is_capturing = False
        if is_capturing:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
This is just a sketch, so you'll probably need to tweak it a little bit.
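If it helps, the same "capture while a flag is set" idea can be folded into one small function. This is only a sketch under a few assumptions: capture_between is a name I made up, it takes any iterable of lines (an open file works too), and it treats a literal &nbsp; as a space. Instead of tracking "per" and "share" separately, it re-joins what has been captured so far and checks whether the end phrase has appeared, so a phrase split across lines is still found.

```python
def capture_between(lines, start="be between", end="per share"):
    """Collect text from `start` through `end`, even if `end` spans two lines."""
    captured = []
    capturing = False
    for line in lines:
        # Assumption: the input may contain literal &nbsp; entities.
        piece = line.strip().replace("&nbsp;", " ")
        if not capturing:
            pos = piece.find(start)
            if pos == -1:
                continue  # still before the start phrase
            capturing = True
            piece = piece[pos:]  # drop whatever precedes the start phrase
        captured.append(piece)
        # Join everything captured so far and normalise the whitespace.
        joined = " ".join(" ".join(captured).split())
        stop = joined.find(end)
        if stop != -1:
            return joined[: stop + len(end)]
    return None  # start or end phrase never showed up


sample_lines = [
    "of our common stock is expected to be between\n",
    "$&nbsp;and\n",
    "$&nbsp;per\n",
    "share. We intend to list our common stock on the Nasdaq National\n",
]
print([capture_between(sample_lines)])
```

Re-joining on every line is wasteful for large files, but it keeps the logic short; for a real document you would probably search only the tail of the captured text.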
Upvotes: 0