martins

Reputation: 441

Python extract text with line cuts

I am using Python 3.7 and have a test.txt file that looks like this:

<P align="left">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and
$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
&#147;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#148;.
</FONT>

I need to extract everything from "be between" (row 4) through "per share" (which is split across rows 6 and 7). Here is the code I run:

price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;','')) #remove '\n' and '&nbsp;'
print(price)

Output:

['of our common stock is expected to be between']

I first locate the line containing "be between" and append it, but everything after that is cut off because it sits on the following lines.

My desired output would be:

['of our common stock is expected to be between $ and $ per share']

How can I do it? Thank you very much in advance.

Upvotes: 1

Views: 122

Answers (6)

Tyrion

Reputation: 485

This will also work:

import re

price = []    
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace('&nbsp;',''))
text_file = " ".join(price)

be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)

Output:

be between $and $per share

Upvotes: 0

Mirko Drazic

Reputation: 1

Here is another simple solution. It joins all lines into one long string, finds the start index of 'be between' and the end index of 'per share', and then takes the substring between them.

    from re import search

    price = []
    with open("test.txt", 'r') as f:
        # Read the whole file as one line and drop the &nbsp; entities
        one_line_txt = f.read().replace('\n', ' ').replace('&nbsp;', '')
    start_index = search('be between', one_line_txt).span()[0]
    end_index = search('per share', one_line_txt).span()[1]
    # list.append returns None, so append first and print the list afterwards
    price.append(one_line_txt[start_index:end_index])
    print(price)

Outputs:

['be between $and $per share']
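The join-then-slice idea described above can be sketched a little more defensively. This is only an illustration under a few assumptions: `extract_between` is a hypothetical helper name, the sample text is a shortened copy of the question's file, and each `&nbsp;` is replaced with a space (rather than deleted) so the result reads "$ and $ per share". It uses `str.find`, which returns -1 when a marker is missing, instead of calling `.span()` on a failed `re.search`.

```python
import re

def extract_between(text, start_marker, end_marker):
    """Return the text from start_marker through end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return None
    return text[start:end + len(end_marker)]

# Sample text shaped like the question's file (entity runs shortened).
raw = (
    "of our common stock is expected to be between\n"
    "$&nbsp;&nbsp;&nbsp;and\n"
    "$&nbsp;&nbsp;per\n"
    "share. We intend to list our common stock on the Nasdaq National\n"
)

# Replace each &nbsp; with a space (instead of deleting it), then collapse
# every run of whitespace, newlines included, into one space.
normalized = re.sub(r"\s+", " ", raw.replace("&nbsp;", " "))

print(extract_between(normalized, "be between", "per share"))
# be between $ and $ per share
print(extract_between(normalized, "be between", "per unit"))
# None
```

Returning None for a missing marker lets the caller decide what to do, whereas `re.search(...).span()` would raise an AttributeError in that case.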

Upvotes: 0

RR80

Reputation: 31

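Dirty way of doing it:

    price = []
    with open("test.txt", 'r') as f:
        for i, line in enumerate(f):
            if "be between" in line:
                price.append(line.rstrip().replace('&nbsp;','')) #remove '\n' and '&nbsp;'
            # hard-coded: also take the three lines after "be between" (0-based indices 4 to 6)
            if i > 3 and i <= 6:
                price.append(line.rstrip().replace('&nbsp;',''))
    # join the pieces and keep only the text before the first '.'
    print([' '.join(price).split('.')[0]])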

Upvotes: 0

Derek Eden

Reputation: 4618

This also works:

import re

with open('test.txt','r') as f:
   txt = f.read()

start = re.search('\n(.*?)be between\n',txt)
end = re.search('per(.*?)share',txt,re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace('&nbsp;','').replace('\n','').replace('and',' and ')
print(['{} {} {}'.format(start.group().replace('\n',''),output,end.group().replace('\n', ' '))])

Output:

['of our common stock is expected to be between $ and $ per share']

Upvotes: 0

RomanPerekhrest

Reputation: 92854

The right way, using html.unescape and re.search:

import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())
    m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
    if m:
        price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))

print(price_texts)

The output:

[' of our common stock is expected to be between $ and $ per share']
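To see why decoding the entities first makes the whitespace handling work, here is a small sketch on a shortened sample string (the sample is an assumption, not the full file). `html.unescape` turns each `&nbsp;` into the single character U+00A0, and in Python 3 the `\s` class in a str pattern matches that character. The sketch uses `\s+` for brevity, which is a slight variation on the `\s{2,}|\n` pattern above.

```python
import re
from html import unescape

# Shortened sample in the shape of the question's file.
raw = "be between\n$&nbsp;&nbsp;&nbsp;and\n$&nbsp;&nbsp;per\nshare."
decoded = unescape(raw)

# Each &nbsp; has become one no-break space character (U+00A0).
print(decoded.count("\xa0"))  # 5

# \s matches newlines and U+00A0 alike, so one substitution
# collapses both kinds of whitespace into single spaces.
collapsed = re.sub(r"\s+", " ", decoded)
print(collapsed)  # be between $ and $ per share.
```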

Upvotes: 2

ForceBru

Reputation: 44828

You need to decide when to append a line to price. The loop below goes inside your with block, with price = [] defined beforehand as in your code:

is_capturing = False
is_inside_per_share = False
for line in f:
    if "be between" in line and "per share" in line:
        price.append(line)
        is_capturing = False
    elif "be between" in line:
        is_capturing = True
    elif "per share" in line:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('per share') + len('per share')].rstrip().replace('&nbsp;',''))
        is_capturing = False
        is_inside_per_share = False
    elif line.strip().endswith("per"):
        is_inside_per_share = True
    elif line.strip().startswith("share") and is_inside_per_share:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('share') + len('share')].rstrip().replace('&nbsp;',''))
        is_inside_per_share = False
        is_capturing = False

    if is_capturing:
        price.append(line.rstrip().replace('&nbsp;','')) #remove '\n' and '&nbsp;'

This is just a sketch, so you'll probably need to tweak it a little.
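For reference, a self-contained variation on the same line-by-line idea might look like the following. It is a sketch rather than a drop-in replacement: `capture_price_lines` is a hypothetical name, the sample lines are a shortened copy of the question's file, and each `&nbsp;` is replaced with a space so the joined result reads naturally.

```python
def capture_price_lines(lines):
    """Collect text from the line containing 'be between' through 'per share',
    even when 'per' and 'share' fall on consecutive lines."""
    price = []
    is_capturing = False
    prev_ended_with_per = False
    for raw in lines:
        line = raw.rstrip().replace("&nbsp;", " ")
        if not is_capturing:
            if "be between" not in line:
                continue  # still before the passage of interest
            is_capturing = True
        if "per share" in line:
            # Both words on one line: keep the line up to the end of 'per share'.
            price.append(line[:line.find("per share") + len("per share")])
            break
        if prev_ended_with_per and line.lstrip().startswith("share"):
            # 'per' ended the previous line: keep this line up to the end of 'share'.
            price.append(line[:line.find("share") + len("share")])
            break
        price.append(line)
        prev_ended_with_per = line.endswith("per")
    # Join the captured pieces and collapse runs of whitespace.
    return " ".join(" ".join(price).split())

sample = [
    "market for our common stock. The initial public offering price\n",
    "of our common stock is expected to be between\n",
    "$&nbsp;&nbsp;&nbsp;and\n",
    "$&nbsp;&nbsp;per\n",
    "share. We intend to list our common stock on the Nasdaq National\n",
]
result = capture_price_lines(sample)
print([result])
# ['of our common stock is expected to be between $ and $ per share']
```

With a real file, the list could be replaced by the open file object, since the function only iterates over lines.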

Upvotes: 0
