Pryore

Reputation: 520

Python advanced scraping - multiple values from text

I'm trying to automate a text-scraping process that fetches a URL, pulls in the following text, and converts it into a DataFrame (note that this is just an extract; the full text is about 5 paragraphs long):

text = "On June 15, 10 offices in the business on reported 324 new cases of 
faulty equipment (22 technicalities: 11 in Bristol, 3 in Brussels, 2 in 
Berlin, 1 in Burma, 1 in Boston, 1 in Belarus; 302 customer-reported, 228 in 
Bristol, 24 in Brussels, 22 in Berlin…)"

Here's what I have so far; a fellow Stack user helped me set up the DataFrame:

from dateutil import parser
import numpy as np
import pandas as pd

empty_df = pd.DataFrame(columns = ["total cases","technicalities", "customer-reported", "bristol","brussels","berlin","burma","boston","belarus"])

date = parser.parse(text.split(',')[0]).strftime('%d %B %Y')

foo = text.split()
area= ["total cases","technicalities", "customer-reported"]
region = ["bristol","brussels","berlin","burma","boston","belarus"]

total_no = []
for i in foo:
    if i in area:
       total_no.append(foo[foo.index(i) - 1])
    else i in region:
       total_no.append(foo[foo.index(i) - 2])

empty_df.loc[len(empty_df)] = total_no + [date]

empty_df.replace('no', np.nan)

I expected this to work, but the for loop fails, and when I run the region loop on its own, the separate values for each region under the different categories of faulty equipment don't show up:

e.g. the Bristol value currently comes out as 11 and 11 (i.e. just the first value found), not the intended 11 (technicalities) and 228 (customer-reported).

This is a pretty tough challenge, but the formatting of the text I'm working with doesn't change, so I'd really appreciate some help telling Python how to distinguish between these values. What I want as output is the following:

                    all cases   bristol   brussels   berlin
Total faults              324       239         27       24
Technicalities             22        11          3        2
Customer-reported         302       228         24       22

Upvotes: 0

Views: 37

Answers (1)

ad2969

Reputation: 500

I think the main problem would be this line:

total_no.append(foo[foo.index(i) - 2])

Here foo.index(i) returns the index of the first occurrence of the given region every time, so a region that appears more than once always resolves to the same position.

Instead, you could use enumerate and go for something like this:

for i, word in enumerate(foo):
    if word in area:
        total_no.append(foo[i - 1])
    elif word in region:
        total_no.append(foo[i - 2])
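To make the difference concrete, here is a small sketch run against the sample text from the question. One assumption on my part: the tokens are stripped of punctuation and lowercased first (the `tokens` list below is my own addition), because `text.split()` yields strings like "Bristol," that would never match the lowercase entries in `region`.

```python
text = (
    "On June 15, 10 offices in the business on reported 324 new cases of "
    "faulty equipment (22 technicalities: 11 in Bristol, 3 in Brussels, 2 in "
    "Berlin, 1 in Burma, 1 in Boston, 1 in Belarus; 302 customer-reported, 228 in "
    "Bristol, 24 in Brussels, 22 in Berlin…)"
)

# Normalise tokens: "Bristol," -> "bristol", "(22" -> "22" (assumed preprocessing step).
tokens = [w.strip("(),:;…").lower() for w in text.split()]

# list.index always finds the FIRST "bristol", so both lookups land on the same number.
with_index = [tokens[tokens.index(w) - 2] for w in tokens if w == "bristol"]

# enumerate gives the real position of each occurrence.
with_enumerate = [tokens[i - 2] for i, w in enumerate(tokens) if w == "bristol"]

print(with_index)      # ['11', '11']
print(with_enumerate)  # ['11', '228']
```

The first result is the "11 and 11" described in the question; the second is the intended 11 and 228.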

FYI: Personally, I think the code is rather messy, and I would opt for regex instead, but I'm not very experienced in scraping, so I'll let someone else weigh in with their opinion.
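For what it's worth, a rough sketch of the regex route might look like the following. It assumes the wording stays exactly as in the sample ("<n> new cases", then "<n> technicalities: ..." and "<n> customer-reported, ..." separated by ";"), and the variable names (`rows`, `totals`, `category_pattern`) are my own. Regions cut off by the "…" in the extract are simply left out rather than guessed.

```python
import re

text = (
    "On June 15, 10 offices in the business on reported 324 new cases of "
    "faulty equipment (22 technicalities: 11 in Bristol, 3 in Brussels, 2 in "
    "Berlin, 1 in Burma, 1 in Boston, 1 in Belarus; 302 customer-reported, 228 in "
    "Bristol, 24 in Brussels, 22 in Berlin…)"
)

# Overall total: the number directly in front of "new cases".
total_cases = int(re.search(r"(\d+) new cases", text).group(1))

# Each category is "<count> <label>" followed by its region list,
# which runs until the next ";" or the closing ")".
category_pattern = r"(\d+) (technicalities|customer-reported)[:,]([^;)]*)"

rows = {}
for count, label, body in re.findall(category_pattern, text):
    row = {"all cases": int(count)}
    # Inside one category, every region appears as "<n> in <Region>".
    for n, place in re.findall(r"(\d+) in (\w+)", body):
        row[place.lower()] = int(n)
    rows[label] = row

# Sum the categories column by column to get the "total faults" row.
totals = {}
for row in rows.values():
    for column, value in row.items():
        totals[column] = totals.get(column, 0) + value
rows = {"total faults": totals, **rows}

for label, row in rows.items():
    print(label, row)
```

Because each category's regions are parsed from its own slice of the text, the two Bristol values can't get mixed up. If you need a DataFrame, `pd.DataFrame.from_dict(rows, orient="index")` should turn this dict into one, with NaN for regions missing from a category.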

Upvotes: 1
