Reputation: 520
I'm trying to automate a text-scraping process that fetches a URL, pulls in the following text, and converts it into a DataFrame (please note this is just an extract; the entire text is about 5 paragraphs in total):
text = """On June 15, 10 offices in the business on reported 324 new cases of
faulty equipment (22 technicalities: 11 in Bristol, 3 in Brussels, 2 in
Berlin, 1 in Burma, 1 in Boston, 1 in Belarus; 302 customer-reported, 228 in
Bristol, 24 in Brussels, 22 in Berlin…)"""
Here is what I've got so far; a fellow Stack Overflow user helped me set up the DataFrame:
from dateutil import parser
import numpy as np
import pandas as pd
empty_df = pd.DataFrame(columns = ["total cases","technicalities", "customer-reported", "bristol","brussels","berlin","burma","boston","belarus"])
date = parser.parse(text.split(',')[0]).strftime('%d %B %Y')
foo = text.split()
area= ["total cases","technicalities", "customer-reported"]
region = ["bristol","brussels","berlin","burma","boston","belarus"]
total_no = []
for i in foo:
    if i in area:
        total_no.append(foo[foo.index(i) - 1])
    else i in region:
        total_no.append(foo[foo.index(i) - 2])
empty_df.loc[len(empty_df)] = total_no + [date]
empty_df.replace('no', np.nan)
I was expecting this to work, but the for loop doesn't seem to like my code, and when I run the region loop separately, the distinct values for each region under the different categories of faulty equipment don't show up:
e.g. Bristol value currently outputs 11 and 11 (i.e. just the first value found), not the intended 11 (technicalities) and 228 (customer-reported).
This is a pretty tough challenge, but the text I'm working with doesn't change in formatting, so I'd really appreciate some help in telling Python how to distinguish between the two. What I want as an output is the following:
                   all cases  bristol  brussels  berlin
Total faults             324      239        27      24
Technicalities            22       11         3       2
Customer-reported        302      228        24      22
Upvotes: 0
Views: 37
Reputation: 500
I think the main problem would be this line:
total_no.append(foo[foo.index(i) - 2])
The foo.index(i) call here returns the index of the first instance of the given region each time, so every later occurrence of the same word resolves to the same position.
Instead, you could use enumerate and go for something like this:
for i, word in enumerate(foo):
    if word in area:
        total_no.append(foo[i - 1])
    elif word in region:
        total_no.append(foo[i - 2])
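To make the difference concrete, here is a minimal sketch comparing the two lookups. The token list is a simplified, hypothetical stand-in for text.split(): in the real text the tokens would look like "Bristol," (capitalized, with trailing punctuation), so they would need lowercasing and stripping before a membership test against region could match.

```python
# Simplified, hypothetical tokens standing in for text.split()
foo = ["11", "in", "bristol", "3", "in", "brussels", "228", "in", "bristol"]
region = ["bristol", "brussels"]

# list.index always returns the position of the FIRST match,
# so the second "bristol" resolves to the same count as the first
via_index = [foo[foo.index(w) - 2] for w in foo if w in region]

# enumerate yields the actual position of each occurrence
via_enumerate = [foo[i - 2] for i, w in enumerate(foo) if w in region]

print(via_index)      # ['11', '3', '11']
print(via_enumerate)  # ['11', '3', '228']
```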
FYI: Personally, I think the code is rather messy, and I would opt for regex instead, but I'm not very experienced in scraping, so I'll let someone else weigh in with their opinion.
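For what it's worth, a regex version might look roughly like the sketch below. It is only a sketch and assumes the layout of the extract holds throughout: the headline total appears as "reported N new cases", the breakdown sits inside parentheses, categories are separated by ";", and each regional count is written as "N in Place". The function name parse_report and the row/column labels are made up for illustration. Regions cut off by the "…" are left as NaN rather than guessed.

```python
import re
import pandas as pd

text = ("On June 15, 10 offices in the business on reported 324 new cases of "
        "faulty equipment (22 technicalities: 11 in Bristol, 3 in Brussels, 2 in "
        "Berlin, 1 in Burma, 1 in Boston, 1 in Belarus; 302 customer-reported, 228 in "
        "Bristol, 24 in Brussels, 22 in Berlin…)")

def parse_report(text):
    """Hypothetical helper: one row per category plus a totals row."""
    # Headline figure, e.g. "reported 324 new cases"
    total = int(re.search(r"reported\s+(\d+)\s+new\s+cases", text).group(1))
    # Everything between the outermost parentheses
    inner = re.search(r"\((.*)\)", text, flags=re.DOTALL).group(1)

    rows = {}
    for section in inner.split(";"):
        # The first "N label" pair names the category, e.g. "22 technicalities"
        count, label = re.search(r"(\d+)\s+([A-Za-z-]+)", section).groups()
        row = {"all cases": int(count)}
        # Every "N in Place" pair is a regional count within that category
        for n, place in re.findall(r"(\d+)\s+in\s+([A-Za-z]+)", section):
            row[place.lower()] = int(n)
        rows[label.capitalize()] = row

    df = pd.DataFrame.from_dict(rows, orient="index")
    # Sum a region only when every category has a value for it;
    # otherwise the total stays NaN instead of being understated
    totals = df.sum(min_count=len(df))
    totals["all cases"] = total
    df.loc["Total faults"] = totals
    return df.loc[["Total faults"] + list(rows)]

df = parse_report(text)
print(df[["all cases", "bristol", "brussels", "berlin"]])
```

Because each category is parsed from its own section of the text, the same region name can appear under both categories without the counts colliding.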
Upvotes: 1