Getting data from url and putting it into DataFrame

Question

Hi everyone, I am currently trying to get some data from URLs and then predict which category each article belongs to. So far I have written this, but it raises an error:

    info = pd.read_csv('labeled_urls.tsv',sep='	',header=None)
    html, category = [], []
    for i in info.index:
        response = requests.get(info.iloc[i,0])
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>','', 
                      str(soup.findAll(['p','h1','\href="/avtorji/'])))])
        category.append(info.iloc[0,i])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

And the error is this:

IndexError: single positional indexer is out-of-bounds.

Can someone help me please?

jottbe · Accepted Answer

You can avoid the iloc call and use iterrows instead. Since your loop runs over the index labels, loc would have been the matching accessor rather than iloc, but using iloc or loc inside a loop is generally not that efficient anyway. You can try the following code (with a waiting time inserted between requests):

    import re
    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i, row in info.iterrows():
        # row[columnname] works here as well (I only use iloc because I don't know the column names)
        url = row.iloc[0]
        time.sleep(2.5)  # wait 2.5 seconds between requests
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>', '',
                     str(soup.findAll(['p', 'h1', '\href="/avtorji/'])))])
        # the following iloc was probably raising the error, because it accesses the ith column in the first row of your df
        # category.append(info.iloc[0,i])
        # not sure which field you wanted here; this assumes the category is in the second column.
        # Adjust the position, or replace it by row['name'] if your columns are named.
        category.append(row.iloc[1])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category
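To make the cause of the error concrete, here is a minimal sketch that needs no network access. The three-row frame and its labels are invented stand-ins for labeled_urls.tsv, and it assumes the file has the URL in the first column and the category in the second. With more rows than columns, `iloc[0, i]` (row 0, column i) runs past the last column, while `iloc[i, 1]` (row i, column 1) does not.

```python
import pandas as pd

# Hypothetical stand-in for labeled_urls.tsv: column 0 = URL, column 1 = label.
info = pd.DataFrame([
    ["https://example.com/a", "sport"],
    ["https://example.com/b", "politics"],
    ["https://example.com/c", "culture"],
])


def collect_labels_buggy(df):
    """Mimics the question: iloc[0, i] reads row 0, column i."""
    return [df.iloc[0, i] for i in df.index]


def collect_labels_fixed(df):
    """Reads row i, column 1 instead (assumes labels are in the second column)."""
    return [df.iloc[i, 1] for i in df.index]


# The buggy version fails as soon as i reaches the number of columns (here i == 2).
try:
    collect_labels_buggy(info)
    raised_index_error = False
except IndexError:
    raised_index_error = True

labels = collect_labels_fixed(info)
print(raised_index_error)
print(labels)
```

The exact wording of the IndexError message may differ between pandas versions, so the sketch only checks the exception type.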

If you really only need the URL in your loop, you can replace:

    for i, row in info.iterrows():
        url = row.iloc[0]

with something like:

    for url in info[put_the_name_of_the_url_column_here]:  # or info.iloc[:, 0] as proposed by serge
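If both the URL and the category are needed, one possible variation is to iterate over the two columns together with `zip` and build the DataFrame in one step. The sketch below is only an illustration: `build_frame`, `fetch_text`, the column positions 0 and 1, and the sample data are all assumptions, and the fetcher is faked with a dictionary so the example runs offline.

```python
import pandas as pd


def build_frame(info, fetch_text, url_col=0, category_col=1):
    """Pair each URL with its label and fetch the page text for it.

    fetch_text is any callable that takes a URL and returns a string
    (hypothetical here; in practice it would wrap requests.get and
    BeautifulSoup as in the answer above).
    """
    rows = []
    for url, category in zip(info[url_col], info[category_col]):
        rows.append({"html": fetch_text(url), "category": category})
    return pd.DataFrame(rows, columns=["html", "category"])


# Made-up data and a fake fetcher so the sketch runs without network access.
info = pd.DataFrame([
    ["https://example.com/a", "sport"],
    ["https://example.com/b", "politics"],
])
fake_pages = {
    "https://example.com/a": "Match report",
    "https://example.com/b": "Election news",
}

data = build_frame(info, fake_pages.get)
print(data)
```

Passing the fetcher in as an argument keeps the loop itself free of network code, which makes it easy to try on a few rows before running it against the real URLs.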
