mandella

Reputation: 180

Getting data from url and putting it into DataFrame

Hi everyone. I am trying to fetch some articles from URLs and then predict which category each article belongs to. This is what I have so far, but it raises an error:

    info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
    html, category = [], []
    for i in info.index:
        response = requests.get(info.iloc[i,0])
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>','', 
                      str(soup.findAll(['p','h1','\href="/avtorji/'])))])
        category.append(info.iloc[0,i])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

And the error is this:

IndexError: single positional indexer is out-of-bounds.

Can someone help me please?

Upvotes: 0

Views: 1365

Answers (2)

jottbe

Reputation: 4521

You can avoid the iloc call and use iterrows instead. Since you were iterating over the index, I think loc would be the matching accessor rather than iloc (with the default integer index the labels and positions happen to coincide). In any case, using iloc and loc inside loops is generally not that efficient. You can try the following code (with a waiting time inserted):

    import re
    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i, row in info.iterrows():
        # you can use row[columnname] here as well (I only use iloc because I don't know the column names)
        url = row.iloc[0]
        time.sleep(2.5)  # wait 2.5 seconds between requests
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>', '',
                            str(soup.findAll(['p', 'h1', '\href="/avtorji/'])))])
        # the following iloc was probably raising the error, because it accesses the ith column in the first row of your df
        # category.append(info.iloc[0,i])
        category.append(row.iloc[0])  # not sure which field you wanted here; replace it with row['name'] for the category column

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

If you really only need the URL inside your loop, you can replace:

    for i, row in info.iterrows():
        url = row.iloc[0]

with something like:

    for url in info[put_the_name_of_the_url_column_here]:  # or info.iloc[:, 0], as proposed by Serge
        ...  # rest of the loop body unchanged
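
If the category is needed as well, both columns can be iterated together instead of going row by row. The following is only a minimal sketch, assuming column 0 holds the URL and column 1 holds the category (the question does not confirm this layout). `collect_pages`, its `fetch` parameter, and the example rows are hypothetical names and data introduced here for illustration:

```python
import time

import pandas as pd


def collect_pages(info, fetch, delay=0.0):
    """Build a DataFrame of page text and category from a two-column frame.

    Assumption: column 0 holds the URL and column 1 the category label.
    `fetch` is any callable that maps a URL to the page text.
    """
    html, category = [], []
    # zip walks both columns in step, so no positional indexing is needed
    for url, label in zip(info[0], info[1]):
        time.sleep(delay)  # pause between requests
        html.append(fetch(url))
        category.append(label)
    return pd.DataFrame({"html": html, "category": category})


# Made-up rows standing in for the TSV file; the lambda stands in for a real download
info = pd.DataFrame([["http://example.com/a", "sport"],
                     ["http://example.com/b", "politics"]])
data = collect_pages(info, fetch=lambda url: "text of " + url)
```

For real pages, `fetch` could be something like `lambda url: requests.get(url).text`, with `delay=2.5` to keep the waiting time from the code above. Passing the download step in as a parameter is just a convenience so the loop can be tried without network access.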

Upvotes: 1

Serge Ballesta

Reputation: 148880

The error is likely caused by passing index values to iloc: loc expects index values and column names, while iloc expects the numerical positions of rows and columns. Furthermore, you have interchanged the row and column positions for category in category.append(info.iloc[0,i]). So you should at least do:

    for i in range(len(info)):
        response = requests.get(info.iloc[i,0])
        ...
        category.append(info.iloc[i,0])

But since you are iterating over the first column of a dataframe, the code above is not Pythonic. It is better to use the column directly:

    for url in info.loc[:, 0]:
        response = requests.get(url)
        ...
        category.append(url)
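
To see the interchanged positions in action without any downloads, here is a small sketch on a made-up frame with three rows and two columns (the URLs, labels, and the helper name `first_failing_index` are invented for illustration). With `iloc[0, i]` the loop variable is used as a column position, so it fails as soon as `i` reaches the number of columns:

```python
import pandas as pd

# Made-up frame standing in for labeled_urls.tsv: 3 rows, 2 columns
info = pd.DataFrame([["http://example.com/a", "sport"],
                     ["http://example.com/b", "politics"],
                     ["http://example.com/c", "culture"]])


def first_failing_index(frame):
    """Return the first i for which frame.iloc[0, i] raises IndexError."""
    for i in frame.index:
        try:
            frame.iloc[0, i]  # row 0, column i: the indexing from the question
        except IndexError:
            return i
    return None


# There are only columns 0 and 1, so the lookup breaks at i == 2
failing = first_failing_index(info)

# Row i, column 0 stays in bounds for every row
urls = [info.iloc[i, 0] for i in range(len(info))]
```

This would also explain why the question's loop can run for the first couple of iterations before the IndexError appears.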

Upvotes: 1
