Reputation: 180
Hi everyone, I am trying to fetch data from URLs and then predict which category each article belongs to. This is what I have so far, but it raises an error:
info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
html, category = [], []
for i in info.index:
    response = requests.get(info.iloc[i,0])
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>','',
                        str(soup.findAll(['p','h1','\href="/avtorji/'])))])
    category.append(info.iloc[0,i])
data = pd.DataFrame()
data['html'] = html
data['category'] = category
And the error is this:
IndexError: single positional indexer is out-of-bounds.
Can someone help me please?
Upvotes: 0
Views: 1365
Reputation: 4521
You can avoid the iloc call and use iterrows instead. Since you loop over info.index, which holds labels, loc would be the matching accessor rather than iloc (with the default integer index both happen to work for rows), but calling iloc or loc inside a loop is generally not efficient anyway. You can try the following code (with a waiting time inserted):
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
html, category = [], []
for i, row in info.iterrows():
    # you can use row[columnname] here as well (I only use iloc because I don't know the column names)
    url = row.iloc[0]
    time.sleep(2.5)  # wait 2.5 seconds between requests
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>', '',
                        str(soup.findAll(['p', 'h1', '\href="/avtorji/'])))])
    # the following iloc was probably raising the error, because it accesses the i-th column in the first row of your df
    # category.append(info.iloc[0,i])
    category.append(row.iloc[0])  # not sure which field you wanted here; replace it with the right one, e.g. row['name']
data = pd.DataFrame()
data['html'] = html
data['category'] = category
In case you really only need the URL in your loop, you can replace:
for i, row in info.iterrows():
    url = row.iloc[0]
with something like:
for url in info[put_the_name_of_the_url_column_here]:  # or info.iloc[:,0] as proposed by serge
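If the category is needed as well, the two columns can be walked together. The following is only a minimal sketch: it assumes the file has exactly two columns (URL first, category second), which the question does not confirm, and it uses a made-up in-memory sample instead of labeled_urls.tsv. fetch_text is a hypothetical stand-in for the requests and BeautifulSoup step, so the sketch runs without network access.

```python
import io

import pandas as pd

# Hypothetical sample standing in for labeled_urls.tsv
# (assumed layout: url <TAB> category, no header row).
sample_tsv = "http://example.com/a\tsports\nhttp://example.com/b\tpolitics\n"
info = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None)


def fetch_text(url):
    # Hypothetical placeholder for the requests.get / BeautifulSoup clean-up;
    # it only echoes the URL so that no network call is made here.
    return "text of " + url


html, category = [], []
# zip walks both columns in lockstep, so no positional indexing
# (and no chance of swapping row and column) happens inside the loop.
for url, label in zip(info[0], info[1]):
    html.append(fetch_text(url))
    category.append(label)

data = pd.DataFrame({"html": html, "category": category})
```

With header=None the columns are labelled 0 and 1, which is why info[0] and info[1] work here; with a header row you would use the column names instead.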
Upvotes: 1
Reputation: 148880
The error is likely to be caused by passing an index to iloc: loc expects index values and column names, while iloc expects the numerical positions of rows and columns. Furthermore, you have interchanged the row and column positions for category with category.append(info.iloc[0,i]), so as soon as i exceeds the number of columns the lookup is out of bounds. So you should at least do:
for i in range(len(info)):
    response = requests.get(info.iloc[i,0])
    ...
    category.append(info.iloc[i,0])
But as you are trying to iterate over the first column of a dataframe, the code above is not Pythonic. It is better to use the column directly:
for url in info.loc[:, 0]:
    response = requests.get(url)
    ...
    category.append(url)
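To make the difference between loc and iloc concrete, here is a small sketch on a made-up frame (3 rows, 2 columns; the values and the index labels 10, 20, 30 are assumptions chosen only for illustration). It also reproduces the kind of failure from the question: asking iloc for a column position that does not exist raises an IndexError, though the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical frame: row labels deliberately differ from row positions.
info = pd.DataFrame(
    [["u0", "a"], ["u1", "b"], ["u2", "c"]],
    index=[10, 20, 30],
)

by_label = info.loc[20, 0]      # row labelled 20, column labelled 0
by_position = info.iloc[1, 0]   # second row, first column (same cell)

try:
    # Row 0, column position 2: the frame only has columns 0 and 1,
    # which mirrors info.iloc[0, i] once i reaches the column count.
    info.iloc[0, 2]
    error_message = None
except IndexError as exc:
    error_message = str(exc)
```

Both lookups return the same cell, while the out-of-range column position leaves a non-empty error_message.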
Upvotes: 1