Reputation: 199
Python 2.7. I am trying to write the result of a "robot check" into a new column while iterating over a data frame (although I suppose this applies in other circumstances). I have tried:
import robotparser
import urlparse
import pandas as pd
df = pd.DataFrame(dict(A=['http://www.python.org',
                          'http://www.junksiteIamtellingyou.com']))
df
                                      A
0                 http://www.python.org
1  http://www.junksiteIamtellingyou.com
agent_name = 'Test'
for i in df['A']:
    try:
        parser = robotparser.RobotFileParser()
        parser.set_url(urlparse.urljoin(i,"robots.txt"))
        parser.read()
    except Exception as e:
        df['Robot'] = 'No Robot.txt'
    else:
        df['Robot'] = parser.can_fetch(agent_name, i)
df
                                      A         Robot
0                 http://www.python.org  No Robot.txt   <<<-- NOT CORRECT
1  http://www.junksiteIamtellingyou.com  No Robot.txt
What is happening, of course, is that the last value of the iteration overwrites the entire column. The value of Robot for the first row should be True (which can be demonstrated by deleting the junk URL from the data frame).
I have tried some different permutations of .loc, but can't get them to work. They always seem to add rows as opposed to update the new column for the existing row.
So, is there a way to specify the column being updated (with the function result)? Perhaps using .loc(location), or perhaps there is another way such as using lambda? I would appreciate your help.
Upvotes: 1
Views: 1176
Reputation: 879103
There is an apply for that:
import robotparser
import urlparse
import pandas as pd
df = pd.DataFrame(dict(A=['http://www.python.org',
                          'http://www.junksiteIamtellingyou.com']))

def parse(i, agent_name):
    # Return the can_fetch() result for one URL, or a marker string on failure.
    try:
        parser = robotparser.RobotFileParser()
        parser.set_url(urlparse.urljoin(i, "robots.txt"))
        parser.read()
    except Exception as e:
        return 'No Robot.txt'
    else:
        return parser.can_fetch(agent_name, i)

# apply calls parse once per value in column A; the results line up row by row.
df['Robot'] = df['A'].apply(parse, args=('Test',))
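Since the question also asks about updating a single row of the column, here is a minimal sketch of the explicit-loop version that writes one cell at a time with .at (the scalar counterpart of .loc). To keep it runnable without network access, fake_check is a hypothetical stand-in for the robots.txt lookup, not part of robotparser or pandas, and its return values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['http://www.python.org',
                         'http://www.junksiteIamtellingyou.com']})

def fake_check(url, agent_name):
    """Hypothetical stand-in for the robots.txt check; does no network access."""
    if url == 'http://www.python.org':
        return True
    return 'No Robot.txt'

# Create the column up front so it has object dtype and can hold
# both booleans and strings.
df['Robot'] = None

# Series.items() yields (index label, value) pairs.
for idx, url in df['A'].items():
    # .at[row_label, column_label] writes exactly one cell,
    # unlike df['Robot'] = value, which fills the whole column.
    df.at[idx, 'Robot'] = fake_check(url, 'Test')

robot_values = df['Robot'].tolist()
print(df)
```

This should give the same column as the apply call; apply is usually the shorter way to write it, while the loop may be easier to follow if each row needs extra handling.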
Upvotes: 4