Reputation:
I am trying to get allowed and disallowed parts of a user agent in robots.txt file of netflix website using following code:-
robots="""
User-agent: *
Disallow: /
User-agent: googlebot
User-agent: Googlebot-Video
User-agent: bingbot
User-agent: Baiduspider
User-agent: Baiduspider-mobile
User-agent: Baiduspider-video
User-agent: Baiduspider-image
User-agent: NaverBot
User-agent: Yeti
User-agent: Yandex
User-agent: YandexBot
User-agent: YandexMobileBot
User-agent: YandexVideo
User-agent: YandexWebmaster
User-agent: YandexSitelinks
User-agent: SeznamBot
Allow: /
Disallow: /accountstatus
Disallow: /AccountStatus
Disallow: /aui/inbound
Disallow: /authenticate
Disallow: /autologin
Disallow: /clearcookies
Disallow: /companies
Disallow: /dvdterms
Disallow: /editpayment
Disallow: /emailunsubscribe
Disallow: /error
Disallow: /eula
Disallow: /geooverride
Disallow: /help
Disallow: /imagelibrary
Disallow: /learnmorelayer
Disallow: /learnmorelayertv
Disallow: /login
Disallow: /loginhelp
Disallow: /loginhelp/lookup
Disallow: /loginhelpsucess
Disallow: /LoginHelp
Disallow: /password
Disallow: /logout
Disallow: /Logout
Disallow: /mcd
Disallow: /modernizr
Disallow: /n/
Disallow: /notamember
Disallow: /notfound
Disallow: /notices
Disallow: /nrdapp
Disallow: /optout
Disallow: /overviewblockseeother
Disallow: /popup/codewhatisthis
Disallow: /popupdetails
Disallow: /PopupDetails
Disallow: /popupprivacypolicy
Disallow: /privacypolicychanges
Disallow: /registration
Disallow: /rememberme
Disallow: /signout
Disallow: /signurl
Disallow: /subscriptioncancel
Disallow: /tastesurvey
Disallow: /termsofusechanges
Disallow: /tvsignup
Disallow: /upcomingevents
Disallow: /verifyidentity
Disallow: /whysecure
Disallow: /arabic
Disallow: /Arabic
Disallow: /chinese
Disallow: /Chinese
Disallow: /korean
Disallow: /Korean
Disallow: /airtel
Disallow: /anan
Disallow: /bouyguestelecom
Disallow: /britishairways
Disallow: /brutus
Disallow: /comhem
Disallow: /courts
Disallow: /csl
Disallow: /elisa
Disallow: /entertain
Disallow: /FireTV
Disallow: /firetv
Disallow: /freemonth
Disallow: /kpn
Disallow: /lg
Disallow: /maxis
Disallow: /Maxis
Disallow: /meo
Disallow: /Meo
Disallow: /orangefrance
Disallow: /Panasonic
Disallow: /panasonic
Disallow: /playstation
Disallow: /proximus
Disallow: /qantas
Disallow: /samsung
Disallow: /Sony
Disallow: /sony
Disallow: /talktalk
Disallow: /tdc
Disallow: /telenor
Disallow: /telfort
Disallow: /tim
Disallow: /virginaustralia
Disallow: /vodafone
Disallow: /vodafonedemobilelaunch
Disallow: /xboxone
Disallow: /xfinity
Disallow: /xs4all
Disallow: /ziggo
Disallow: /accountaccess
Disallow: /AccountAccess
Disallow: /activate
Disallow: /Activate
Disallow: /app
Disallow: /BillingActivity
Disallow: /browse
Disallow: /browse/*
Allow: /browse/genre/*
Disallow: /CancelPlan
Disallow: /ChangePlan
Disallow: /changeplan
Disallow: /deviceManagement
Disallow: /DoNotTest
Disallow: /EditProfiles
Disallow: /email
Disallow: /EmailPreferences
Disallow: /entrytrap
Disallow: /HdToggle
Disallow: /LanguagePreferences
Disallow: /ManageDevices
Disallow: /ManageProfiles
Disallow: /MoviesYouveSeen
Disallow: /MyListOrder
Disallow: /NewWatchInstantlyRSS
Disallow: /NewWatchInstantlyRSS/*
Disallow: /payment
Disallow: /Payment
Disallow: /phonenumber
Disallow: /pin
Disallow: /profiles
Disallow: /profiles/*
Disallow: /ProfilesGate
Disallow: /search
Disallow: /search/*
Disallow: /viewingactivity
Disallow: /WiViewingActivity
Disallow: /yourAccount
Disallow: /youraccount
Disallow: /YourAccount
Disallow: /YourAccountPayment
User-agent: AdsBot-Google
User-agent: Twitterbot
User-agent: Adidxbot
Allow: /
User-agent: Yahoo Pipes 1.0
User-agent: Facebot
User-agent: externalfacebookhit
Disallow: /
"""
strt=0
ad=0
robots=''.join(robots.lower().split(' '))
for line in robots.split('\n'):
if line!='':
if ('user-agent:yeti' in line or strt==1) or ('user-agent' not in line and ad==0):
strt=1
print(line)
if 'allow' in line or 'disallow' in line:
ad=1
I am using this code to print out allowed and disallowed parts of user agent yeti but it's little confusing. Can anyone suggest regex or improve this code. I am using python here.
Upvotes: 2
Views: 949
Reputation: 492
The following script will read the robots.txt file from top to bottom splitting on newline. Most likely you won't be reading robots.txt from a string, but something more like an iterator.
When the User-agent label is found, start creating a list of user agents. Multiple user agents share a set of Disallowed/Allowed permissions.
When an Allowed or Disallowed label is identified, emit that permission for each user-agent associated with the permission block.
Emitting the data in this manner will allow you to sort or aggregate the data for whichever use case you need.
def robot_permissions(permission_string):
user_agents = []
new_block = True
for l in permission_string.split("\n"):
clean_l = l.strip()
if len(clean_l) > 0:
(tag, value) = l.split(":")
tag = tag.strip()
value = value.strip()
if tag == "User-agent":
if new_block:
user_agents = []
new_block = False
user_agents.append(value)
else:
new_block = True
for agent in user_agents:
yield (tag, value, agent)
def agent_filter(piter, filter_agent):
for tag, value, agent in piter:
if agent == filter_agent:
yield (tag, value, agent)
if __name__ == "__main__":
piter = robot_permissions(robots)
for p in agent_filter(piter, "Yeti"):
print(p)
('Allow', '/', 'Yeti')
('Disallow', '/accountstatus', 'Yeti')
('Disallow', '/AccountStatus', 'Yeti')
('Disallow', '/aui/inbound', 'Yeti')
('Disallow', '/authenticate', 'Yeti')
('Disallow', '/autologin', 'Yeti')
('Disallow', '/clearcookies', 'Yeti')
('Disallow', '/companies', 'Yeti')
('Disallow', '/dvdterms', 'Yeti')
('Disallow', '/editpayment', 'Yeti')
('Disallow', '/profiles/*', 'Yeti')
('Disallow', '/ProfilesGate', 'Yeti')
('Disallow', '/search', 'Yeti')
('Disallow', '/search/*', 'Yeti')
('Disallow', '/viewingactivity', 'Yeti')
('Disallow', '/WiViewingActivity', 'Yeti')
('Disallow', '/yourAccount', 'Yeti')
('Disallow', '/youraccount', 'Yeti')
('Disallow', '/YourAccount', 'Yeti')
('Disallow', '/YourAccountPayment', 'Yeti')
Upvotes: 1