user12320641

Parse allowed and disallowed parts of robots.txt file

I am trying to get the allowed and disallowed paths for a user agent from the robots.txt file of the Netflix website, using the following code:

    robots="""

    User-agent: *
    Disallow: /

    User-agent: googlebot
    User-agent: Googlebot-Video
    User-agent: bingbot
    User-agent: Baiduspider
    User-agent: Baiduspider-mobile
    User-agent: Baiduspider-video
    User-agent: Baiduspider-image
    User-agent: NaverBot
    User-agent: Yeti
    User-agent: Yandex
    User-agent: YandexBot
    User-agent: YandexMobileBot
    User-agent: YandexVideo
    User-agent: YandexWebmaster
    User-agent: YandexSitelinks
    User-agent: SeznamBot
    Allow: /

    Disallow: /accountstatus
    Disallow: /AccountStatus
    Disallow: /aui/inbound
    Disallow: /authenticate
    Disallow: /autologin
    Disallow: /clearcookies
    Disallow: /companies
    Disallow: /dvdterms
    Disallow: /editpayment
    Disallow: /emailunsubscribe
    Disallow: /error
    Disallow: /eula
    Disallow: /geooverride
    Disallow: /help
    Disallow: /imagelibrary
    Disallow: /learnmorelayer
    Disallow: /learnmorelayertv
    Disallow: /login
    Disallow: /loginhelp
    Disallow: /loginhelp/lookup
    Disallow: /loginhelpsucess
    Disallow: /LoginHelp
    Disallow: /password
    Disallow: /logout
    Disallow: /Logout
    Disallow: /mcd
    Disallow: /modernizr
    Disallow: /n/
    Disallow: /notamember
    Disallow: /notfound
    Disallow: /notices
    Disallow: /nrdapp
    Disallow: /optout
    Disallow: /overviewblockseeother
    Disallow: /popup/codewhatisthis
    Disallow: /popupdetails
    Disallow: /PopupDetails
    Disallow: /popupprivacypolicy
    Disallow: /privacypolicychanges
    Disallow: /registration
    Disallow: /rememberme
    Disallow: /signout
    Disallow: /signurl
    Disallow: /subscriptioncancel
    Disallow: /tastesurvey
    Disallow: /termsofusechanges
    Disallow: /tvsignup
    Disallow: /upcomingevents
    Disallow: /verifyidentity
    Disallow: /whysecure

    Disallow: /arabic
    Disallow: /Arabic
    Disallow: /chinese
    Disallow: /Chinese
    Disallow: /korean
    Disallow: /Korean

    Disallow: /airtel
    Disallow: /anan
    Disallow: /bouyguestelecom
    Disallow: /britishairways
    Disallow: /brutus
    Disallow: /comhem
    Disallow: /courts
    Disallow: /csl
    Disallow: /elisa
    Disallow: /entertain
    Disallow: /FireTV
    Disallow: /firetv
    Disallow: /freemonth
    Disallow: /kpn
    Disallow: /lg
    Disallow: /maxis
    Disallow: /Maxis
    Disallow: /meo
    Disallow: /Meo
    Disallow: /orangefrance
    Disallow: /Panasonic
    Disallow: /panasonic
    Disallow: /playstation
    Disallow: /proximus
    Disallow: /qantas
    Disallow: /samsung
    Disallow: /Sony
    Disallow: /sony
    Disallow: /talktalk
    Disallow: /tdc
    Disallow: /telenor
    Disallow: /telfort
    Disallow: /tim
    Disallow: /virginaustralia
    Disallow: /vodafone
    Disallow: /vodafonedemobilelaunch
    Disallow: /xboxone
    Disallow: /xfinity
    Disallow: /xs4all
    Disallow: /ziggo

    Disallow: /accountaccess
    Disallow: /AccountAccess
    Disallow: /activate
    Disallow: /Activate
    Disallow: /app
    Disallow: /BillingActivity
    Disallow: /browse
    Disallow: /browse/*
    Allow: /browse/genre/*
    Disallow: /CancelPlan
    Disallow: /ChangePlan
    Disallow: /changeplan
    Disallow: /deviceManagement
    Disallow: /DoNotTest
    Disallow: /EditProfiles
    Disallow: /email
    Disallow: /EmailPreferences
    Disallow: /entrytrap
    Disallow: /HdToggle
    Disallow: /LanguagePreferences
    Disallow: /ManageDevices
    Disallow: /ManageProfiles
    Disallow: /MoviesYouveSeen
    Disallow: /MyListOrder
    Disallow: /NewWatchInstantlyRSS
    Disallow: /NewWatchInstantlyRSS/*
    Disallow: /payment
    Disallow: /Payment
    Disallow: /phonenumber
    Disallow: /pin
    Disallow: /profiles
    Disallow: /profiles/*
    Disallow: /ProfilesGate
    Disallow: /search
    Disallow: /search/*
    Disallow: /viewingactivity
    Disallow: /WiViewingActivity
    Disallow: /yourAccount
    Disallow: /youraccount
    Disallow: /YourAccount
    Disallow: /YourAccountPayment

    User-agent: AdsBot-Google
    User-agent: Twitterbot
    User-agent: Adidxbot
    Allow: /

    User-agent: Yahoo Pipes 1.0
    User-agent: Facebot
    User-agent: externalfacebookhit
    Disallow: /
    """

    strt=0
    ad=0
    robots=''.join(robots.lower().split(' '))
    for line in robots.split('\n'):
        if line!='':
            if ('user-agent:yeti' in line or strt==1) or ('user-agent' not in line and ad==0):
                strt=1
                print(line)
                if 'allow' in line or 'disallow' in line:
                    ad=1

I am using this code to print the allowed and disallowed paths for the user agent Yeti, but it is a little confusing. Can anyone suggest a regex or a way to improve this code? I am using Python.

Upvotes: 2

Views: 949

Answers (1)

Kristian

Reputation: 492

Overview

The following script reads the robots.txt content from top to bottom, splitting on newlines. In practice you will most likely not read robots.txt from a string, but from something more like an iterator of lines.

When a User-agent label is found, the script starts building a list of user agents. Consecutive User-agent lines share the set of Allow/Disallow rules that follows them.

When an Allow or Disallow label is found, the script emits that rule once for each user agent in the current group.

Emitting the data in this manner lets you sort or aggregate it for whichever use case you need:

  • Group by User-agent
  • Group by permission: Allow / Disallow
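  • Build a dictionary of paths and their associated permission or user-agent

def robot_permissions(permission_string):
    """Yield (tag, value, agent) for every rule in a robots.txt string."""
    user_agents = []
    new_block = True
    for line in permission_string.split("\n"):
        # Drop comments and surrounding whitespace.
        clean_line = line.split("#", 1)[0].strip()
        if ":" not in clean_line:
            continue  # blank or malformed line
        # Split on the first colon only, so values such as URLs stay intact.
        tag, value = clean_line.split(":", 1)
        tag = tag.strip()
        value = value.strip()
        if tag.lower() == "user-agent":
            # A User-agent line that follows a rule starts a new group.
            if new_block:
                user_agents = []
                new_block = False
            user_agents.append(value)
        else:
            new_block = True
            for agent in user_agents:
                yield (tag, value, agent)

def agent_filter(piter, filter_agent):
    """Keep only the rules that apply to filter_agent."""
    for tag, value, agent in piter:
        if agent == filter_agent:
            yield (tag, value, agent)

if __name__ == "__main__":
    # `robots` is the robots.txt string defined in the question.
    piter = robot_permissions(robots)
    for p in agent_filter(piter, "Yeti"):
        print(p)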
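
As a side sketch of the iterator point from the overview: assuming the rules arrive as lines from a file-like object rather than as one big string, a small generator can normalize each line before any grouping is done. The name `iter_directives` and the sample text are illustrative assumptions, not part of the script above.

```python
import io


def iter_directives(lines):
    """Yield (tag, value) pairs from any iterable of robots.txt lines."""
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank line, or something that is not a directive
        tag, _, value = line.partition(":")
        yield tag.strip(), value.strip()


# Hypothetical sample; an open file object can be passed in the same way.
sample = io.StringIO(
    "User-agent: Yeti\n"
    "Allow: /\n"
    "\n"
    "Disallow: /login  # private area\n"
)
directives = list(iter_directives(sample))
print(directives)
```

Because the function only iterates over its argument, it never needs the whole file in memory, and the grouping logic can consume its pairs one at a time.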

Head of the output that the robot_permissions script prints for Yeti:

('Allow', '/', 'Yeti')
('Disallow', '/accountstatus', 'Yeti')
('Disallow', '/AccountStatus', 'Yeti')
('Disallow', '/aui/inbound', 'Yeti')
('Disallow', '/authenticate', 'Yeti')
('Disallow', '/autologin', 'Yeti')
('Disallow', '/clearcookies', 'Yeti')
('Disallow', '/companies', 'Yeti')
('Disallow', '/dvdterms', 'Yeti')
('Disallow', '/editpayment', 'Yeti')

Tail of the output that the robot_permissions script prints for Yeti:

('Disallow', '/profiles/*', 'Yeti')
('Disallow', '/ProfilesGate', 'Yeti')
('Disallow', '/search', 'Yeti')
('Disallow', '/search/*', 'Yeti')
('Disallow', '/viewingactivity', 'Yeti')
('Disallow', '/WiViewingActivity', 'Yeti')
('Disallow', '/yourAccount', 'Yeti')
('Disallow', '/youraccount', 'Yeti')
('Disallow', '/YourAccount', 'Yeti')
('Disallow', '/YourAccountPayment', 'Yeti')
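
The "group by User-agent" idea from the overview could look roughly like this. It is a minimal sketch that assumes tuples in the (tag, value, agent) shape shown in the output above; the helper name `group_by_agent` and the short `rules` list are made up for illustration.

```python
from collections import defaultdict

# A few tuples in the (tag, value, agent) shape shown in the output above.
rules = [
    ("Allow", "/", "Yeti"),
    ("Disallow", "/accountstatus", "Yeti"),
    ("Disallow", "/search", "Yeti"),
    ("Allow", "/", "Twitterbot"),
]


def group_by_agent(rule_iter):
    """Return {agent: {tag: [paths]}} built from (tag, value, agent) tuples."""
    grouped = defaultdict(lambda: defaultdict(list))
    for tag, value, agent in rule_iter:
        grouped[agent][tag].append(value)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {agent: dict(tags) for agent, tags in grouped.items()}


by_agent = group_by_agent(rules)
print(by_agent["Yeti"]["Disallow"])
```

Since `group_by_agent` accepts any iterable, the generator returned by the parsing script could be passed in directly instead of the hand-written list.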

Upvotes: 1
