• Super User

    Inktomi/Slurp does not respect robots.txt

    My honeypots (http://en.wikipedia.org/wiki/Honeypot_%28electronics%29) tell me that Inktomi/Slurp (Yahoo! Search) does not respect robots.txt.

    Example:

    User-agent: *
    Disallow: /honeypot/
    

    This directory is periodically requested by Slurp, e.g.:

    66.196.91.15 (http://whois.sc/66.196.91.15)
    User-agent: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

    Is anyone else seeing the same behavior?


  • Super User

    I reported the problem to Yahoo! Search. I'm pasting their reply here, since I think it may be of general interest:

    Hello Xxx,

    Thanks for writing to Yahoo! Search and Directory Support.

    I have investigated the issue, and it seems that your robots.txt is written in a way that prevents our crawler from recognizing the excluded directories. When Slurp accesses the robots.txt, it searches for either user-agent: slurp OR * (asterisk); once it finds one or the other, it obeys the exclusions listed there and does not look further. In your specific case, that means Slurp will only see the following:

    User-Agent: Slurp
    Crawl-Delay: 20

    Since Slurp does not look any further, it does not see the exclusions for user-agent: *. To remedy this, I would suggest adding the excluded folder to the user-agent: slurp part of your robots.txt, so it will read:

    User-Agent: Slurp
    Crawl-Delay: 20
    Disallow: /honeypot/
    Disallow: /etc/

    Once you update this, Slurp should stop crawling those folders within a day or two. I apologize for the inconvenience.

    For answers to other questions you may have regarding Yahoo! Search, please see:
    http://help.yahoo.com/help/us/ysearch/

    For answers to other questions you may have regarding the Yahoo! Directory, please see:
    http://help.yahoo.com/help/us/dir/

    Xxx

    Search & Directory Support
    Yahoo! Inc.

    Original Message Follows:

    Mail-Id: xxx
    I'd like to ask about ...
    Reporting a Problem

    What is your name?
    Xxx Xxx

    Please enter your question, comment, or suggestion:
    Hello,

    Slurp is not respecting the robots.txt file at
    http://www.example.com/robots.txt
    (/honeypot/ is disallowed for all user-agents).

    ============================================
    2006-02-22 (Wed) 14:03:02

    "HTTP_HOST: www.example.com"
    "HTTP_ACCEPT: */*"
    "HTTP_USER_AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
    "HTTP_ACCEPT_ENCODING: gzip, x-gzip"
    "REMOTE_ADDR: 66.196.91.137"
    "SERVER_PROTOCOL: HTTP/1.0"
    "REQUEST_METHOD: GET"
    "QUERY_STRING: "
    "REQUEST_URI: /honeypot/"

    Xxx Xxx
    Webmaster @ Example.com

    While Viewing:

    Yahoo ID: unknown : no amt link
    Browser: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
    REMOTE_ADDR: xxx.xxx.xxx.xxx
    REMOTE_HOST: unknown
    Date Originated: Wednesday February 22, 2006 - 05:42:15
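
The group-selection behavior described in Yahoo!'s reply can be reproduced offline. The sketch below is only an illustration: it uses Python's standard `urllib.robotparser`, not Slurp's actual parser, and it assumes that parser selects groups the same way (a group naming the crawler takes precedence over the `*` group, and the two are not merged). The contents of `ORIGINAL` are an assumed reconstruction of the file from the snippets quoted in this thread, limited to the `/honeypot/` rule.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Assumed reconstruction of the original file: a catch-all group plus a
# Slurp-specific group that only sets a crawl delay.
ORIGINAL = """\
User-agent: *
Disallow: /honeypot/

User-agent: Slurp
Crawl-delay: 20
"""

# The same file after repeating the exclusion in the Slurp group,
# as suggested in the reply.
FIXED = """\
User-agent: *
Disallow: /honeypot/

User-agent: Slurp
Crawl-delay: 20
Disallow: /honeypot/
"""

URL = "http://www.example.com/honeypot/"

before = parse_robots(ORIGINAL)
after = parse_robots(FIXED)

# The Slurp group matches first and contains no Disallow lines,
# so the honeypot is considered fetchable for Slurp.
print(before.can_fetch("Slurp", URL))         # True
# Crawlers without their own group fall back to the "*" group.
print(before.can_fetch("SomeOtherBot", URL))  # False
# Once the rule is repeated in the Slurp group, Slurp is excluded too.
print(after.can_fetch("Slurp", URL))          # False
```

Running a check like this before deploying a robots.txt that contains crawler-specific groups makes it easy to spot a rule that exists only under `User-agent: *` and is therefore skipped by any crawler that has a group of its own.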