I found another article that explains how to "optimize" SMF:
www . simplemachines.org/community/index.php?topic=251309.0
I also found a robots.txt file:
###################################
# YouPosted.com Smart Robots v3.05
###################################
# This is a smart robots.txt which logs the ip and user agent of every visitor.
# Due to the compatibility issues between different bots and whether they support
# wildcards (*), multiple user-agents and end-anchors ($), I am providing different
# blocks for some.
# Detected Spider/Bot: None
# Headers Sent:
# Content-Type: text/plain
# Expires: Mon, 13 Oct 2008 03:16:05 GMT (12 hour validity)
# My Sitemap - I don't provide it just for the fun of it
Sitemap: www . youposted.com/sitemap.xml
# Google - Most Important bot
# Unfortunately a robots.txt will only stop it crawling certain urls, and NOT adding any
# urls which it comes across into its index. So we're relying on a meta noindex tag.
User-agent: Googlebot
# Don't index mobile versions
Disallow: /index.php?;wap
Disallow: /index.php?;wap2
Disallow: /index.php?*;imode
# Yahoo - Too aggressive
# So limit it as much as possible.
User-agent: Slurp
# Disallow Everything
Disallow: /
# Now allow bits and then disallow bits
Allow: /sitemap.xml$
Allow: /robots.txt$
Allow: /index.php$
Allow: /index.php?topic=.0$
Allow: /index.php?topic=.0$
Allow: /index.php?topic=.5$
Allow: /index.php?board=.0$
Allow: /index.php?board=*.0$
Allow: /index.php?board=.*5$
# But don't allow these
Disallow: /index.php?.msg
Disallow: /index.php?topic=.msg0$
Disallow: /index.php?topic=.msg5$
Disallow: /index.php?.new
# Anything with a ; disallow
Disallow: /index.php?;
# Arcade Related
Allow: /index.php?action=arcade$
Allow: /index.php?action=stats$
Allow: /index.php?action=arcade;sa=play;game=
# Bad bot - Often ignores robots.txt - Waste of bandwidth
# Despite claiming on their website to be a search engine in development
# I'm suspicious as to whether they are a harvester pretending to be SE
User-agent: Twiceler
Disallow: /
User-agent: W3C-checklink
Disallow: /
# Stop following PHPSESSID's
User-agent: MJ12bot
Disallow: /index.php?PHPSESSID
# Catch all (remainder)
# Will be followed by any bots other than ones identified above
# Uses BASIC robots.txt directives without wildcards, end-anchors etc
# So Spiders should understand these (including MSNBOT)
User-agent: *
# Default SMF Folders
Disallow: /attachments/
Disallow: /Packages/
Disallow: /Smileys/
Disallow: /Sources/
Disallow: /Themes/
# Default SMF Actions
Disallow: /index.php?action=activate
Disallow: /index.php?action=admin
Disallow: /index.php?action=calendar
Disallow: /index.php?action=emailuser
Disallow: /index.php?action=findmember
Disallow: /index.php?action=help
Disallow: /index.php?action=helpadmin
Disallow: /index.php?action=login
Disallow: /index.php?action=logout
Disallow: /index.php?action=mlist
Disallow: /index.php?action=modifykarma
Disallow: /index.php?action=pm
Disallow: /index.php?action=post
Disallow: /index.php?action=printpage
Disallow: /index.php?action=profile
Disallow: /index.php?action=recent
Disallow: /index.php?action=register
Disallow: /index.php?action=reminder
Disallow: /index.php?action=search
Disallow: /index.php?action=theme
Disallow: /index.php?action=unread
Disallow: /index.php?action=unreadreplies
Disallow: /index.php?action=verificationcode
Disallow: /index.php?action=who
Disallow: /index.php?theme
# SMF Mod Related
Disallow: /archive.php
Disallow: /index.php?action=blog
Disallow: /index.php?action=viewblog
Disallow: /index.php?action=chess
Disallow: /index.php?action=comment
Disallow: /index.php?action=downloads
Disallow: /index.php?action=links
Disallow: /index.php?action=reporttm
Disallow: /index.php?action=recenttopics
Disallow: /index.php?action=mm
Disallow: /index.php?action=sitemap
Disallow: /index.php?action=staff
Disallow: /index.php?action=tags
Disallow: /index.php?action=thankyou
Disallow: /index.php?action=viewkarma
Disallow: /index.php?action=viewers
Disallow: /index.php?f=
Disallow: /index.php?filter
Disallow: /index.php?referredby
Disallow: /Games/
Disallow: /Downloads/
Disallow: /index.php?action=arcade;favorites
Disallow: /index.php?action=arcade;sa=highscore
Disallow: /index.php?action=arcade;sa=play;random
Disallow: /index.php?action=arcade;category
Disallow: /index.php?action=arcade;sort
Disallow: /index.php?action=arcade;stats
Disallow: /index.php?action=stats;expand
Disallow: /index.php?action=stats;collapse
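
To see how the wildcard (*) and end-anchor ($) rules in the file above would be read by a bot that supports them, here is a minimal Python sketch. The function names `rule_matches` and `is_allowed` are my own (hypothetical), and the precedence rule (the longest matching pattern wins, Allow wins a tie) is the one described in RFC 9309 and in Google's documentation; I'm assuming it here, and bots from 2008 such as Slurp may have resolved conflicts differently. Empty patterns are not handled.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path (with query)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; every other character is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start, so an unanchored pattern acts as a prefix.
    return re.match(regex, path) is not None


def is_allowed(rules, path: str) -> bool:
    """rules: list of ("allow" | "disallow", pattern) pairs for one user-agent.

    Assumed precedence: longest matching pattern wins, Allow wins a tie,
    and a path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if not rule_matches(pattern, path):
            continue
        is_allow = kind == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


# A few rules copied from the Slurp block above.
slurp_rules = [
    ("disallow", "/"),
    ("allow", "/index.php$"),
    ("allow", "/index.php?board=*.0$"),
    ("disallow", "/index.php?;"),
]

for path in ("/index.php", "/index.php?board=3.0", "/index.php?action=profile"):
    print(path, "->", "allowed" if is_allowed(slurp_rules, path) else "blocked")
```

Note that under this reading a pattern without a wildcard, such as `/index.php?;wap` as pasted above, only matches URLs that literally begin with that string, whereas `/index.php?*;imode` matches `;imode` anywhere after the `?`.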
Out of curiosity I ran the query site:www . youposted.com (the site this robots.txt file comes from), and I have to say it is well indexed.
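
Since the Googlebot comment in that file says it relies on a meta noindex tag rather than on robots.txt alone, it can be useful to check whether a given page actually carries one. Below is a rough Python sketch using only the standard library; the class and function names (`RobotsMetaParser`, `has_noindex`) are hypothetical, and it only looks at `<meta name="robots">`, not at bot-specific meta names or HTTP headers. Keep in mind that a crawler can only see the tag on pages it is allowed to fetch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser already lowercases tag and attribute names.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))
```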