Shielding Your Site from Public Search Engines and Yale’s Google Search Appliance
The techniques outlined on this page will cause search engines to bypass your website content so it will not appear in either public or Yale search results. These techniques are not designed to protect confidential content from public view. Please contact Yale's Webmaster if you need a more confidential means to share web materials with Yale colleagues.
Techniques for web professionals
There are three options to prevent search engine “crawlers” from finding and indexing your site content:
- Asking us to include your site in the main "robots.txt" file on Yale's web server to prevent indexing by search engines
- Using HTML “meta” tags to control search engine access to specific pages within your site
- Asking us to include your site or server URL in our “do not search” list on the Yale Google Search Appliance
Using robots.txt files
A robots.txt file is a simple text file, located in the main directory of the web server, that is used to exclude or control search engine access. The robots.txt file on www.yale.edu is maintained by the Yale Webmaster. Please email email@example.com to have your site added to robots.txt.
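As a rough illustration of how a crawler reads such a file, the sketch below parses a hypothetical robots.txt entry with Python's standard urllib.robotparser module. The path "/mysite/" and the helper name is_crawlable are placeholders invented for this example; they are not the actual rules on www.yale.edu.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt entry. "/mysite/" is a placeholder path,
# not a real rule from the Yale web server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mysite/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The excluded directory is off limits; everything else stays crawlable.
print(is_crawlable(ROBOTS_TXT, "https://www.yale.edu/mysite/page.html"))
print(is_crawlable(ROBOTS_TXT, "https://www.yale.edu/other/page.html"))
```

Under these assumed rules, the first URL is reported as not crawlable and the second as crawlable. Keep in mind that robots.txt is only a request: it relies on the crawler choosing to honor it.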
The “do not search” list on Yale’s Google Search Appliance
You can contact the Yale Webmaster and tell us the Web site URL or Web server URL that you wish to exclude from Yale’s Google Search Appliance master index. This will exclude your site from local searches using the Yale Google Search Appliance, but will not exclude your site from public Web search services unless you also use other techniques (like robots.txt files or meta tags) to shield your content from public Web search engines. If you have confidential information within your Web site and need a higher level of access control, please contact the Yale Webmaster for alternative ways to control the security of your Web content.
Using HTML meta tags to control search engine access to specific pages
You can use standard HTML meta tags to tell search engine crawlers that you do not want them to index the content of the HTML file. Add this meta tag to the <head> section of your HTML page:
<meta name="robots" content="noindex, nofollow">
This will exclude the contents of that particular Web page from search engine indexing, and will prevent search engine crawlers from following links on the page.
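If you want to confirm that a page actually carries the tag, a small check along the following lines may help. It is a minimal sketch using Python's standard html.parser module; the class name RobotsMetaParser, the helper robots_directives, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the page's meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical sample page containing the tag shown above.
PAGE = """<html><head>
<title>Example page</title>
<meta name="robots" content="noindex, nofollow">
</head><body><p>Hello</p></body></html>"""

print(sorted(robots_directives(PAGE)))
```

For the sample page this reports both the noindex and nofollow directives; a page without the tag yields an empty set, which suggests crawlers will treat it as indexable.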
A warning about all these “no search” techniques
These search exclusion techniques do not work instantly. If your pages have been previously crawled and indexed, they may still appear in search results until the next time the search engine crawler visits the page and honors your “no search” request. For the Yale Google Search Appliance, this currently takes about 72 hours or less, but on public search engines like Google.com, removing your content from their indexes may take much longer (sometimes weeks) because the large public search engines do not crawl your pages very often.