Sometimes people find they have been indexed by an indexing robot, or that a resource discovery robot has visited part of a site that for some reason shouldn't be visited by robots. In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:
A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, at http://.../robots.txt.
A Web author can indicate whether a page may be indexed, or analysed for links, through the use of a special HTML META tag (an example appears below).
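For instance, a page that should be neither indexed nor followed for links can include the following tag in its <head>; note that which directives a given robot honours varies from robot to robot:

    <!-- Ask cooperating robots not to index this page or follow its links -->
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">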
The remainder of this page provides full details on these facilities. Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
Some example robots.txt files:

To exclude all robots from the entire site:

    User-agent: *
    Disallow: /

To allow all robots complete access:

    User-agent: *
    Disallow:

To exclude all robots from the /cgi-bin/, /tmp/ and /private/ directories:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/

To exclude only the empas robot from the entire site:

    User-agent: empas
    Disallow: /

To give the webCrawler robot complete access:

    User-agent: webCrawler
    Disallow:
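On the robot side, cooperation means fetching /robots.txt and checking each URL against its rules before requesting it. A minimal sketch using Python's standard urllib.robotparser module, assuming a hypothetical site example.com that serves the third example file above:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (example.com is a placeholder host).
    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    # Ask whether a given user-agent may fetch a given URL.
    print(rp.can_fetch("*", "http://example.com/private/report.html"))  # False
    print(rp.can_fetch("*", "http://example.com/index.html"))           # True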