[[tableofcontents]]

= Purpose =
Build a web crawler in Python, both to learn how to use Python and to scrape the web pages we want (Naver Webtoon (돌아온 럭키짱, 신의 탑, ...), Naver Cast, and various other web pages).

= Required Skills =
 * HTML, CSS, JavaScript: web page analysis
 * Python

= Progress =
== Reference Documentation ==
 * http://docs.python.org/

== Getting Started ==
=== Fetching a Web Page's Source ===
{{{
# urllib.request is the Python 3 home of urllib2's Request and urlopen.
import urllib.error
import urllib.request

req = urllib.request.Request('http://9632024.tistory.com/974')
try:
    # Fetch the page once and keep the raw bytes.
    with urllib.request.urlopen(req) as response:
        html = response.read()
except urllib.error.URLError as e:
    print(e.reason)
else:
    # Save the source exactly as received; binary mode avoids encoding issues.
    with open("test1.html", "wb") as fo:
        fo.write(html)
}}}
 * http://coreapython.hosting.paran.com/howto/HOWTO%20Fetch%20Internet%20Resources%20Using%20urllib2.htm

=== Extracting Only the URLs from the Source ===
{{{
# Assumes the saved page is UTF-8; undecodable bytes are replaced.
with open("test1.html", "r", encoding="utf-8", errors="replace") as fo1, \
        open("test2.html", "w", encoding="utf-8") as fo2:
    for line in fo1:
        # Find the first double-quoted value on the line that starts with http.
        pos = line.find('"http')
        if pos != -1:
            # Copy the characters after the opening quote up to the closing quote.
            for c in line[pos + 1:]:
                if c == '"':
                    fo2.write("\n")
                    break
                fo2.write(c)
}}}
 * http://docs.python.org/tutorial/controlflow.html
 * http://docs.python.org/tutorial/inputoutput.html

=== Downloading Files ===
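
The download step might look something like the following. This is only a minimal sketch, assuming Python 3 and assuming the extracted URLs are stored one per line in test2.html, as produced by the extraction step above. The helper names `filename_from_url` and `download_all`, the `index.html` fallback name, and the destination directory are hypothetical and do not come from the original page.

```python
import os
import urllib.error
import urllib.parse
import urllib.request


def filename_from_url(url, default="index.html"):
    """Derive a local file name from the last path segment of a URL."""
    path = urllib.parse.urlparse(url).path
    name = os.path.basename(path)
    # URLs ending in "/" have no last segment, so fall back to a default name.
    return name or default


def download_all(list_path, dest_dir):
    """Download every URL listed (one per line) in list_path into dest_dir.

    Returns the list of local paths that were written successfully.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    with open(list_path, "r", encoding="utf-8") as url_file:
        for line in url_file:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            target = os.path.join(dest_dir, filename_from_url(url))
            try:
                with urllib.request.urlopen(url) as response, \
                        open(target, "wb") as out:
                    out.write(response.read())
            except (urllib.error.URLError, ValueError) as e:
                # Report the failure and move on to the next URL.
                print("failed:", url, e)
                continue
            saved.append(target)
    return saved
```

Calling `download_all("test2.html", "downloads")` would then try each extracted URL in turn. Two URLs that end in the same file name overwrite each other in this sketch, so a real crawler would probably need a smarter naming scheme.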