
1. Purpose

The goal is to learn Python by building a web crawler with it, and to end up with a crawler that can scrape the pages I want: Naver Webtoon (돌아온 럭키짱, 신의 탑, ...), Naver Cast, and various other web pages.


2. Required Skills

HTML, CSS, JavaScript - analyzing web pages
Python - writing the crawler


3. Progress


3.1. Reference Documents

http://docs.python.org/
http://hyogeun.tistory.com/107 - try, except.


3.2. Getting Started


3.2.1. Fetching a Web Page's Source


import urllib2

# Build the request for the page to fetch.
req = urllib2.Request('http://9632024.tistory.com/974')

try:
    response = urllib2.urlopen(req)
except urllib2.URLError as e:
    # Network or HTTP failure: report the reason and skip the save.
    print e.reason
else:
    # Save the page source to a local file.
    with open("test1.html", "w") as fo:
        fo.write(response.read())

http://coreapython.hosting.paran.com/howto/HOWTO Fetch Internet Resources Using urllib2.htm
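
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same fetch-and-save step might look like the sketch below. The helper name save_page and its parameters are illustrative assumptions, not part of the original script.

```python
import urllib.error
import urllib.request


def save_page(url, path):
    """Fetch url and write the raw bytes to path. Return True on success."""
    try:
        # urlopen raises URLError (or its subclass HTTPError) on failure.
        with urllib.request.urlopen(url) as response:
            data = response.read()
    except urllib.error.URLError as e:
        print(e.reason)
        return False
    # Write the bytes unchanged so the page's own encoding is preserved.
    with open(path, "wb") as fo:
        fo.write(data)
    return True
```

Calling save_page('http://9632024.tistory.com/974', 'test1.html') would then correspond to the script in this section.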


3.2.2. Extracting Only the URLs from the Source

# Read the saved page source and write each quoted URL on its own line.
fo1 = open("test1.html", "r")
fo2 = open("test2.html", "w")

for line in fo1:
    # Find the first double-quoted string on the line that starts with http.
    pos = line.find('"http')
    if pos != -1:
        start = pos + 1
        end = line.find('"', start)  # closing quote
        if end != -1:
            fo2.write(line[start:end] + "\n")

fo1.close()
fo2.close()
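
The loop above records at most one URL per line of source. As a rough sketch of one way to collect every quoted URL, a regular expression could be used instead. The names URL_PATTERN and extract_urls are hypothetical, and this is a plain text match under the same '"http' assumption as the loop, not a real HTML parser.

```python
import re

# A double-quoted string that starts with http, e.g. "http://..." or "https://...".
URL_PATTERN = re.compile(r'"(http[^"]*)"')


def extract_urls(html):
    """Return every double-quoted http(s) URL in html, in order of appearance."""
    return URL_PATTERN.findall(html)
```

The returned list could be written to test2.html one URL per line, the format the download step reads.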


3.2.3. Downloading Files

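import urllib

# test2.html holds one URL per line (the output of the previous step).
fo = open("test2.html", "r")
for line in fo:
    url = line.strip()  # drop the trailing newline
    if url:
        # Name the local file after the last path component of the URL.
        urllib.urlretrieve(url, url.split('/')[-1])

fo.close()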

http://www.wellho.net/resources/ex.php4?item=y108/bejo.py
split

line = 'http://cfile23.uf.tistory.com/original/2001D2044C945F80495C6F'
line.split('/')[-1] == '2001D2044C945F80495C6F'
line.split('/')[-2] == 'original'

say = "This is a line of text"
part = say.split(' ')
part == ['This', 'is', 'a', 'line', 'of', 'text']
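
Taking the last element of split('/') works for URLs like the one above, but it returns an empty string when the URL ends in '/' and keeps any query string. The following is a minimal Python 3 sketch of a more defensive variant. The function name filename_from_url and the fallback name 'index.html' are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="index.html"):
    """Return the last path component of url, or default if there is none."""
    # strip() removes the newline left over from reading a line of a file.
    path = urlsplit(url.strip()).path
    # URL paths always use '/', so posixpath is the appropriate splitter.
    name = posixpath.basename(path)
    return name or default
```
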
swap

Python 2.7.2+ (default, Oct  4 2011, 20:03:08) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> first = 1
>>> second = 2
>>> first, second = second, first
>>> print first
2
>>> print second
1
>>> first, second = second, first
>>> third = 3
>>> first, second, third = third, first, second
>>> print first, second, third
3 1 2

retrieve
urllib.urlretrieve(url[, filename[, reporthook[, data]]])
http://docs.python.org/library/urllib.html
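
The optional reporthook argument in the signature above is a callback that urlretrieve calls with three values: the number of blocks transferred so far, the block size in bytes, and the total file size (which may be -1 when the size is unknown). A small sketch of such a hook follows. The names percent_done and show_progress are hypothetical.

```python
def percent_done(block_num, block_size, total_size):
    """Return download progress as a percentage, or None if the size is unknown."""
    if total_size <= 0:
        # The total size is reported as -1 when the server does not send it.
        return None
    downloaded = block_num * block_size
    # The last block is usually partial, so cap the value at 100.
    return min(100.0, downloaded * 100.0 / total_size)


def show_progress(block_num, block_size, total_size):
    """A reporthook: print the progress each time it is called."""
    percent = percent_done(block_num, block_size, total_size)
    if percent is None:
        print("%d bytes so far" % (block_num * block_size))
    else:
        print("%.1f%%" % percent)
```

It would be passed as the third argument, for example urllib.urlretrieve(url, filename, show_progress).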


3.2.4. Creating Directories

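import os

# Work inside ./folder (it must already exist).
os.chdir(os.path.join(os.getcwd(), 'folder'))

def create_dir(folder):
    # Create the subdirectory under the current directory if it is missing.
    mdir = os.path.join(os.getcwd(), folder)
    print mdir
    if not os.path.isdir(mdir):
        os.mkdir(mdir, 0755)

# One directory per file type.
types = ['mp3', 'jpg', 'txt']
for t in types:
    create_dir(t)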

os.chdir(path) - Change the current working directory to path.
os.getcwd() - Return a string representing the current working directory.
os.path.isdir(path) - Return True if path is an existing directory.
os.mkdir(path, mode) - Create a directory named path with numeric mode mode. If the directory already exists, OSError is raised.
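
Because os.mkdir raises OSError when the directory already exists, an alternative to checking os.path.isdir first is to attempt the call and ignore only that one error. The sketch below takes that approach. The names ensure_dir and create_type_dirs are illustrative, and the default directory names simply mirror the mp3/jpg/txt example above.

```python
import errno
import os


def ensure_dir(path, mode=0o755):
    """Create path if it is missing; do nothing if it is already a directory."""
    try:
        os.mkdir(path, mode)
    except OSError as e:
        # Ignore only "already exists as a directory"; re-raise anything else.
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise


def create_type_dirs(base, names=("mp3", "jpg", "txt")):
    """Create one subdirectory of base per name and return their paths."""
    paths = [os.path.join(base, name) for name in names]
    for p in paths:
        ensure_dir(p)
    return paths
```

Running create_type_dirs a second time on the same base is then harmless.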

http://docs.python.org/library/os.html
http://docs.python.org/library/os.path.html#module-os.path
mode -
d - directory flag
r - read permission
w - write permission
x - execute permission

d / rwx / r-x / r-x

directory / owner permissions / group permissions / permissions for everyone else

r(4)w(2)x(1)
755 -> drwxr-xr-x
http://snowbora.com/343
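
The mapping between the octal digits and the rwx string can be checked from Python itself. This sketch assumes Python 3.3 or later, where stat.filemode is available. The helper names mode_string and digit are made up for this example.

```python
import stat


def mode_string(mode, is_dir=True):
    """Return the ls-style string (e.g. 'drwxr-xr-x') for a numeric mode."""
    if is_dir:
        # Add the file-type bits that mark the entry as a directory.
        mode |= stat.S_IFDIR
    else:
        mode |= stat.S_IFREG
    return stat.filemode(mode)


def digit(perms):
    """Sum r=4, w=2, x=1 for a three-character string such as 'r-x'."""
    values = {"r": 4, "w": 2, "x": 1, "-": 0}
    return sum(values[c] for c in perms)
```

Here mode_string(0o755) gives 'drwxr-xr-x', and digit('rwx'), digit('r-x'), digit('r-x') give 7, 5, 5.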


3.3. Things to Improve

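Downloaded files are saved next to the source files, which makes the directory messy.
It downloads everything, including files that are not needed.
I have some Python code I wrote for grabbing jpg files from the internet. Do you need it? - 서민관
- That would be much appreciated, haha - 권영기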
http://git-scm.com/
http://wiki.kldp.org/wiki.php/SubversionBook/BranchingAndMerging
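
One possible way to address the two problems listed in this section is to keep only URLs whose extension is on a whitelist and to give each file a target path inside a per-type directory. The Python 3 sketch below only plans the downloads and does not fetch anything. The names WANTED and plan_downloads and the base directory 'downloads' are assumptions. Note that extension-less URLs, such as the tistory example in 3.2.3, would be skipped by this filter.

```python
import os
import posixpath
from urllib.parse import urlsplit

# Hypothetical whitelist: only these extensions are kept.
WANTED = ("mp3", "jpg", "txt")


def plan_downloads(urls, base="downloads", wanted=WANTED):
    """Map each wanted URL to a target path of the form base/<ext>/<filename>."""
    plan = []
    for url in urls:
        url = url.strip()
        name = posixpath.basename(urlsplit(url).path)
        ext = posixpath.splitext(name)[1].lstrip(".").lower()
        if ext in wanted:
            plan.append((url, os.path.join(base, ext, name)))
    return plan
```

Each (url, path) pair could then be handed to urlretrieve once the target directory has been created.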

권영기/web crawler (rev. 1.23)
