1. single machine, single thread
- start from a pool of seed URLs
- issue HTTP GET requests for those URLs
- parse each fetched page and extract the URLs we want to crawl further
- add the new URLs into the pool and keep crawling; skip duplicates, e.g. with a bloom filter (a minimal sketch follows this list)
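A minimal sketch of this single-machine, single-thread loop, assuming Python and only the standard library; the frontier is a deque, a plain set stands in for the bloom filter covered in section 3, and `extract_links` is a placeholder for the parsing step in section 4.

```python
from collections import deque
from urllib.request import urlopen

def extract_links(base_url, html):
    """Placeholder for the parsing step (see the sketch in section 4)."""
    return []

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # pool of URLs still to crawl
    seen = set(seed_urls)                # stand-in for a bloom filter
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:   # issue HTTP GET
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                     # skip pages that fail to fetch
        fetched += 1
        for link in extract_links(url, html):
            if link not in seen:         # no duplicate crawling
                seen.add(link)
                frontier.append(link)

if __name__ == "__main__":
    crawl(["https://example.com/"])
```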
2. crawling policy
- how often should an already-crawled page be re-fetched?
- respect robots.txt (a sketch of the check follows)
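A hedged sketch of the robots.txt side of the crawling policy using Python's standard urllib.robotparser; the user-agent string "MyCrawler" and the example host are assumptions.

```python
from urllib.parse import urlparse, urlunparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler"   # assumed user-agent string

def robots_for(url):
    """Download and parse robots.txt for the host of the given URL."""
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                          # fetch and parse robots.txt
    return rp

def allowed(rp, url):
    return rp.can_fetch(USER_AGENT, url)

# usage
rp = robots_for("https://example.com/some/page")
print(allowed(rp, "https://example.com/some/page"))
print(rp.crawl_delay(USER_AGENT))      # suggested delay between requests, or None
```

A Crawl-delay value, when present, also feeds into the re-fetch scheduling question above.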
3. de-dup
- use a bloom filter to test whether a URL has already been seen (sketch below)
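A small bloom filter sketch built only on hashlib, so the memory cost per seen URL stays roughly constant; the bit-array size and number of hash functions below are illustrative, not tuned for a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, small chance of false positives."""

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by salting the hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# usage
seen = BloomFilter()
seen.add("https://example.com/")
print("https://example.com/" in seen)    # True
print("https://example.com/x" in seen)   # almost certainly False
```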
4. parse
- extract outgoing links (and any content of interest) from the fetched HTML (sketch below)
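A parsing sketch using the standard html.parser module to pull anchor hrefs out of a page and resolve them against the page URL; real crawlers usually normalize further (lowercasing hosts, dropping tracking parameters, etc.).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url, _frag = urldefrag(urljoin(self.base_url, value))
                if url.startswith(("http://", "https://")):
                    self.links.add(url)

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# usage
print(extract_links("https://example.com/a/",
                    '<a href="../b">b</a> <a href="https://other.org/">o</a>'))
```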
5. DNS - bottleneck
- every new host needs a lookup before the HTTP request, so synchronous DNS resolution can dominate crawl time; caching resolved addresses helps (sketch below)
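A hedged sketch of a tiny in-process cache around socket.getaddrinfo, with an assumed TTL of 300 seconds; real deployments often rely on a local caching resolver or asynchronous DNS instead.

```python
import socket
import time

_dns_cache = {}      # hostname -> (expiry_time, address)
_DNS_TTL = 300       # assumed cache lifetime in seconds

def resolve(hostname):
    """Return an IP address for hostname, caching results to cut repeated lookups."""
    now = time.monotonic()
    entry = _dns_cache.get(hostname)
    if entry and entry[0] > now:
        return entry[1]
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples.
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    address = infos[0][4][0]
    _dns_cache[hostname] = (now + _DNS_TTL, address)
    return address

# usage
print(resolve("example.com"))
print(resolve("example.com"))   # second call is served from the cache
```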
Reference:
- http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
- Design and Implementation of a High-Performance Distributed Web Crawler