
Jul 5, 2016

Design a web crawler

1. Single machine, single thread

- Start from a pool of seed URLs
- Issue an HTTP GET for each URL
- Parse the response and identify new URLs worth crawling further
- Add the new URLs to the pool and keep crawling, skipping duplicates with a bloom filter (see section 3); a minimal sketch of the loop follows below
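
A minimal single-threaded sketch of this loop, using only the Python standard library. The seed URLs and page limit are placeholders, and extract_links is the parsing helper sketched under section 4 below:

```python
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=100):
    # FIFO pool of URLs to fetch; a plain set stands in for the
    # bloom filter of section 3 in this toy version
    pool = deque(seed_urls)
    seen = set(seed_urls)
    pages = []
    while pool and len(pages) < max_pages:
        url = pool.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable URLs
        pages.append((url, html))
        for link in extract_links(url, html):  # see section 4
            if link not in seen:
                seen.add(link)
                pool.append(link)
    return pages
```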


2. Crawling policy
- How often should a page be re-crawled? Pages change at different rates, so re-fetch frequency has to trade freshness against wasted bandwidth.
- Respect robots.txt, which tells crawlers which paths on a host they may fetch (a check is sketched below).
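
A sketch of the robots.txt check using the standard library's urllib.robotparser; the user-agent string is a placeholder, and a real crawler would cache one parser per host instead of re-fetching robots.txt on every call:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url, user_agent="MyCrawler"):  # user-agent name is hypothetical
    parts = urlparse(url)
    # robots.txt always lives at the root of the host
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the host's robots.txt
    return rp.can_fetch(user_agent, url)
```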

3. De-dup
Use a bloom filter: it tests "have we already seen this URL?" in constant space per element, at the cost of a small false-positive rate (a new URL is occasionally skipped, which is acceptable for crawling).
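
A toy bloom filter to make the idea concrete; the bit-array size and hash count below are arbitrary, and a real deployment would size them from the expected URL count and target false-positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # derive k bit positions from salted digests of the item
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage: after `seen.add(url)`, `url in seen` returns True; a URL never added returns False except for rare hash collisions.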

4. Parse
Extract outgoing links from the fetched HTML, resolving relative hrefs against the page's URL.
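
A link-extraction sketch with the standard library's html.parser; this supplies the extract_links helper used in the section 1 loop (a production crawler would likely use a more forgiving parser such as lxml):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # collects href targets of <a> tags, resolved against the page URL
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```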

5. DNS - a bottleneck
Every fetch begins with a DNS lookup, so at crawl scale the resolver itself becomes a bottleneck; high-performance crawlers run a local caching resolver or cache resolutions in-process.
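
A sketch of in-process caching, assuming socket.getaddrinfo as the resolver; TTL expiry is omitted for brevity:

```python
import socket

_dns_cache = {}  # hostname -> IP address; a real cache would honor TTLs

def resolve(hostname):
    if hostname not in _dns_cache:
        # each result is a (family, type, proto, canonname, sockaddr) tuple
        info = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
        _dns_cache[hostname] = info[0][4][0]  # address from the first result
    return _dns_cache[hostname]
```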



References:
- http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
- Shkapenyuk, V. and Suel, T. "Design and Implementation of a High-Performance Distributed Web Crawler." ICDE 2002.
