1. single machine, single thread
- start from a pool of seed URLs
- issue HTTP GET requests for those URLs
- parse each fetched page and extract the URLs we want to crawl further
- add the new URLs into the pool and keep crawling; skip duplicates, e.g. with a bloom filter (a minimal sketch follows this list)
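A minimal sketch of this single-machine, single-thread loop, assuming Python and only the standard library; the frontier is a deque, a plain set stands in for the bloom filter covered in section 3, and `extract_links` is a placeholder for the parsing step in section 4.

```python
from collections import deque
from urllib.request import urlopen

def extract_links(base_url, html):
    """Placeholder for the parsing step (see the sketch in section 4)."""
    return []

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # pool of URLs still to crawl
    seen = set(seed_urls)                # stand-in for a bloom filter
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:   # issue HTTP GET
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                     # skip pages that fail to fetch
        fetched += 1
        for link in extract_links(url, html):
            if link not in seen:         # no duplicate crawling
                seen.add(link)
                frontier.append(link)

if __name__ == "__main__":
    crawl(["https://example.com/"])
```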
2. crawling policy
- how often should an already-crawled page be re-fetched?
- respect robots.txt (a sketch of the check follows)
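A hedged sketch of the robots.txt side of the crawling policy using Python's standard urllib.robotparser; the user-agent string "MyCrawler" and the example host are assumptions.

```python
from urllib.parse import urlparse, urlunparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler"   # assumed user-agent string

def robots_for(url):
    """Download and parse robots.txt for the host of the given URL."""
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                          # fetch and parse robots.txt
    return rp

def allowed(rp, url):
    return rp.can_fetch(USER_AGENT, url)

# usage
rp = robots_for("https://example.com/some/page")
print(allowed(rp, "https://example.com/some/page"))
print(rp.crawl_delay(USER_AGENT))      # suggested delay between requests, or None
```

A Crawl-delay value, when present, also feeds into the re-fetch scheduling question above.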
3. de-dup
- use a bloom filter to test whether a URL has already been seen (sketch below)
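A small bloom filter sketch built only on hashlib, so the memory cost per seen URL stays roughly constant; the bit-array size and number of hash functions below are illustrative, not tuned for a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, small chance of false positives."""

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by salting the hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# usage
seen = BloomFilter()
seen.add("https://example.com/")
print("https://example.com/" in seen)    # True
print("https://example.com/x" in seen)   # almost certainly False
```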
4. parse
- extract outgoing links (and any content of interest) from the fetched HTML (sketch below)
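A parsing sketch using the standard html.parser module to pull anchor hrefs out of a page and resolve them against the page URL; real crawlers usually normalize further (lowercasing hosts, dropping tracking parameters, etc.).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url, _frag = urldefrag(urljoin(self.base_url, value))
                if url.startswith(("http://", "https://")):
                    self.links.add(url)

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# usage
print(extract_links("https://example.com/a/",
                    '<a href="../b">b</a> <a href="https://other.org/">o</a>'))
```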
5. DNS - bottleneck
- every new host needs a lookup before the HTTP request, so synchronous DNS resolution can dominate crawl time; caching resolved addresses helps (sketch below)
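A hedged sketch of a tiny in-process cache around socket.getaddrinfo, with an assumed TTL of 300 seconds; real deployments often rely on a local caching resolver or asynchronous DNS instead.

```python
import socket
import time

_dns_cache = {}      # hostname -> (expiry_time, address)
_DNS_TTL = 300       # assumed cache lifetime in seconds

def resolve(hostname):
    """Return an IP address for hostname, caching results to cut repeated lookups."""
    now = time.monotonic()
    entry = _dns_cache.get(hostname)
    if entry and entry[0] > now:
        return entry[1]
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples.
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    address = infos[0][4][0]
    _dns_cache[hostname] = (now + _DNS_TTL, address)
    return address

# usage
print(resolve("example.com"))
print(resolve("example.com"))   # second call is served from the cache
```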
Reference:
- http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
- Design and Implementation of a High-Performance Distributed Web Crawler