Jul 6, 2016
Design Twitter trends
1. trending topic detection
- simple counting: tokenization (n-grams, dictionary-based, noun phrases, hashtags), sliding window, remove stop words, compare with historical statistics
- chi-square test
- topic-specific models
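The simple-counting approach can be sketched as follows. This is a minimal sketch, not a production detector: the thresholds, the add-one smoothing, and the flat token stream are all assumptions.

```python
from collections import Counter

def detect_trends(window_tokens, history_counts, history_total,
                  min_ratio=5.0, min_count=10):
    """Flag tokens whose frequency in the current sliding window far
    exceeds their historical baseline frequency (simple counting)."""
    window_counts = Counter(window_tokens)
    window_total = sum(window_counts.values())
    trends = []
    for token, count in window_counts.items():
        if count < min_count:
            continue  # too rare in this window to be a reliable trend
        # add-one smoothing so tokens unseen in history don't divide by zero
        baseline = (history_counts.get(token, 0) + 1) / (history_total + 1)
        current = count / window_total
        if current / baseline >= min_ratio:
            trends.append((token, current / baseline))
    return sorted(trends, key=lambda t: -t[1])
```

A chi-square test would replace the plain ratio with a significance statistic, which is less sensitive to low counts.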
2. event detection
- online clustering
- online clustering using MinHash
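A minimal sketch of MinHash-based online clustering. The hash count and the exact-signature bucketing are simplifying assumptions; real systems band the signatures (LSH) so that merely similar, not identical, token sets collide.

```python
import hashlib

def minhash_signature(tokens, num_hashes=8):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the token set. The probability two
    sets agree on one coordinate equals their Jaccard similarity."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes))

def online_cluster(tweets, num_hashes=8):
    """Assign each tweet to the cluster keyed by its MinHash signature;
    tweets with identical token sets land in the same bucket."""
    clusters = {}
    for tweet in tweets:
        key = minhash_signature(set(tweet.lower().split()), num_hashes)
        clusters.setdefault(key, []).append(tweet)
    return list(clusters.values())
```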
3. Natural language processing for tweets
Reference
Trend Detection In Twitter Social Data https://www.youtube.com/watch?v=duHxpSTmwW0
slides http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/Kostas_Trends_Sept_13_2012.pdf
Predicting the Present with Google Trends http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf
Jul 5, 2016
The USB device enumeration process
What are the steps of USB device enumeration?
(1) Get Device Descriptor. The host's first command requests the device descriptor. The SETUP packet is 8 bytes (80,06,00,01,00,00,40,00), sent to address 0, endpoint 0. The "40" means the returned data is at most 40H bytes long. In practice only one packet is returned, i.e. the first 8 bytes of the DEV_DESC[] array, which tell the host the true length of the device descriptor and the device type.
(2) Set Address. Next the host assigns the device an address by sending a packet containing the chosen address (00,05,02,00,00,00,00,00). When only one USB device is attached this address is usually 2; the maximum address is 127, so up to 127 devices can be connected under the USB protocol. Once this completes the device enters the addressed state, and from then on the host accesses it at the newly assigned address.
(3) Get Device Descriptor. The host requests the device descriptor again (80,06,00,01,00,00,12,00). Unlike the first request, the requested length (12H = 18 bytes) is the actual descriptor length, and the request goes to the address assigned by Set Address.
(4) Read the full Configuration Descriptor. The host then requests the device's complete configuration, interface, and endpoint descriptors (80,06,00,02,00,00,40,00). Since it does not yet know the true total length of the configuration descriptor, it asks for 64 bytes.
(5) Set Interface. The host sends (01,0B,00,00,00,00,00,00), setting the interface value to 0.
(6) Set Configuration. This selects which configuration the device operates under. A USB flash drive typically has a single configuration, with value 01; the host sends (00,09,01,00,00,00,00,00).
(7) If all the steps above succeed, the host has found and configured the new device; it can be used normally, and the flash-drive-specific enumeration can continue.
(8) Watching a PC enumerate a flash drive with Bus Hound shows one more request after these steps, Get Max LUN, but in practice a flash drive works fine even if that step is skipped.
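The 8-byte SETUP packets quoted above all follow the standard USB layout (bmRequestType, bRequest, wValue, wIndex, wLength, with multi-byte fields little-endian). A small sketch to decode them:

```python
import struct

def parse_setup_packet(data):
    """Decode an 8-byte USB SETUP packet: one byte each for
    bmRequestType and bRequest, then three little-endian 16-bit
    fields wValue, wIndex, wLength."""
    bmRequestType, bRequest, wValue, wIndex, wLength = struct.unpack("<BBHHH", data)
    return {"bmRequestType": bmRequestType, "bRequest": bRequest,
            "wValue": wValue, "wIndex": wIndex, "wLength": wLength}
```

Decoding the first packet (80,06,00,01,00,00,40,00) yields bRequest 06 (GET_DESCRIPTOR), wValue 0100H (descriptor type 1 = device), and wLength 40H, matching step (1).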
Design web crawler
1. single machine, single thread
- start from a pool of seed URLs
- issue an HTTP GET for each URL
- parse the response and extract the URLs we want to crawl further
- add the new URLs to the pool and keep crawling; avoid duplicates with a Bloom filter
2. crawling policy
- how often to retrieve the crawled page?
- robots.txt
3. de-dup
bloom filter
4. parse
5. DNS resolution - bottleneck
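The single-machine loop above, with a hand-rolled Bloom filter for de-dup, might look like the sketch below. The `fetch` callback (standing in for an HTTP GET) and the crude href regex are illustrative assumptions.

```python
import hashlib
import re
from collections import deque
from urllib.parse import urljoin

class BloomFilter:
    """Minimal Bloom filter: k hashed positions in an m-bit array.
    False positives are possible, false negatives are not."""
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)
    def _positions(self, item):
        for seed in range(self.k):
            h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def crawl(seed_urls, fetch, max_pages=100):
    """Single-threaded BFS crawl: pop a URL, fetch it, extract links,
    and enqueue unseen ones (membership tracked by the Bloom filter)."""
    seen, frontier, pages = BloomFilter(), deque(seed_urls), {}
    for u in seed_urls:
        seen.add(u)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        pages[url] = html
        for link in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, link)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```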
Reference:
http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
Design and Implementation of a High-Performance Distributed Web Crawler
Jul 4, 2016
Design random id generator
requirements:
- id limited to 64 bits
- ids increase over time (roughly sortable by date)
single machine
- auto-increment integer? hard to scale to the multi-server case
one ticket server
- bottleneck / single point of failure
multiple servers
- each server has its own id, or use the MAC address
- GUID: random, or MAC + timestamp
- Twitter's Snowflake
clock synchronization
-NTP
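Twitter's Snowflake layout (41-bit millisecond timestamp, 10-bit machine id, 12-bit sequence) can be sketched as below. This single-threaded version omits the locking and clock-rollback handling a real generator needs.

```python
import time

class Snowflake:
    """Sketch of a Snowflake-style 64-bit id:
    41 bits ms timestamp | 10 bits machine id | 12 bits sequence.
    Ids are roughly time-ordered and unique per machine with no
    cross-machine coordination."""
    def __init__(self, machine_id, epoch_ms=1288834974657):
        assert 0 <= machine_id < 1024
        self.machine_id, self.epoch_ms = machine_id, epoch_ms
        self.last_ms, self.seq = -1, 0

    def next_id(self):
        ms = int(time.time() * 1000) - self.epoch_ms
        if ms == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 4096 ids per ms per machine
            if self.seq == 0:  # sequence exhausted: spin to the next ms
                while ms <= self.last_ms:
                    ms = int(time.time() * 1000) - self.epoch_ms
        else:
            self.seq = 0
        self.last_ms = ms
        return (ms << 22) | (self.machine_id << 12) | self.seq
```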
reference:
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
flickr's ticket server: http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
design a cache system
1. LRU cache (least recently used)
If resource A is requested by a client:
- if A is in the cache, return it
- if A is not in the cache and the cache still has an empty slot, read A from disk into the cache and return it
- if A is not in the cache and the cache is full, evict the least recently used entry, read A from disk into the cache, and return it
Implementation of LRU: a hash table plus a doubly linked list
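A compact sketch of the cache behavior described above. Python's OrderedDict is itself a hash map threaded through a doubly linked list, so it stands in for the hash-table-plus-linked-list implementation; the `load` callback modeling the disk read is an assumption.

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache over an OrderedDict (hash map + doubly linked list)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, load=None):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        if load is None:
            return None
        value = load(key)  # cache miss: read from the backing store (disk)
        self.put(key, value)
        return value

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```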
2. Eviction policy
- random replacement (in ARM architecture)
- LRU
- least frequently used (keep a count for each entry)
- windowed LFU
3. Concurrency: two clients want to write to the same entry
- lock / per-segment (striped) lock
- commit log
4. Distributed cache:
the value stores a reference to the machine that holds the data
reference
Memcached http://www.slideshare.net/oemebamo/introduction-to-memcached
https://en.wikipedia.org/wiki/Cache_algorithms
Design facebook
Design the Facebook news seed function
Reference:
Design the Facebook timeline function
Reference:
Design the Facebook chat function
Reference:
- Erlang at Facebook
- Facebook Chat
- http://blog.gainlo.co/index.php/2016/04/19/design-facebook-chat-function/
design a key-value store
1. basic key-value: an in-memory hash table on a single machine. When memory is not enough: (a) compress the data, (b) store a reference to an external/disk file
2. distributed key-value storage
-partition data / sharding
-balance loading
-system availability - replica
- consistency - a commit log, or the coordinator holds the latest copy, or the coordinator resolves conflicts on the fly by reading from multiple servers
- reading performance: use cache (refer to design a cache system)
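One common way to do the partitioning/sharding step is consistent hashing, sketched below; the node names and virtual-node count are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing for partitioning keys across servers: each
    node owns several virtual points on a ring, and a key maps to the
    first node clockwise from its hash. Adding or removing a node only
    remaps the keys near its points, not the whole keyspace."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring point at or after the key's hash, wrapping around
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[i][1]
```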
refer: http://blog.gainlo.co/index.php/2016/06/14/design-a-key-value-store-part-i/
redis http://www.slideshare.net/dvirsky/introduction-to-redis
http://key-value-stories.blogspot.com/2015/02/memcached-internals-design.html
https://groups.google.com/forum/#!topic/memcached/BA67vpuGQYU
design a short url
use cases:
1. Convert: convert_to_62based(md5(url) + random())
2. Redirect
3. Delete?
4. Premium / self-defined key
5. UI for conversion, or an API?
6. URL expiration?
7. Show ads before redirect?
8. Scale and performance
100 requests/second: 10 converts, 90 redirects
one year adds 365*24*3600*10 ≈ 315M entries
each entry ~500 bytes on disk, that's ~160 GB; 5 years ≈ 1 TB?
9. more details of implementation - key-value store: Cassandra (https://en.wikipedia.org/wiki/Apache_Cassandra), MySQL or plain text file, memory cache, etc
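The convert_to_62based(md5(url)+random()) idea from use case 1 can be sketched as below; the 7-character key length and the 32-bit random mix-in are assumptions.

```python
import hashlib
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n):
    """Encode a non-negative integer in base 62."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def shorten(url, length=7):
    """Hash the url, mix in randomness so a collision can be retried
    with a fresh key, and keep a fixed-length base-62 prefix
    (62^7 ≈ 3.5e12 possible keys)."""
    digest = hashlib.md5(url.encode()).hexdigest()
    n = int(digest, 16) + random.getrandbits(32)
    return to_base62(n)[:length]
```

On collision the store would reject the key and `shorten` would simply be called again, drawing new randomness.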
reference:
http://www.hiredintech.com/system-design/
https://developers.google.com/url-shortener/v1/getting_started