prettify

Jul 6, 2016

Design Twitter trends



1. trending topic detection
-simple counting: tokennization(n-gram, dictionary based, noun-phares, hashtag), sliding window, remove stop words, compare with historical statistics,
-chi-square test
-topic specific models


2. event detection
- online clsustering
- online clustering using MinHash

3. Natural language processing for tweets

Reference

Trend Detection In Twitter Social Data https://www.youtube.com/watch?v=duHxpSTmwW0
slides http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/Kostas_Trends_Sept_13_2012.pdf

 Predicting the Present with Google Trends http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf


Jul 5, 2016

USB设备的枚举过程

 USB设备的枚举过程?
(1) Get Device Descriptor。主机的第一个命令要求得到设备描述符,此SETUP 包为个字节数据(8006000100004000),发向地址0,端口0“40”表示返回数据长度最大为40H 个字节。实际上,只返回一个包,即数组DEV_DESC[ ]中的前个字节,用于说明设备的描述符的真实长度和设备的类型。
(2) Set Address。接着是设置设备地址处理事件,主机发送一个含有指定地址的数据包(0005020000000000),在主机只有一个USB 设备的时候,这个地址一般会是2,最大地址127USB 协议中可以连接127 个设备。设置地址事件处理结束后,设备进入地址状态,主机以后会在新的指定地址处访问设备。
(3) Get Device Descriptor。主机再次发送请求得到设备描述符的数据包(8006000100001200),与上次不同的是,要求的数据的长度是实际的数据长度,同时是发送到Set Address命令所设置的地址。
(4) 读取全部Configuration Descriptor。接着主机要求得到设备全部的配置描述符、接口描述符和节点描述符(8006000200004000),由于主机不知道设备描述符的真实长度,因此它要求得到64个字节。
(5) Set Interface,主机发送数据包(010B000000000000),设置接口值为0
(6) Set Conifguration,确定USB设备工作在哪一个配置下。对于U盘设备来说,一般只有1个配置值,其值为01。主机发送数据包(0009010000000000)。
(7) 如果以上步骤都正确,主机将找到新设备,并且配置成功,该设备可以正常使用,可以进行后续的U盘枚举过程了。
(8) busHound观察计算机对于U盘的枚举过程,发现上述步骤后还有一个GetMaxLun的操作,但是实际上对于U盘来说忽略该步骤也没有问题

Design web crawler

1. single machine single thread

- start from a pool of URLs
- issue HTTP get the URLs
- parse the URL get and identify  URLs we want to crawl further
- add the new URLs into pool and keep crawling / no duplicate - bloom filter


2. crawling policy
- how often to retrieve the crawled page?
- robot.txt

3. de-dup
bloom filter

4. parse 

5. dns - bottle neck



Reference:
http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
Design and Implementation of a High-Performance Distributed Web Crawler

Jul 4, 2016

Design random id generator

requirements:
- id limited to 64 bits
- id is increased by date

single machine
- integer? hard to scale to multiple server case

one ticket server
- bottle neck

multiple server
-each server with a id / o ruse mac address
-guid: random or mac+timestamp
- twitter's snowflake

clock  synchronization
-NTP


reference:
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems

flickr's ticket server:  http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

design a cache system

1. LRU cache (leas recent used)
If resource A is request by client
- if A is in cache, return
- if A is not in cache and the cache still have empty slot, read A from disk to cache and return
- if A is not in cache and cache is full, evict the least recent used slot, read A from disk to cache and return
Implement of LRU: cache and double linked list

2. Eviction policy
- random replacement (in ARM architecture)
- LRU
- least frequently used (keep a count for each entry)
- windowed LFU

3. Concurrency: 2 clients want write to the same entry
- lock / section lock
- commit log

4. Distributed cache:
value saves reference to other machine

reference
Memcached  http://www.slideshare.net/oemebamo/introduction-to-memcached
https://en.wikipedia.org/wiki/Cache_algorithms

Design facebook

Design the Facebook news seed function
Reference:

Design the Facebook timeline function
Reference:
Design the Facebook chat function
Reference:

design a key-value store

1. basic key-value: hash table in memory on single machine. while memory is not enough, a.compress data b.store reference to external/disk file

2. distributed key-value storage
-partition data / sharding
-balance loading
-system availability - replica
-consistency - the commit log, or coordinator holds the  latest copy, or cordinator resolve the conflicts on the flyby reading multiple servers
- reading performance: use cache  (refer to design a cache system)

refer: http://blog.gainlo.co/index.php/2016/06/14/design-a-key-value-store-part-i/
redis  http://www.slideshare.net/dvirsky/introduction-to-redis

http://key-value-stories.blogspot.com/2015/02/memcached-internals-design.html
https://groups.google.com/forum/#!topic/memcached/BA67vpuGQYU



 



design a short url


use cases: 
1.Convert   convert_to_62based(md5(url)+random())
2.Redirect 
3.Delete? 
4.Premium/self defined key
5.UI for convert or API? 
6.URL expire? 
7. Show ads before redirect?
8.scale and performance 
100 requests/seconds, 10 convert, 90 redirect
1year new added  365*24*3600*10= 315M entries
each item 500bytes disk. that's 160G bytes. 5year - 1T?
9. more details of implementation - key-value store: Cassandra (https://en.wikipedia.org/wiki/Apache_Cassandra) mySQL or plain text file, memory cache, etc


reference:
http://www.hiredintech.com/system-design/

https://developers.google.com/url-shortener/v1/getting_started