Jul 6, 2016
Design Twitter trends
1. trending topic detection
- simple counting: tokenization (n-grams, dictionary-based, noun phrases, hashtags), sliding window, remove stop words, compare with historical statistics
- chi-square test
- topic-specific models
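The simple-counting approach can be sketched as follows. This is a minimal sketch, not a production detector: the thresholds, the add-one smoothing, and the flat token stream are all assumptions.

```python
from collections import Counter

def detect_trends(window_tokens, history_counts, history_total,
                  min_ratio=5.0, min_count=10):
    """Flag tokens whose frequency in the current sliding window far
    exceeds their historical baseline frequency (simple counting)."""
    window_counts = Counter(window_tokens)
    window_total = sum(window_counts.values())
    trends = []
    for token, count in window_counts.items():
        if count < min_count:
            continue  # too rare in this window to be a reliable trend
        # add-one smoothing so tokens unseen in history don't divide by zero
        baseline = (history_counts.get(token, 0) + 1) / (history_total + 1)
        current = count / window_total
        if current / baseline >= min_ratio:
            trends.append((token, current / baseline))
    return sorted(trends, key=lambda t: -t[1])
```

A chi-square test would replace the plain ratio with a significance statistic, which is less sensitive to low counts.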
2. event detection
- online clustering
- online clustering using MinHash
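A minimal sketch of MinHash-based online clustering. The hash count and the exact-signature bucketing are simplifying assumptions; real systems band the signatures (LSH) so that merely similar, not identical, token sets collide.

```python
import hashlib

def minhash_signature(tokens, num_hashes=8):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the token set. The probability two
    sets agree on one coordinate equals their Jaccard similarity."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes))

def online_cluster(tweets, num_hashes=8):
    """Assign each tweet to the cluster keyed by its MinHash signature;
    tweets with identical token sets land in the same bucket."""
    clusters = {}
    for tweet in tweets:
        key = minhash_signature(set(tweet.lower().split()), num_hashes)
        clusters.setdefault(key, []).append(tweet)
    return list(clusters.values())
```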
3. Natural language processing for tweets
Reference
Trend Detection In Twitter Social Data https://www.youtube.com/watch?v=duHxpSTmwW0
slides http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/Kostas_Trends_Sept_13_2012.pdf
Predicting the Present with Google Trends http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf
Jul 5, 2016
The USB device enumeration process
What are the steps of USB device enumeration?
(1) Get Device Descriptor. The host's first command requests the device descriptor. The SETUP packet is 8 bytes (80,06,00,01,00,00,40,00), sent to address 0, endpoint 0. The "40" means the returned data is at most 40H bytes long. In practice only one packet is returned, i.e. the first 8 bytes of the DEV_DESC[] array, which tell the host the true length of the device descriptor and the device type.
(2) Set Address. Next the host assigns the device an address by sending a packet containing the chosen address (00,05,02,00,00,00,00,00). When only one USB device is attached this address is usually 2; the maximum address is 127, so up to 127 devices can be connected under the USB protocol. Once this completes the device enters the addressed state, and from then on the host accesses it at the newly assigned address.
(3) Get Device Descriptor. The host requests the device descriptor again (80,06,00,01,00,00,12,00). Unlike the first request, the requested length (12H = 18 bytes) is the actual descriptor length, and the request goes to the address assigned by Set Address.
(4) Read the full Configuration Descriptor. The host then requests the device's complete configuration, interface, and endpoint descriptors (80,06,00,02,00,00,40,00). Since it does not yet know the true total length of the configuration descriptor, it asks for 64 bytes.
(5) Set Interface. The host sends (01,0B,00,00,00,00,00,00), setting the interface value to 0.
(6) Set Configuration. This selects which configuration the device operates under. A USB flash drive typically has a single configuration, with value 01; the host sends (00,09,01,00,00,00,00,00).
(7) If all the steps above succeed, the host has found and configured the new device; it can be used normally, and the flash-drive-specific enumeration can continue.
(8) Watching a PC enumerate a flash drive with Bus Hound shows one more request after these steps, Get Max LUN, but in practice a flash drive works fine even if that step is skipped.
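The 8-byte SETUP packets quoted above all follow the standard USB layout (bmRequestType, bRequest, wValue, wIndex, wLength, with multi-byte fields little-endian). A small sketch to decode them:

```python
import struct

def parse_setup_packet(data):
    """Decode an 8-byte USB SETUP packet: one byte each for
    bmRequestType and bRequest, then three little-endian 16-bit
    fields wValue, wIndex, wLength."""
    bmRequestType, bRequest, wValue, wIndex, wLength = struct.unpack("<BBHHH", data)
    return {"bmRequestType": bmRequestType, "bRequest": bRequest,
            "wValue": wValue, "wIndex": wIndex, "wLength": wLength}
```

Decoding the first packet (80,06,00,01,00,00,40,00) yields bRequest 06 (GET_DESCRIPTOR), wValue 0100H (descriptor type 1 = device), and wLength 40H, matching step (1).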
Design web crawler
1. single machine, single thread
- start from a pool of seed URLs
- issue an HTTP GET for each URL
- parse the response and extract the URLs we want to crawl further
- add the new URLs to the pool and keep crawling; avoid duplicates with a Bloom filter
2. crawling policy
- how often to retrieve the crawled page?
- robots.txt
3. de-dup
bloom filter
4. parse
5. DNS resolution - bottleneck
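The single-machine loop above, with a hand-rolled Bloom filter for de-dup, might look like the sketch below. The `fetch` callback (standing in for an HTTP GET) and the crude href regex are illustrative assumptions.

```python
import hashlib
import re
from collections import deque
from urllib.parse import urljoin

class BloomFilter:
    """Minimal Bloom filter: k hashed positions in an m-bit array.
    False positives are possible, false negatives are not."""
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)
    def _positions(self, item):
        for seed in range(self.k):
            h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def crawl(seed_urls, fetch, max_pages=100):
    """Single-threaded BFS crawl: pop a URL, fetch it, extract links,
    and enqueue unseen ones (membership tracked by the Bloom filter)."""
    seen, frontier, pages = BloomFilter(), deque(seed_urls), {}
    for u in seed_urls:
        seen.add(u)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        pages[url] = html
        for link in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, link)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```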
Reference:
http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
Design and Implementation of a High-Performance Distributed Web Crawler
Jul 4, 2016
Design random id generator
requirements:
- id limited to 64 bits
- ids increase over time (roughly sortable by date)
single machine
- auto-increment integer? hard to scale to the multi-server case
one ticket server
- bottleneck / single point of failure
multiple servers
- each server has its own id, or use the MAC address
- GUID: random, or MAC + timestamp
- Twitter's Snowflake
clock synchronization
-NTP
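Twitter's Snowflake layout (41-bit millisecond timestamp, 10-bit machine id, 12-bit sequence) can be sketched as below. This single-threaded version omits the locking and clock-rollback handling a real generator needs.

```python
import time

class Snowflake:
    """Sketch of a Snowflake-style 64-bit id:
    41 bits ms timestamp | 10 bits machine id | 12 bits sequence.
    Ids are roughly time-ordered and unique per machine with no
    cross-machine coordination."""
    def __init__(self, machine_id, epoch_ms=1288834974657):
        assert 0 <= machine_id < 1024
        self.machine_id, self.epoch_ms = machine_id, epoch_ms
        self.last_ms, self.seq = -1, 0

    def next_id(self):
        ms = int(time.time() * 1000) - self.epoch_ms
        if ms == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 4096 ids per ms per machine
            if self.seq == 0:  # sequence exhausted: spin to the next ms
                while ms <= self.last_ms:
                    ms = int(time.time() * 1000) - self.epoch_ms
        else:
            self.seq = 0
        self.last_ms = ms
        return (ms << 22) | (self.machine_id << 12) | self.seq
```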
reference:
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
flickr's ticket server: http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
design a cache system
1. LRU cache (least recently used)
If resource A is requested by a client:
- if A is in the cache, return it
- if A is not in the cache and the cache still has an empty slot, read A from disk into the cache and return it
- if A is not in the cache and the cache is full, evict the least recently used entry, read A from disk into the cache, and return it
Implementation of LRU: a hash table plus a doubly linked list
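A compact sketch of the cache behavior described above. Python's OrderedDict is itself a hash map threaded through a doubly linked list, so it stands in for the hash-table-plus-linked-list implementation; the `load` callback modeling the disk read is an assumption.

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache over an OrderedDict (hash map + doubly linked list)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, load=None):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        if load is None:
            return None
        value = load(key)  # cache miss: read from the backing store (disk)
        self.put(key, value)
        return value

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```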
2. Eviction policy
- random replacement (in ARM architecture)
- LRU
- least frequently used (keep a count for each entry)
- windowed LFU
3. Concurrency: two clients want to write to the same entry
- lock / per-segment (striped) lock
- commit log
4. Distributed cache:
the value stores a reference to the machine that holds the data
reference
Memcached http://www.slideshare.net/oemebamo/introduction-to-memcached
https://en.wikipedia.org/wiki/Cache_algorithms
Design facebook
Design the Facebook news seed function
Reference:
Design the Facebook timeline function
Reference:
Design the Facebook chat function
Reference:
- Erlang at Facebook
- Facebook Chat
- http://blog.gainlo.co/index.php/2016/04/19/design-facebook-chat-function/
design a key-value store
1. basic key-value: an in-memory hash table on a single machine. When memory is not enough: (a) compress the data, (b) store a reference to an external/disk file
2. distributed key-value storage
-partition data / sharding
-balance loading
-system availability - replica
- consistency - a commit log, or the coordinator holds the latest copy, or the coordinator resolves conflicts on the fly by reading from multiple servers
- reading performance: use cache (refer to design a cache system)
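One common way to do the partitioning/sharding step is consistent hashing, sketched below; the node names and virtual-node count are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing for partitioning keys across servers: each
    node owns several virtual points on a ring, and a key maps to the
    first node clockwise from its hash. Adding or removing a node only
    remaps the keys near its points, not the whole keyspace."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring point at or after the key's hash, wrapping around
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[i][1]
```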
refer: http://blog.gainlo.co/index.php/2016/06/14/design-a-key-value-store-part-i/
redis http://www.slideshare.net/dvirsky/introduction-to-redis
http://key-value-stories.blogspot.com/2015/02/memcached-internals-design.html
https://groups.google.com/forum/#!topic/memcached/BA67vpuGQYU
design a short url
use cases:
1. Convert: convert_to_62based(md5(url) + random())
2. Redirect
3. Delete?
4. Premium / self-defined key
5. UI for conversion, or an API?
6. URL expiration?
7. Show ads before redirect?
8. Scale and performance
100 requests/second: 10 converts, 90 redirects
one year adds 365*24*3600*10 ≈ 315M entries
each entry ~500 bytes on disk, that's ~160 GB; 5 years ≈ 1 TB?
9. more details of implementation - key-value store: Cassandra (https://en.wikipedia.org/wiki/Apache_Cassandra), MySQL or plain text file, memory cache, etc
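The convert_to_62based(md5(url)+random()) idea from use case 1 can be sketched as below; the 7-character key length and the 32-bit random mix-in are assumptions.

```python
import hashlib
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n):
    """Encode a non-negative integer in base 62."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def shorten(url, length=7):
    """Hash the url, mix in randomness so a collision can be retried
    with a fresh key, and keep a fixed-length base-62 prefix
    (62^7 ≈ 3.5e12 possible keys)."""
    digest = hashlib.md5(url.encode()).hexdigest()
    n = int(digest, 16) + random.getrandbits(32)
    return to_base62(n)[:length]
```

On collision the store would reject the key and `shorten` would simply be called again, drawing new randomness.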
reference:
http://www.hiredintech.com/system-design/
https://developers.google.com/url-shortener/v1/getting_started