System Design Interview

Design Review

  • How to crush 99% of the massive-data questions!!! http://blog.csdn.net/v_july_v/article/details/7382693
  • Ten massive-data questions: http://blog.csdn.net/v_july_v/article/details/6279498
    • In short: use a hash function to split the data into small files, process each file, then merge the results
    • Pay attention to:
      • bloom filter
        • k hash functions map each key into a bit vector
        • fast and space-efficient, but has false positives and does not support deletion (see the sketch after this list)
      • hashing
      • bit-map
      • heap
      • external sort
      • trie
      • mapreduce
  • Quora on preparing for system design questions http://www.quora.com/Job-Interviews/How-should-I-prepare-system-design-questions-for-Google-Facebook-Interview
  • A very insightful series of blog posts!! link
    • Focus on the dissections of the three GFT companies (Google, Facebook, Twitter)!!!
    • "The art of scaling up" also discusses distributed systems
    • Caching popular authors
  • A very good summary!!! link
  • Facebook design summary!! Actually also a very, very good general summary link
    • Memcache (not memcached, the open-source in-memory cache)
    • Scalable Web Architecture and Distributed Systems
      • For a site like Flickr, request time matters more than upload time, so optimization should focus on the read side rather than the upload side
  • Flickr architecture link
  • Pinterest architecture link
  • Advice on MITBBS about preparation beyond grinding problems link
  • Walk through the complete implementation of a URL shortener
  • Dissecting Twitter http://blog.sina.com.cn/s/blog_46d0a3930100fai8.html
  • How Google Calendar is implemented
  • Google system design question: http://www.mitbbs.com/article_t/JobHunting/32562595.html
    • How to implement Facebook-style activity broadcast.
    • After a friend f1 posts an update, all of their friends should see it.
    • Constraints:
    • 1. The number of friends is extremely large.
    • 2. Some friends are online and some are not; those offline must see the update after they log in.
    • 3. How to handle limited storage space.
  • How to build a chat room http://book.mixu.net/node/ch3.html
  • Terasort link
    • The bottleneck of ordinary merge sort is that the merging is sequential
    • TeraSort range-partitions the data so that every key in block i sorts before every key in block i + 1; each mapper's block is sorted independently and the results can simply be concatenated
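
A minimal sketch of the Bloom filter mentioned above (k hash functions mapping into a bit vector). The sizes and the salted-SHA-1 hashing are illustrative assumptions, not a production design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hashes map each key into an m-bit vector.
    False positives are possible; deletion is not supported."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k positions from salted SHA-1 digests (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("http://example.com/a")
print(bf.might_contain("http://example.com/a"))  # True
print(bf.might_contain("http://example.com/b"))  # almost certainly False
```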

Keywords for Design

  • sharding (partition)
  • An excellent explanation of the difference between sharding and replication! link
  • load balancer
  • master-slave replication
  • consistent hashing
  • bloom filter
  • Kafka vs RabbitMQ
    • both are message queues / message brokers
  • CDN
  • peak throughput

Hired in Tech – System Design link

  • Process
    • Step 1 – Constraints and Use Cases (5 min!)
      • Very first thing to do when faced with the problem: clarify the constraints and use case.
      • Don’t panic about the correctness, write some numbers down!
      • Write on board: the use cases like
        • (Basic features)
        1. take a url => return a shorter one (shortening)
        2. take a short url => return the original (redirection; see the base62 sketch at the end of this section)
        3. (ask some more augmented features?) 
        4. customized url?
        5. analytics?
        6. automatic link expiration?
        7. UI? API?
      • Write the constraints on the board: traffic and data
        1. scale down to seconds! Estimate the traffic and data
        2. 100M new urls per month
        3. What can be cached?? 10% from shortening and 90% from redirection
        4. how long is url?
      • Always compute the totals per year or per month first, then scale down to per second!
      • Read heavy? Write heavy?
    • Step 2 – Abstract Design
      • Draw a sketch; don't worry about the details! Get the basic features in place first!
      • Application service layer and data storage layer
    • Step 3 – Understand Bottlenecks
      • Need a load balancer? Data is so huge that need to distribute the database?
      • Talk about the tradeoff!!!
      • This step builds on Steps 1 and 2: for example, the data-per-second estimate from Step 1 can be used to judge whether the database will hold up, or whether scanning the entire database on every request would be too slow. Bottlenecks identified!!!
    • Step 4 – Actual Design of a Scalable System!
      • (next section)
      • Whenever the interviewer challenges an architectural choice, acknowledge that an idea is rarely perfect, and outline the advantages and disadvantages of your choice
      • Adding a load balancer is not only to spread traffic, but also to deal with spiky traffic!! Spin up extra servers for the spike and shut them down when the traffic goes back to normal.
      • Don’t forget to mention that we need to benchmark, profile, and load test the architecture
  • Fundamentals
    • (Scalability @Harvard)
    • (Scalability for Dummies)
    • (Database sharding)
    • and don’t forget about monitoring
    • sometimes there will be flash traffic when many people want to watch/read the same thing
  • Examples
    • (highscalability blog)
  • Wrap-up
  • Practice
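
To make the shortening/redirection use cases from Step 1 concrete, here is a minimal base62 sketch. It assumes each long URL gets an auto-incrementing integer ID; the alphabet and the in-memory store are illustrative stand-ins for the real database.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n):
    """Encode an integer ID as a base62 string (the short path)."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def decode(s):
    """Decode a base62 string back to the integer ID."""
    n = 0
    for c in s:
        n = n * 62 + ALPHABET.index(c)
    return n

# Shortening: store the long URL under a new ID and return the encoded ID.
# Redirection: decode the short path and look the ID up.
db = {}            # illustrative in-memory stand-in for the real store
next_id = 12345
db[next_id] = "https://example.com/some/very/long/path"
short = encode(next_id)   # "3d7"
assert db[decode(short)] == "https://example.com/some/very/long/path"
```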

Scalability @Harvard

  • Vertical scaling
    • CPU cores, RAM, disks, etc.
    • why not a full solution? Real-world constraints…
    • Upgrading drives – improving the IO of databases
    • RAID
  • Horizontal scaling
    • Use more, cheaper hardware: multiple servers
    • problem of CONSISTENCY, SYNCHRONIZATION
    • Use a load balancer to answer client requests, i.e. expose a single public IP to the outside; all servers use internal IPs, and client requests are distributed by the load balancer!
  • Load balancing
    • Round-robin the requests across the servers
    • Implement load balancer:
      • Software: HAProxy etc.
      • Hardware: ~$100K for a decent one…
    • Use multiple load balancers to avoid single node failure at the load balancer
    • (from HiredInTech) Adding a load balancer is not only to spread traffic, but also to deal with spiky traffic!! Shut the extra servers down when traffic goes back to normal. Also, when one server is down, use the others (reliability)
  • Caching
    • static .html – redundant, no templates; every page must repeat <header> … so changes mean lots of find and replace
    • MySQL query cache
    • memcached 
      • memory cache, which is also a server
      • if read-heavy, then this is very efficient!
  • DB replication
    • master-slave: benefits read-heavy websites (see the routing sketch after this list)
    • master-master, 
    • the load balancer itself can also become a single point of failure
    • use heartbeats to detect liveness
  • DB partitioning
    • different multiple servers: mit.thefacebook.com, harvard.thefacebook.com
    • A-H to server 1, I-Z to server 2, etc.
  • There is a worked example at the end, which is quite good~
    • Sync between data centers?
      • load balancing at the DNS level!
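
A minimal sketch of the master-slave routing mentioned above: writes go to the master, reads are spread round-robin across slaves. The host names are illustrative, and replication lag and failover are ignored.

```python
import itertools

class ReplicatedDB:
    """Route writes to the master and reads round-robin across slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = itertools.cycle(slaves)

    def write(self, sql):
        return self._run(self.master, sql)

    def read(self, sql):
        return self._run(next(self.slaves), sql)

    def _run(self, host, sql):
        # Stand-in for a real connection pool / driver call.
        print(f"[{host}] {sql}")

db = ReplicatedDB("db-master:3306", ["db-slave1:3306", "db-slave2:3306"])
db.write("INSERT INTO users VALUES (1, 'alice')")
db.read("SELECT * FROM users WHERE id = 1")
```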

Scalability for Dummies link

  • part 1, replicas
  • part 2, database can become the bottleneck. Switch to NoSQL.
  • part 3, caching – in-memory caching like Redis, not file-based caching!
    • cache queries (but hard to invalidate)
    • cache objects, makes asynchronous processing possible
  • part 4, asynchronism
    • async 1: cache the dynamic content, pre-rendered HTML
    • async 2: (computing-intensive tasks) pass to backend, constantly check if the job is done

Database Sharding

  • A way of partitioning data: instead of relying mainly on replicas, each cheap server serves a subset of the users
    • high availability
    • faster queries
    • more write bandwidth (parallel writing!)
    • no replica (???)

Summary of Approaches to System Design Interview Questions link

  • Most questions can be handled with Google's "troika": MapReduce, GFS, BigTable
  • Before the troika, the common architecture was: database + sharding + cache
  • Virtual node technique
    • Virtual node technique: commonly used for distributed data sharding. The scenario: a huge pile of data (maybe TB or PB scale) must be sharded by some key across dozens (or more) of machines, while keeping the load balanced and the system easy to scale. The traditional approach is Hash(key) mod N, and its biggest drawback is that it does not scale: adding or removing a machine forces a full redistribution of the data, which is extremely expensive. Hence the newer techniques: one is the DHT mentioned above, already adopted by many large systems; another is an improvement on "Hash(key) mod N". Suppose the data must be spread over 20 machines. The traditional way is hash(key) mod 20; in the improved version, N is chosen to be much larger than 20, say 20,000,000, and an extra table records the range of hash values each node stores, e.g. node1: 0~1,000,000; node2: 1,000,001~2,000,000; and so on. When a new node is added, we only need to move part of the data from each node to the new node and update this table (a minimal sketch follows this list)
  • Q: design a system with higher write throughput -> BigTable: concurrent writes go into an in-memory cache (random writes), and when the cache is full it is flushed to disk (sequential writes, fast!)
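
A minimal sketch of the virtual-node idea described above: hash into a key space much larger than the number of machines and keep a small routing table from hash ranges to nodes, so that adding a node only moves one range. The range boundaries and node names are illustrative assumptions.

```python
import bisect
import hashlib

M = 20_000_000  # hash space much larger than the number of machines

# Routing table: sorted upper bounds of each range -> owning node.
bounds = [1_000_000, 2_000_000, M]          # illustrative ranges
nodes  = ["node1",   "node2",   "node3"]

def owner(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % M
    return nodes[bisect.bisect_left(bounds, h)]

print(owner("user:42"))

# Adding a node: split one range and update the table; only the keys in the
# split range move, instead of reshuffling everything as with "mod N".
bounds.insert(2, 10_000_000)
nodes.insert(2, "node4")
```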

High Scalability

  • YouTube architecture
    • Most popular content is moved to CDN
  • 8 important scalable design patterns link
  • What is your scalability plan link
    • Well written! It proceeds step by step
  • The coffee shop analogy, quite good! link

Company Core Technology Review

  • Must-read papers
    • Google: Google File System, MapReduce, Bigtable
      • Bigtable
        • similar to a distributed key-value store
        • the key is <row, column, timestamp>
        • the tablet is the smallest unit
    • Facebook: Cassandra
    • Amazon: Dynamo
  • Pay attention to distributed data stores
    • Non-relational, therefore quick

Google

  • Web crawling and indexes link
    • the crawler should be distributed
    • robots.txt
    • DNS resolution is a well-known bottleneck in web crawling
    • during the BFS, use a priority queue keyed by weight
  • DESIGN
    • A big collection!!!
    • Distributed crawler link
    • autocomplete
      • trie
      • HMM?
      • popularity
    • Read-write lock
      • read-write lock, also known as shared lock and exclusive lock
      • difference between lock() and trylock(): one blocks, the other does not
      • this article is well written link
      • two interview questions link link
      • watch out for writer starvation
      • producer consumer link
      • implementing a semaphore with a mutex and condition variable link
    • distributed kv storage (DHT B+ tree)
    • distributed lock
      • actually still based on Paxos
    • distributed hash 
      • Basic introduction to distributed hashing link link
      • Consistent hashing link
        • The drawback of plain modulo hashing is that removing or adding machines forces a massive data migration! (see the consistent hashing sketch after this list)
      • GFS is a metadata-based implementation! Read it!! -> single point of failure -> consistent hashing
  • This High Scalability article link
  • how to implement online booking link
  • Shopping cart link
    • A good explanation of how cookies and sessions are implemented and how they relate to the database!
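
A minimal consistent hashing sketch for the items above: nodes and keys hash onto the same ring, each key is served by the first node clockwise, and adding or removing a node only remaps the keys in its arc. The virtual-node count is an illustrative assumption.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node gets many virtual positions for smoother balance.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def get(self, key):
        # First virtual node clockwise from the key's position.
        pos = self._hash(key)
        idx = bisect.bisect_right(self.ring, (pos, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache1", "cache2", "cache3"])
print(ring.get("user:42"))
ring.add_node("cache4")   # only roughly 1/4 of the keys move to the new node
```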

Facebook

  • Zuck’s law
  • This article!!!! http://www.mitbbs.com/article_t/JobHunting/32741713.html
  • Hiphop compiler
Facebook Stats

Facebook Infrastructure


Facebook feeds 

  • http://readme.skplanet.com/wp-content/uploads/2012/11/0-3_Facebook1.pdf explains it very well
  • A few articles on the system
    • A general discussion of feeds!!!! link
    • The one on Stack Overflow link
    • The links inside this one are all very good!!!
      • push model for people who don’t have many followers
      • pull model for the other way around (see the fan-out sketch after this list)
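
A minimal sketch of the push vs. pull feed models above: fan-out on write pushes a new post into each follower's feed, while fan-out on read merges the followees' recent posts at request time. The in-memory stores are illustrative stand-ins.

```python
from collections import defaultdict
import heapq

posts_by_user = defaultdict(list)   # author -> [(ts, post)]
feeds = defaultdict(list)           # user   -> [(ts, post)] (push model)
followers = defaultdict(set)        # author -> followers
following = defaultdict(set)        # user   -> followees

def publish_push(author, ts, post):
    """Fan-out on write: good when the author has few followers."""
    posts_by_user[author].append((ts, post))
    for f in followers[author]:
        feeds[f].append((ts, post))

def read_feed_pull(user, limit=10):
    """Fan-out on read: merge followees' posts on demand; good for celebrities."""
    merged = heapq.merge(*(reversed(posts_by_user[a]) for a in following[user]),
                         reverse=True)
    return [post for _, post in list(merged)[:limit]]

followers["alice"] = {"bob"}
following["bob"] = {"alice"}
publish_push("alice", 1, "first pin!")
print(feeds["bob"])            # push: feed already materialized
print(read_feed_pull("bob"))   # pull: merged at read time
```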

Tech Talks

  • “Facebook is not about actions, but interactions”

 

Building applications is not a hard thing, but having the vision of the overall architecture really makes a difference. We have to crack the system interview sooner or later in our career as a software engineer or engineering manager.

Interviewers are looking for future teammates whom they would like to work with. These future teammates are expected to be, at the very least, capable of solving problems independently. There are many possible solutions to any given problem, but not all of them are suitable in a given context, so the interviewee has to lay out the different choices and their tradeoffs. To summarize, the system design interview is a happy show-off of our knowledge of technologies and their tradeoffs. Ideally, keep talking about what the interviewer expects throughout the interview, before they even have to ask.

Keeping the conversation going for 45 minutes can be easy, as long as we are armed with the following four steps and three common topics. Take “design Pinterest” as an example.

1. Four steps to Crack the System Design Interview

Breaking down a complex task into small chunks helps us handle the problem at a better pace and in a more actionable way.

1.1 Step 1. Clarify Requirements and Specs

First things first, the ultimate goals should always be clear.

Pinterest is a highly scalable photo-sharing service:

  • features: user profile, upload photos, news feed, etc.
  • scaling out: horizontal scalability and micro services.

1.2 Step 2. Sketch Out High Level Design

Do not dive into details before outlining the big picture. Otherwise, going too far in the wrong direction makes it harder to come back and provide even a roughly correct solution, and we will regret wasting time on irrelevant details when time runs out.

OK, let us sketch out the following diagram without concerning too much about the implementation detail of these components.

(diagram: Pinterest architecture)

So far so good! To some extent, congrats, we have solved the problem!

1.3 Step 3. Discuss individual components and how they interact in detail

When we truly understand a system, we should be able to identify what each component is and explain how the components interact with one another. Take the components in the diagram above and specify them one by one. This can lead both to more general discussions, such as the three common topics in Section 2, and to more specific domains, like how to design the photo storage data layout.

1.3.1 Load Balancer

Generally speaking, load balancers fall into three categories:

  • DNS Round Robin (rarely used): clients get a randomly-ordered list of IP addresses.
    • pros: easy to implement and free
    • cons: hard to control and not responsive, since DNS cache needs time to expire
  • L3/L4 Load Balancer: traffic is routed by IP address and port. L3 is the network layer (IP); L4 is the transport layer (TCP).
    • pros: better granularity, simple, responsive
  • L7 Load Balancer: traffic is routed by what is inside the HTTP protocol. L7 is application layer (HTTP).

It is good enough to talk in this level of detail on this topic, but in case the interviewer wants more, we can suggest exact algorithms like round robin, weighted round robin, least loaded, least loaded with slow start, utilization limit, latency, cascade, etc.
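
As a concrete illustration of two of those algorithms, here is a minimal sketch of round robin and (randomized) weighted round robin selection; the backend addresses and weights are illustrative assumptions.

```python
import itertools
import random

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
weights  = [5, 3, 1]                 # e.g. proportional to capacity

# Round robin: cycle through the backends in order.
rr = itertools.cycle(backends)
def pick_round_robin():
    return next(rr)

# Weighted round robin (randomized form): pick proportionally to weight.
def pick_weighted():
    return random.choices(backends, weights=weights, k=1)[0]

print([pick_round_robin() for _ in range(4)])  # first, second, third, first again
print(pick_weighted())
```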

1.3.2 Reverse Proxy

Reverse proxy, like varnish, centralizes internal services and provides unified interfaces to the public. For example, www.example.com/index and www.example.com/sports appear to come from the same domain, but in fact they are from different micro services behind the reverse proxy. Reverse proxy could also help with caching and load balancing.

1.3.3 (Frontend) Web Tier

This is where web pages are served, and usually combined with the service / backend tier in the very early stage of a web service.

Stateless

There are two major bottlenecks of the whole system – requests per second (rps) and bandwidth. We can improve the situation by using a more efficient tech stack, such as frameworks with an async, non-blocking reactor pattern, and by enhancing the hardware, either scaling up (aka vertical scaling) or scaling out (aka horizontal scaling).

Internet companies prefer scaling out, since it is more cost-efficient with a huge number of commodity machines. This is also good for recruiting, because the required skillset is one most engineers already work with. After all, people rarely play with supercomputers or mainframes at home.

Frontend web tier and service tier must be stateless in order to add or remove hosts conveniently, thus achieving horizontal scalability. As for feature switch or configs, there could be a database table / standalone service to keep those states.

Web Application and API

MVC (MVT) or MVVM is the dominant pattern for this layer. Traditionally, the view or template is rendered to HTML by the server at runtime. In the age of mobile computing, the view can be as simple as serving the minimal package of data that is transported to mobile devices, which is called a web API. People believe the same API can be shared by native clients and browsers. That is why single-page web applications are becoming more and more popular, especially with the assistance of frontend frameworks like react.js, angular.js, backbone.js, etc.

1.3.4 App Service Tier

The single responsibility principle advocates small and autonomous services that work together, so that each service can do one thing well and not block others. Small teams with small services can plan much more aggressively for the sake of hyper-growth.

Service Discovery

How do those services find each other? Zookeeper is a popular and centralized choice. Instances with name, address, port, etc. are registered into the path in ZooKeeper for each service. If one service does not know where to find another service, it can query Zookeeper for the location and memorize it until that location is unavailable.
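
As an illustration, here is a minimal sketch of registering and discovering a service instance in ZooKeeper using the kazoo client; the hosts, path, and payload are illustrative assumptions.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Register this instance under the service's path; ephemeral nodes vanish when
# the session ends, which signals that the instance is gone.
zk.create("/services/feed/instance-", b"10.0.0.5:8080",
          ephemeral=True, sequence=True, makepath=True)

# A client looks up live instances by listing the children of the path.
instances = zk.get_children("/services/feed")
```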

Zookeeper is a CP system in terms of CAP theorem (See Section 2.3 for more discussion), which means it stays consistent in the case of failures, but the leader of the centralized consensus will be unavailable for registering new services.

In contrast to Zookeeper, Uber is doing interesting work in a decentralized way, called Hyperbahn, based on the Ringpop consistent hash ring. Read Amazon’s Dynamo to understand AP and eventual consistency.

Micro Services

For the Pinterest case, these micro services could be user profile, follower, feed, search, spam, etc. Any of those topics could lead to an in-depth discussion. Useful links are listed in Section 3: Future Studies, to help us deal with them.

However, for a general interview question like “design Pinterest”, it is good enough to leave those services as black boxes. If we want to show more passion, elaborating with some sample endpoints / APIs for those services would be great.

1.3.5 Data Tier

Although a relational database can do almost all the storage work, please remember not to save a blob, like a photo, in a relational database, and choose the right database for each service. For example, read performance is important for the follower service, so it makes sense to use a key-value cache. Feeds are generated as time passes, so HBase / Cassandra’s timestamp index is a great fit for that use case. Users have relationships with other users or objects, so a relational database is the default choice for a user profile service.

Data and storage is a rather wide topic, and we will discuss it later in Section 2.2 Storage.

1.4 (Optional) Step 4. Back-of-the-envelope Calculation

The final step, estimating how many machines are required, is optional, because time will probably be up after all the discussions above and the three common topics below. In case we do run into this topic, we’d better be prepared as well. It is a little tricky… we need to come up with some variables and a function first, then make some guesses for the values of those variables, and finally calculate the result.

The cost is a function of CPU, RAM, storage, bandwidth, number and size of the images uploaded each day.

  • N users ≈ 10^10
  • i images uploaded per user per day ≈ 10
  • s size in bytes per image ≈ 10^6
  • v views per image ≈ 100
  • d days
  • h requests / sec per server ≈ 10^4 (bottleneck)
  • b bandwidth (bottleneck)
  • Server cost: $1000 / server
  • Storage cost: $0.1 / GB
  • Storage = N·i·s·d

Remember the two bottlenecks we mentioned in section 1.3.3 Web Tier? – requests per second (rps) and bandwidth. So the final expression would be

Number of required servers = max(N·i·v / h, N·i·v·s / b)
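
A quick worked example using the rough numbers above; d, b, and the per-day-to-per-second conversion are added assumptions, and all figures are purely illustrative.

```python
# Back-of-the-envelope sketch with the example numbers from the list above.
N = 1e10          # users
i = 10            # images uploaded per user per day
s = 1e6           # bytes per image
v = 100           # views per image
d = 365           # days retained (assumed)
h = 1e4           # requests/sec one server can handle
b = 1e9           # bytes/sec of bandwidth per server (assumed)

SECONDS_PER_DAY = 86_400          # N*i*v is a daily total, so scale to seconds

storage_bytes = N * i * s * d                      # Storage = N*i*s*d
storage_cost  = storage_bytes / 1e9 * 0.1          # $0.1 per GB

rps         = N * i * v / SECONDS_PER_DAY          # read requests per second
bytes_per_s = rps * s                              # outgoing bandwidth
servers     = max(rps / h, bytes_per_s / b)        # the max() expression above

print(f"storage ≈ {storage_bytes:.1e} bytes, cost ≈ ${storage_cost:,.0f}")
print(f"servers ≈ {servers:,.0f}")
```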

2 Three Common Topics

There are three common topics that could be applied to almost every system design question. They are extracted and summarized in this section.

2.1 Communication

How do different components interact with each other? – communication protocols.

Here is a simple comparison of those protocols.

  • UDP and TCP are both transport layer protocols. TCP is reliable and connection-based. UDP is connectionless and unreliable.
  • HTTP is in the application layer and normally TCP based, since HTTP assumes a reliable transport.
  • RPC, an application layer protocol, is an inter-process communication that allows a computer program to cause a subroutine or procedure to execute in another address space (commonly on another computer on a shared network), without the programmer explicitly coding the details for this remote interaction. That is, the programmer writes essentially the same code whether the subroutine is local to the executing program, or remote. In an Object-Oriented Programming context, RPC is also called remote invocation or remote method invocation (RMI).
Further discussions

Since RPC is super useful, some interviewers may ask how RPC works. The following picture is a brief answer.

(diagram: how RPC works)

*Stub procedure: a local procedure that marshals the procedure identifier and the arguments into a request message, and then sends it via its communication module to the server. When the reply message arrives, it unmarshals the results.
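
A toy sketch of what the stub does: marshal the procedure name and arguments into a request message, ship it through a "transport", and unmarshal the reply. JSON and an in-process function stand in for the real communication module and server.

```python
import json

def transport(request_bytes):
    """Stand-in for the network plus server-side dispatch."""
    request = json.loads(request_bytes)
    procedures = {"add": lambda a, b: a + b}        # server-side registry
    result = procedures[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode()

def add_stub(a, b):
    """Client stub: looks like a local call, but marshals and ships the args."""
    request = json.dumps({"proc": "add", "args": [a, b]}).encode()
    reply = json.loads(transport(request))
    return reply["result"]

print(add_stub(2, 3))  # 5, as if add() were a local procedure
```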

We do not have to implement our own RPC protocols. There are off-the-shelf frameworks.

  • Google Protobuf: open source; defines RPC service interfaces but ships no RPC implementation. Smaller serialized data and slightly faster; better documentation and cleaner APIs.
  • Facebook Thrift: supports more languages and richer data structures (list, set, map, etc.) that Protobuf does not support. Incomplete documentation and hard to find good examples.
    • Used by: HBase / Cassandra / Hypertable / Scribe / …
  • Apache Avro: Avro is heavily used in the hadoop ecosystem and based on dynamic schemas in Json. It features dynamic typing, untagged data, and no manually-assigned field IDs.

Generally speaking, RPC is used internally by many tech companies for performance reasons, but it is rather hard to debug and not flexible. So for public APIs we tend to use HTTP APIs, usually following the RESTful style.

  • REST (Representational state transfer of resources)
    • Best practice of HTTP API to interact with resources.
    • URL only decides the location. Headers (Accept and Content-Type, etc.) decide the representation. HTTP methods(GET/POST/PUT/DELETE) decide the state transfer.
    • minimizes the coupling between client and server (a huge amount of HTTP infrastructure, from various clients to data marshalling, already exists).
    • stateless and scaling out.
    • service partitioning feasible.
    • used for public API.
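
As a hedged illustration of this RESTful style applied to the Pinterest example, here is a minimal Flask sketch; the resource names and routes are hypothetical, not a real Pinterest API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
boards = {}   # illustrative in-memory store

# The URL identifies the resource; the HTTP method expresses the state transfer.
@app.route("/users/<user_id>/boards", methods=["GET"])
def list_boards(user_id):
    return jsonify(boards.get(user_id, []))

@app.route("/users/<user_id>/boards", methods=["POST"])
def create_board(user_id):
    board = request.get_json()
    boards.setdefault(user_id, []).append(board)
    return jsonify(board), 201

if __name__ == "__main__":
    app.run()
```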

2.2 Storage

2.2.1 Relational Database

Relational database is the default choice for most use cases, by reason of ACID (atomicity, consistency, isolation, and durability). One tricky thing is consistency – here it means that any transaction will bring the database from one valid state to another (different from the consistency in CAP, which will be discussed in Section 2.3).

Schema Design and 3rd Normal Form (3NF)

To reduce redundancy and improve consistency, people follow 3NF when designing database schemas:

  • 1NF: tabular, each row-column intersection contains only one value
  • 2NF: only the primary key determines all the attributes
  • 3NF: only the candidate keys determine all the attributes (and non-prime attributes do not depend on each other)
DB Proxy

What if we want to eliminate single point of failure? What if the dataset is too large for one single machine to hold? For MySQL, the answer is to use a DB proxy to distribute data, either by clustering or by sharding.

Clustering is a decentralized solution. Everything is automatic. Data is distributed, moved, rebalanced automatically. Nodes gossip with each other, (though it may cause group isolation).

Sharding is a centralized solution. If we get rid of properties of clustering that we don’t like, sharding is what we get. Data is distributed manually and does not move. Nodes are not aware of each other.

2.2.2 NoSQL

In a regular Internet service, the read write ratio is about 100:1 to 1000:1. However, when reading from a hard disk, a database join operation is time consuming, and 99% of the time is spent on disk seek. Not to mention a distributed join operation across networks.

To optimize the read performance, denormalization is introduced by adding redundant data or by grouping data. These four categories of NoSQL are here to help.

Key-value Store

The abstraction of a KV store is a giant hashtable/hashmap/dictionary.

The main reason we want to use a key-value cache is to reduce the latency of accessing active data: O(1) reads/writes on a fast but expensive medium (like memory or SSD), instead of the traditional O(log n) reads/writes on a slow and cheap medium (typically a hard drive).

There are three major factors to consider when we design the cache.

  1. Pattern: How to cache? is it read-through/write-through/write-around/write-back/cache-aside?
  2. Placement: Where to place the cache? client side/distinct layer/server side?
  3. Replacement: When to expire/replace the data? LRU/LFU/ARC?

Out-of-box choices: Redis or Memcached? Redis supports data persistence while Memcached does not. Others: Riak, Berkeley DB, HamsterDB, Amazon Dynamo, Project Voldemort, etc.
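
A minimal sketch of the cache-aside pattern named above: read from the cache first, fall back to the database on a miss and populate the cache, and invalidate the entry on writes. The in-memory dictionaries stand in for Redis/Memcached and the real database.

```python
cache = {}                                   # stand-in for Redis / Memcached
database = {"user:1": {"name": "alice"}}     # stand-in for the real database

def get_user(key):
    # Cache-aside read: check the cache, fall back to the DB, then populate.
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value

def update_user(key, value):
    # Write to the DB and invalidate (rather than update) the cached entry.
    database[key] = value
    cache.pop(key, None)

print(get_user("user:1"))   # miss -> DB -> cached
print(get_user("user:1"))   # hit
```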

Document Store

The abstraction of a document store is like a KV store, but documents, like XML, JSON, BSON, and so on, are stored in the value part of the pair.

The main reason we want to use a document store is for flexibility and performance. Flexibility is obtained by schemaless document, and performance is improved by breaking 3NF. Startup’s business requirements are changing from time to time. Flexible schema empowers them to move fast.

Out-of-box choices: MongoDB, CouchDB, Terrastore, OrientDB, RavenDB, etc.

Column-oriented Store

The abstraction of a column-oriented store is like a giant nested map: ColumnFamily<RowKey, Columns<Name, Value, Timestamp>>.

The main reason we want to use a column-oriented store is that it is distributed, highly-available, and optimized for write.

Out-of-box choices: Cassandra, HBase, Hypertable, Amazon SimpleDB, etc.

Graph Database

As the name indicates, this database’s abstraction is a graph. It allows us to store entities and the relationships between them.

If we use a relational database to store the graph, adding/removing relationships may involve schema changes and data movement, which is not the case when using a graph database. On the other hand, when we create tables in a relational database for the graph, we model based on the traversal we want; if the traversal changes, the data will have to change.

Out-of-box choices: Neo4J, Infinitegraph, OrientDB, FlockDB, etc.

2.3 CAP Theorem

When we design a distributed system, trading off among CAP (consistency, availability, and partition tolerance) is almost the first thing we want to consider.

  • Consistency: all nodes see the same data at the same time
  • Availability: a guarantee that every request receives a response about whether it succeeded or failed
  • Partition tolerance: system continues to operate despite arbitrary message loss or failure of part of the system

In a distributed context, the choice is between CP and AP. Unfortunately, CA is just a joke, because a single point of failure is a red flag in the real distributed-systems world.

To ensure consistency, there are some popular protocols to consider: 2PC, eventual consistency (vector clock + RWN), Paxos, In-Sync Replica, etc.

To ensure availability, we can add replicas for the data. As to components of the whole system, people usually do cold standby, warm standby, hot standby, and active-active to handle the failover.

3 Future Studies

Just as with everything else, preparation and practice make perfect. This article only provides limited resources for your reference. Please keep learning, and enjoy 😉

EOF