System design interview- tinyurl service

Tags: , , ,

How to design a tinyurl service is one of the most popular interview questions. It is often asked in the third or fourth round as the big data design interview question. To master tinyurl design, you need to be familiar with Base Conversion of Numbers, and Consistent Hash algorithm

So what is tinyutl service?  It is a URL service that can provide a map between a shorter and unique url to a long URL provided by a user. 

For instance, “http://wp.me/p7Eshi-o3” is the tiny url for the current page. Typically, the postfix of a tinyurl is required to be a combination of 6 letters from [0-9, a-z, A-Z].  There are 62^6 ~= 56.8 billions uniques string in total. 

How to design a tinyurl service on single machine

On a single machine, we can easily design a database with three columns: id, original url, and the shorten url. 

If we let id auto increment, we can get the unique shorten url by designing a function to uniquely map the id to the shorten url. 

As the id are unique integers, the shorten url are unique strings from [0-9, a-z, A-Z], this can be solved by using the base conversion algorithm. 

The ID is a decimal integer, and the shorten url is a 62-base number. The map is:

The following implements the algorithm to transfer a 10-base integer into a 62-base number given the map described above. 

The output:

Time complexity analysis:

For an original url, its id is auto generated (in O(1) time). The base conversion algorithm runs in O(k) time where k is the number of digits (i.e. k=6).

How to design tinyurl on multiple machines

The single machine based method does not work when there are too many traffics. It is necessary to design a scalable method using multiple servers. 

When dealing with large scale data allocated on multiple servers, a popular method is based on distributed hash table. The most famous algorithm to implement a distributed hash table is based on the consistent hash technique, which hashes both the inputs and servers (e.g., denoted by IP) into the same hash space. As the hash values increase from 0 to N (e.g, 2^32), these hash values form a ring or circle. 

How to locate the server for a datapoint

For any data points and server, they are hashed to the hash ring using the same hash function.  The corresponding server for a datapoint is the first server on the ring in the clock wise direction. 

For instance, suppose two neighbored servers have hash value S and S + K, then for any data with hashed value in the range [S + 1, S + K], they will be served by the server with hash value S + K.  

In the following figure, we have four Servers A, B, C and D. Their positions on the hash ring are shown in the following figure.  For the two data points K1 and K2, they are hashed to the ring using the same Hash function for hashing the servers.  Based on the rule described above, K1, and K2 will be served by Server B

consistent hash figure

consistent hash figure

 

Please refer to this post for how to implement consistent hash

How to use Consistent Hash to design a distributed tinyurl service

In the tinyurl generation phrase, for an original url:

  1. hash the url into a single integer, denoted as hasheURL
  2. locate the server on the consistent hash ring:  the server that is closest to hashedURL in the right clock direction on the hash ring.  And store the url on the server using the hashedURL as the key.
  3. Compute the tinyurl using the base conversion algorithm (from 10 based to 62 based). 
  4. return the tinyurl to the user

In the url search phrase:

  1. Convert the shorten url back to the hashedURL (from 62 base to 10 base)
  2. Use the hashedURL to locate the server that stores the original  URL, and return it.