Design a URL shortener

Step 1: Use cases and Constraints

Clarify requirements

  • Will people be able to specify their own short URLs or will it all be auto-generated?
  • Will you need to keep track of any stats on the clicks?
  • Should the URLs stay alive forever, or do they have a timeout?

Identify Use cases

Functional Requirements

  • Shortening a URL: take a URL => return a much shorter URL
  • Redirecting a short URL: take a short URL => redirect to the original URL
  • Users should optionally be able to pick a custom short link for their URL.
  • Links will expire after a standard default timespan.

Non-Functional Requirements

  • The system should be highly available.
  • URL redirection should happen in real-time with minimal latency.
  • Shortened links should not be predictable.

Extended Requirements

  • Analytics, e.g., click counts, country, platform.
  • Our service should also be accessible through REST APIs by other services.

Step 2: Capacity Estimation and Constraints

Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let's assume a 100:1 ratio between reads and writes.

Traffic Estimation

  • Assume 500M new URL shortenings per month.
  • With a 100:1 read/write ratio, we can expect 100 * 500M = 50B redirections per month.
  • New shortenings per second: 500M / (30 days * 24 hours * 3600 seconds) ~ 200 URLs/s
  • URL redirections per second: 100 * 500M / (30 days * 24 hours * 3600 seconds) ~ 19K/s

Storage Estimation

  • Assume we store every URL shortening request (and associated shortened link) for 5 years.
  • Total stored objects over that period: 500M * 12 months * 5 years = 30B URLs
  • Assume each stored object will be approximately 500 bytes.
  • So we need 30B * 500 bytes = 15 TB

Bandwidth estimates

  • Since we expect 200 new URL shortenings per second, total incoming data for our service will be: 0.5KB * 200 = ~100KB/s
  • For read requests, we expect 19K * 500 bytes = ~9.5MB/s of outgoing data

Memory estimates

  • How much memory will we need to store hot URLs that are frequently accessed?
  • We can follow the 80-20 rule: 20% of URLs generate 80% of the traffic, so we would like to cache these 20% hot URLs.
  • 19K * 3600 seconds * 24 hours = ~1.7 billion redirections per day
  • To cache 20% of these requests, we will need 170GB of memory: 0.2 * 1.7 billion * 500 bytes = ~170GB
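
The estimates above can be reproduced with a quick back-of-the-envelope script; a minimal sketch, assuming the 500M-per-month, 100:1, and 500-byte figures stated above:

# Back-of-the-envelope capacity estimates (assumptions taken from the text above).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60        # ~2.6M seconds
new_urls_per_month = 500_000_000             # 500M new shortenings per month (assumption)
read_write_ratio = 100                       # 100 redirections per shortening (assumption)
object_size_bytes = 500                      # ~500 bytes per stored URL object (assumption)

writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH             # ~200 URLs/s
reads_per_sec = writes_per_sec * read_write_ratio                   # ~19K/s
storage_5_years = new_urls_per_month * 12 * 5 * object_size_bytes   # ~15 TB
incoming_bandwidth = writes_per_sec * object_size_bytes             # ~100 KB/s
outgoing_bandwidth = reads_per_sec * object_size_bytes              # ~9.5 MB/s
redirects_per_day = reads_per_sec * 24 * 60 * 60                    # ~1.7 billion/day
cache_memory = 0.2 * redirects_per_day * object_size_bytes          # ~170 GB (80-20 rule)

print(f"writes/s ~{writes_per_sec:.0f}, reads/s ~{reads_per_sec:.0f}")
print(f"storage (5y) ~{storage_5_years / 1e12:.0f} TB, cache ~{cache_memory / 1e9:.0f} GB")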

High level estimates

  • New URLs: ~200/s
  • URL redirections: ~19K/s
  • Incoming data: ~100KB/s
  • Outgoing data: ~9.5MB/s
  • Storage for 5 years: ~15TB
  • Memory for cache: ~170GB

Step 3: Design System APIs

We can expose the functionality of our service through REST APIs. The following could be the definitions of the APIs for creating and deleting URLs.
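
A possible sketch of those two operations (parameter names such as api_dev_key and custom_alias are illustrative assumptions rather than a fixed contract):

from typing import Optional

def create_url(api_dev_key: str, original_url: str,
               custom_alias: Optional[str] = None,
               user_name: Optional[str] = None,
               expire_date: Optional[str] = None) -> str:
    # Registers a new short URL and returns it, or an error code on failure.
    # api_dev_key identifies a registered developer, e.g. for rate limiting.
    ...

def delete_url(api_dev_key: str, url_key: str) -> str:
    # Removes the mapping for url_key (the short-link identifier);
    # returns a confirmation such as "URL Removed" on success.
    ...

Over REST, these could map to something like POST /api/v1/urls and DELETE /api/v1/urls/{url_key}, with the developer key passed in a header.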

Step 4: Database Design

Defining the DB schema in the early stages of the interview helps in understanding the data flow among various components and later guides the data partitioning.

A few observations about the nature of the data we will store:

  • We need to store billions of records.
  • Each object we store is small (less than 1K).
  • There are no relationships between records—other than storing which user created a URL.
  • Our service is read-heavy.

Database Schema

We would need two objects: one for the URL mappings and one for the users who create them.

{
  hash: string
  url: string
  createdAt: Date
  expirationDate: Date
  userID: int
}

{
  userID: int
  email: string
  createdAt: Date
}

Since we anticipate storing billions of rows and we don't need relationships between objects, a NoSQL key-value store like DynamoDB or Cassandra is a better choice. A NoSQL store would also be easier to scale.

Step 5: Basic System Design and Algorithm

The problem we are solving here is: how to generate a short and unique key for a given URL?

a. Encoding actual URL

  • We can compute a unique hash (e.g., MD5 or SHA-256) of the given URL.
  • The hash can then be encoded for display (e.g., base62 or base64), and the first six or so characters used as the key, as sketched below.
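
A minimal sketch of this approach, assuming MD5 plus a base62 alphabet and a six-character key (function names are illustrative):

import hashlib
import string

ALPHABET = string.digits + string.ascii_letters   # 62 characters (base62 assumption)

def encode_base62(num: int, length: int = 6) -> str:
    # Convert an integer into a fixed-length base62 string.
    chars = []
    for _ in range(length):
        num, rem = divmod(num, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)

def shorten(original_url: str) -> str:
    # Hash the URL, then encode part of the digest as the six-character key.
    digest = hashlib.md5(original_url.encode("utf-8")).hexdigest()
    return encode_base62(int(digest[:16], 16))

Note that this deterministic scheme returns the same key for the same input string, which is exactly the duplicate-URL issue raised below.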

What are different issues with our solution?

  • If multiple users enter the same URL, they can get the same shortened URL, which is not acceptable.
  • What if parts of the URL are URL-encoded? Two URLs that are identical except for percent-encoding would hash to different keys.

b. Generating keys offline

  • We can have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let's call it key-DB).
  • Whenever we want to shorten a URL, we just take one of the already-generated keys and use it.
  • KGS will make sure all the keys inserted into key-DB are unique.

Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.
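
A minimal sketch of that handout logic, with in-memory sets standing in for the two key tables (key persistence and concurrent access control are omitted):

import secrets
import string

ALPHABET = string.digits + string.ascii_letters   # 62 characters; base64 adds two more

class KeyGenerationService:
    # Pre-generates random six-character keys and hands each one out exactly once.
    def __init__(self, pregenerate: int = 1000):
        self.unused_keys = set()    # stand-in for the "not used yet" table
        self.used_keys = set()      # stand-in for the "used keys" table
        while len(self.unused_keys) < pregenerate:
            self.unused_keys.add("".join(secrets.choice(ALPHABET) for _ in range(6)))

    def get_key(self) -> str:
        # Give a key to an application server and move it to the used set.
        key = self.unused_keys.pop()
        self.used_keys.add(key)
        return key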

What would be the key-DB size? With base64 encoding, we can generate 64^6 = ~68.7B unique six-letter keys. If we need one byte to store one character, we can store all these keys in: 6 (characters per key) * 68.7B (unique keys) = 412 GB.

Isn’t KGS the single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS. Whenever the primary server dies, the standby server can take over to generate and provide keys.

How would we perform a key lookup? We can look up the key in our database or key-value store to get the full URL. If it's present, issue an "HTTP 302 Redirect" status back to the browser, passing the stored URL in the "Location" field of the response. If the key is not present in our system, issue an "HTTP 404 Not Found" status, or redirect the user back to the homepage.
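
A sketch of that lookup path (db is a stand-in for the key-value store, each record follows the URL object from the schema above, and the expiration check is an added assumption tied to the expirationDate field):

from datetime import datetime, timezone

def handle_redirect(key: str, db: dict):
    # Look up the short key; 302-redirect to the stored URL, or 404 if unknown.
    record = db.get(key)
    if record is None:
        return 404, None                               # or redirect to the homepage
    if record["expirationDate"] < datetime.now(timezone.utc):
        return 404, None                               # treat expired links as gone
    return 302, record["url"]                          # Location: the original URL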

Step 6: Data Partitioning and Replication

To scale out our DB, we need to partition it so that it can store information about billions of URLs. We therefore need a partitioning scheme that divides and stores our data onto different DB servers.

Range Based Partitioning

  • We can store URLs in separate partitions based on the first letter of the URL or of the hash key (a simple mapping is sketched after this list).
  • Hence, we save all the URLs starting with the letter 'A' in one partition, those starting with 'B' in another, and so on.
  • We can even combine certain less frequently occurring letters into one database partition.
  • The main problem with this approach is that it can lead to unbalanced servers.
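
A simple version of that mapping; grouping digits and the rarer tail letters into shared partitions is an illustrative assumption:

def range_partition(key: str) -> str:
    # Partition by the first character of the URL or hash key;
    # less frequently occurring characters share combined partitions.
    first = key[0].upper()
    if not first.isalpha():
        return "0-9"                            # all digit-prefixed keys in one partition
    return first if first <= "T" else "U-Z"     # rarer tail letters combined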

Hash-Based Partitioning

  • We take a hash of the object we are storing.
  • We then calculate which partition to use based upon the hash.
  • In our case, we can take the hash of the ‘key’ or the actual URL to determine the partition in which we store the data object.
  • Our hashing function will randomly distribute URLs into different partitions (e.g., it can map any key to a number between 1 and 256), and this number represents the partition in which we store our object, as sketched below.
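
A minimal sketch of that mapping (256 partitions, as in the example above):

import hashlib

NUM_PARTITIONS = 256   # from the 1..256 example above

def hash_partition(key: str) -> int:
    # Map a short-URL key to a partition number between 1 and 256.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS + 1

A static modulo scheme like this can still leave some partitions overloaded; consistent hashing is a common refinement if that becomes a problem.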

Step 7: Cache

We can cache URLs that are frequently accessed.

How much cache should we have? As estimated above, we need 170GB of memory to cache 20% of daily traffic. Since a modern-day server can have 256GB of memory, we can easily fit all the cache into one machine.

Which cache eviction policy would best fit our needs?

  • Least Recently Used (LRU) can be a reasonable policy for our system (see the sketch below).

How can each cache replica be updated? Whenever there is a cache miss, our servers hit the backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica updates its cache by adding the new entry; if a replica already has that entry, it can simply ignore it.
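
A minimal sketch combining the LRU policy and the update-on-miss behavior (an OrderedDict stands in for the cache store, db for the backend database, and replication is reduced to a list of peer caches):

from collections import OrderedDict

class LRUCache:
    # Fixed-capacity cache that evicts the least recently used entry.
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)       # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        if key in self.entries:                 # already present: just refresh it
            self.entries.move_to_end(key)
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        self.entries[key] = value

def resolve(key, cache: LRUCache, db: dict, replicas: list):
    # On a cache miss, read from the database and push the entry to every replica.
    url = cache.get(key)
    if url is None:
        url = db.get(key)
        if url is not None:
            for peer in [cache, *replicas]:
                peer.put(key, url)              # replicas ignore keys they already hold
    return url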

Step 8: Load Balancer

We can add a Load balancing layer at three places in our system:

  • Between Clients and Application servers
  • Between Application Servers and database servers
  • Between Application Servers and Cache servers

Initially, we could use a simple Round Robin approach that distributes incoming requests equally among backend servers. This LB is simple to implement and introduces no overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of the rotation and stop sending it any traffic.
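
A sketch of that behavior (health detection itself is assumed to happen elsewhere and simply calls mark_dead):

class RoundRobinBalancer:
    # Cycles through the healthy backend servers in order.
    def __init__(self, servers: list):
        self.servers = list(servers)
        self.index = 0

    def next_server(self) -> str:
        if not self.servers:
            raise RuntimeError("no healthy backends available")
        server = self.servers[self.index % len(self.servers)]
        self.index += 1
        return server

    def mark_dead(self, server: str):
        # Take a failed server out of rotation so it stops receiving traffic.
        if server in self.servers:
            self.servers.remove(server)

Usage, e.g.: RoundRobinBalancer(["app1:8080", "app2:8080", "app3:8080"]).next_server() returns "app1:8080", then "app2:8080", then "app3:8080", then cycles back.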

Step 9: Purging or DB cleanup

Should entries stay forever, or should they be purged? Given the default expiration in our requirements, expired links can be cleaned up lazily: if a user hits an expired link, we delete it and return an error (or redirect to the homepage). A lightweight cleanup service can also run periodically during low-traffic hours to remove expired records from storage and the cache, after which their keys can be returned to key-DB for reuse.
