Design Pastebin

Designing a url shortener is a similar question, except that Pastebin requires storing the paste contents instead of the original unshortened url.

Step 1: Outline Use cases and Constraints

Functional requirements

  • User enters a block of text and gets a randomly generated url
  • User enters a paste's url and views its contents
  • User login is optional

Non-Functional requirements

  • Expiration
    • By default, pastes do not expire
    • User can optionally specify an expiration time
  • Service tracks analytics of pages
    • For example, monthly visit stats, countries, and platforms
  • Service should delete expired pastes
  • Service has high availability

Constraints

  • Traffic is not evenly distributed
  • Following a short link should be fast
  • Pastes are text only
  • Page view analytics do not need to be realtime

Assumptions and Estimation

  • 10 million users
  • 10 million paste writes per month
  • 100 million paste reads per month
  • 10:1 read to write ratio

Calculate usage

  • Size per paste
    • Roughly 1.5 KB
  • New paste per month
    • Roughly 15 GB = 10M new pastes * 1.5 KB per paste
    • About 180 GB per year = 15 GB * 12
    • Assume most are new pastes instead of updates to existing ones
  • 4 paste writes per second on average
    • 10M / (30 days * 24 hours * 3600 seconds)
  • 40 read requests per second on average
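
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimates; the variable names are illustrative, and storage uses decimal units (1 GB = 1,000,000 KB).

```python
# Back-of-the-envelope estimates using the assumptions listed above.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000, roughly 2.5 million

writes_per_month = 10_000_000
reads_per_month = 100_000_000
paste_size_kb = 1.5

# Storage growth (decimal units: 1 GB = 1,000,000 KB).
storage_gb_per_month = writes_per_month * paste_size_kb / 1_000_000
storage_gb_per_year = storage_gb_per_month * 12

# Average request rates.
writes_per_second = writes_per_month / SECONDS_PER_MONTH
reads_per_second = reads_per_month / SECONDS_PER_MONTH

print(f"{storage_gb_per_month:.0f} GB/month, {storage_gb_per_year:.0f} GB/year")
print(f"{writes_per_second:.1f} writes/s, {reads_per_second:.1f} reads/s")
```

The exact averages come out slightly under the rounded figures of 4 writes and 40 reads per second, because a 30-day month has a little more than 2.5 million seconds.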

Handy conversion guide

  • 2.5 million seconds per month
  • 1 request per second = 2.5 million requests per month
  • 40 requests per second = 100 million requests per month
  • 400 requests per second = 1 billion requests per month

Step 2: Create a high level design

Outline a high level design with all important components.

Step 3: Design core components

We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.

Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.

As an alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL.

The following discussion uses the relational database approach:

  • The Client sends a create paste request to the Web Server, running as a reverse proxy
  • The Web Server forwards the request to the Write API server
  • The Write API server does the following:
    • Generates a unique url
      • Checks if the url is unique by looking at the SQL Database for a duplicate
      • If the url is not unique, it generates another url
      • If we supported a custom url, we could use the user-supplied url (and also check it for a duplicate)
    • Saves to the SQL Database pastes table
    • Saves the paste data to the Object Store
    • Returns the url
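
The Write API steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class `PasteWriter`, its method names, the `paste_path` field, and the use of in-memory dictionaries in place of the SQL Database and the Object Store are all assumptions made for the example, and the url generator is a random stand-in.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # [a-zA-Z0-9]


class PasteWriter:
    """Hypothetical Write API; dicts stand in for the real stores."""

    def __init__(self):
        self.pastes_table = {}  # stand-in for the SQL Database pastes table
        self.object_store = {}  # stand-in for the Object Store

    def _generate_url(self, length=7):
        # Stand-in generator: random characters from the url-safe alphabet.
        return "".join(secrets.choice(ALPHABET) for _ in range(length))

    def create_paste(self, contents, expiration_minutes=None, max_attempts=5):
        # Generate a url, retrying if it collides with an existing one.
        for _ in range(max_attempts):
            shortlink = self._generate_url()
            if shortlink not in self.pastes_table:
                break
        else:
            raise RuntimeError("could not generate a unique url")

        # Save the metadata row, then the paste data itself.
        paste_path = f"pastes/{shortlink}"
        self.pastes_table[shortlink] = {
            "paste_path": paste_path,
            "expiration_length_in_minutes": expiration_minutes,
        }
        self.object_store[paste_path] = contents
        return shortlink


writer = PasteWriter()
print(writer.create_paste("Hello World!", expiration_minutes=60))
```

With a real database, a check followed by an insert can race between concurrent writers, so a unique constraint on the url column is one way to make the duplicate check reliable.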

Clarify with your interviewer how much code you are expected to write.

We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x longer and reading from disk takes 80x longer.
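
As a rough illustration of the indexes, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table layout is an assumption: only shortlink and created_at are named in the text, expiration_length_in_minutes is borrowed from the API, and paste_path and the index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed schema; paste_path points at the paste in the Object Store.
    CREATE TABLE pastes (
        shortlink TEXT NOT NULL,
        expiration_length_in_minutes INTEGER,
        created_at TEXT NOT NULL,
        paste_path TEXT NOT NULL
    );
    CREATE UNIQUE INDEX idx_pastes_shortlink ON pastes (shortlink);
    CREATE INDEX idx_pastes_created_at ON pastes (created_at);
    """
)

# Ask SQLite how it would run a lookup by shortlink.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT paste_path FROM pastes WHERE shortlink = ?",
    ("foobar",),
).fetchall()
plan_detail = " ".join(row[3] for row in plan)
print(plan_detail)
```

The printed plan should mention idx_pastes_shortlink, which indicates an index search rather than a full table scan. The unique index also lets the database reject duplicate urls on insert.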

Generate Unique URL

  • Take the MD5 hash of the user's ip_address + timestamp
    • MD5 is a widely used hashing function that produces a 128-bit hash value
    • MD5 is uniformly distributed
    • Alternatively, we could also take the MD5 hash of randomly-generated data
  • Base 62 encode the MD5 hash
    • Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters
    • There is only one hash result for the original input, and Base 62 is deterministic (no randomness involved)
    • Base 64 is another popular encoding but causes issues for urls because of the additional + and / characters
    • Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years.
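
The url generation steps above can be sketched in Python as follows. The function names `base62_encode` and `generate_shortlink`, the digit ordering of the alphabet, and the sample inputs are assumptions for illustration only.

```python
import hashlib
import string

# 62 url-safe characters; the ordering is an arbitrary choice for this sketch.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(num: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, remainder = divmod(num, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def generate_shortlink(ip_address: str, timestamp: str, length: int = 7) -> str:
    """MD5 the ip_address + timestamp, Base 62 encode, keep the first characters."""
    digest = hashlib.md5((ip_address + timestamp).encode("utf-8")).digest()
    encoded = base62_encode(int.from_bytes(digest, "big"))
    # Pad on the left in the (very unlikely) case the encoding is shorter.
    return encoded.rjust(length, ALPHABET[0])[:length]


print(generate_shortlink("127.0.0.1", "2016-01-01 00:00:00"))
```

One caveat about this sketch: a 128-bit value usually encodes to 22 Base 62 characters, and the leading character of such an encoding can only take a few values, so the first 7 characters cover noticeably fewer than 62^7 combinations. Taking the last 7 characters instead is one way to avoid that skew. Either way, the duplicate check described in the write path still applies.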

We'll use a public REST API:

$ curl -X POST --data '{ "expiration_length_in_minutes": "60", "paste_contents": "Hello World!" }' \
    https://pastebin.com/api/v1/paste

Response:

{
    "shortlink": "foobar"
}

For internal communications, we could use Remote Procedure Calls.

Use case: User enters a paste's url and views the contents

  • The Client sends a get paste request to the Web Server
  • The Web Server forwards the request to the Read API server
  • The Read API server does the following:
    • Checks the SQL Database for the generated url
      • If the url is in the SQL Database, fetch the paste contents from the Object Store
      • Else, return an error message for the user
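
A minimal sketch of the Read API lookup described above. The function name `get_paste`, the dictionaries standing in for the SQL Database and Object Store, the `paste_path` field, and the status codes are assumptions for illustration.

```python
def get_paste(shortlink, pastes_table, object_store):
    """Return (status_code, body) for a paste lookup."""
    row = pastes_table.get(shortlink)  # check the SQL Database stand-in
    if row is None:
        return 404, {"error": "paste not found"}
    return 200, {
        # Fetch the contents from the Object Store stand-in.
        "paste_contents": object_store[row["paste_path"]],
        "created_at": row["created_at"],
        "expiration_length_in_minutes": row["expiration_length_in_minutes"],
    }


# Hypothetical sample data.
pastes_table = {
    "foobar": {
        "paste_path": "pastes/foobar",
        "created_at": "2016-01-01 00:00:00",
        "expiration_length_in_minutes": "60",
    }
}
object_store = {"pastes/foobar": "Hello World!"}

print(get_paste("foobar", pastes_table, object_store))
print(get_paste("missing", pastes_table, object_store))
```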

REST API:

$ curl 'https://pastebin.com/api/v1/paste?shortlink=foobar'

Response:

{
    "paste_contents": "Hello World!",
    "created_at": "YYYY-MM-DD HH:MM:SS",
    "expiration_length_in_minutes": "60"
}

Use case: Service tracks analytics of pages

Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
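
To make the MapReduce idea concrete, here is a small single-process sketch of the map and reduce steps in plain Python. The log format (a timestamp and a url separated by a tab) and the function names are assumptions; a real job would run on a MapReduce framework over the actual Web Server log format.

```python
from collections import defaultdict


def mapper(line):
    """Emit ((year_month, url), 1) for one log line."""
    timestamp, url = line.rstrip("\n").split("\t")
    yield (timestamp[:7], url), 1  # "YYYY-MM" prefix of the timestamp


def reducer(key, values):
    """Sum the hit counts for one (year_month, url) key."""
    return key, sum(values)


def map_reduce(lines):
    # Shuffle step: group mapper output by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical log lines.
logs = [
    "2016-01-03 10:15:00\tfoobar",
    "2016-01-20 08:00:00\tfoobar",
    "2016-01-21 09:30:00\tbazqux",
    "2016-02-01 12:00:00\tfoobar",
]
hits = map_reduce(logs)
for key in sorted(hits):
    print(key, hits[key])
```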

Use case: Service deletes expired pastes

To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp is older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
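
A minimal sketch of such a cleanup job, using Python's built-in sqlite3 module. The `expires_at` column (epoch seconds, NULL for pastes that never expire), the simplified table, and the function name `delete_expired` are assumptions for illustration.

```python
import sqlite3
import time


def delete_expired(conn, now=None):
    """Delete pastes whose expiration timestamp is in the past; return the count."""
    now = int(time.time()) if now is None else now
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            "DELETE FROM pastes WHERE expires_at IS NOT NULL AND expires_at < ?",
            (now,),
        )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pastes (shortlink TEXT, expires_at INTEGER)")
conn.executemany(
    "INSERT INTO pastes VALUES (?, ?)",
    [("expired", 50), ("active", 500), ("forever", None)],
)

deleted = delete_expired(conn, now=100)
remaining = [row[0] for row in conn.execute("SELECT shortlink FROM pastes ORDER BY shortlink")]
print(deleted, remaining)
```

A job like this could run periodically. The matching paste data in the Object Store would also need to be removed, which this sketch leaves out.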

Step 4: Scale the design
