System Design: WhatsApp

WhatsApp is one of the most popular messaging platforms globally. It lets users send text messages, voice notes, images, and videos, and make voice and video calls. With over 2 billion active users, the system must be highly scalable, resilient, secure, and fast, handling many message types in real time across different devices and networks. This design follows engineering practices common at FANG-scale companies (Facebook, Amazon, Netflix, Google): horizontal scaling, low latency, and strong data security.

Key Requirements

1. Scalability: Handle millions of concurrent users and messages in real-time across global geographies.


2. Low Latency: Deliver text, multimedia, and voice traffic with minimal perceptible delay.


3. High Availability and Fault Tolerance: The system must remain operational 24/7, even during hardware or network failures.


4. Security: End-to-end encryption for messages, ensuring privacy and data protection.


5. Data Consistency: Keep messages and their delivery state synchronized and consistent across a user's devices in real time.


6. Global Availability: The system must support users across different time zones, languages, and locales.



System Components and Architecture




1. Client Applications:

Purpose: WhatsApp clients run on mobile (iOS, Android), desktop, and web.

Responsibilities:

Sending and receiving messages in real-time.

Syncing messages across multiple devices.

Encrypting messages before sending and decrypting upon receipt.


Technology:

Native iOS (Swift) and Android (Kotlin) apps.

Web client (ReactJS) with WebSockets for real-time communication.



2. API Gateway:

Purpose: Serves as the entry point for all API requests from client apps, including login, message sending, and synchronization.

Responsibilities:

Routes requests to the appropriate microservice (authentication, messaging, notifications, etc.).

Implements API rate limiting, authentication, and authorization.


Technology:

Nginx or AWS API Gateway for handling load balancing and routing.

gRPC for fast, low-latency communication between internal services.
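
The rate limiting mentioned above is often implemented as a token bucket. The sketch below is one minimal way to do it, not the gateway's actual mechanism: the class name TokenBucket and its parameters are hypothetical, and a real gateway would typically keep one bucket per user or API key in shared storage.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""
    capacity: float
    rate: float
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity          # start with a full bucket
        self.updated = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if the request may proceed, consuming one token."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway handler would call allow() once per incoming request and reject the request (for example with HTTP 429) when it returns False.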



3. Authentication and Authorization:

Purpose: Verifies user identity and ensures secure access to messages and personal data.

Responsibilities:

OAuth 2.0-based authentication, with JWTs used for session management.

Multi-factor authentication (MFA) for added security.

Authorization to ensure users can only access their own messages.


Technology:

OAuth2.0, JWT, Firebase Authentication (for mobile and web clients).

Two-factor authentication (2FA) for securing user accounts.
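
To make the JWT-based session handling concrete, here is a minimal sketch of issuing and verifying an HS256-signed token using only the Python standard library. The function names issue_token and verify_token, the claim set, and the shared-secret setup are assumptions for illustration; a production service would normally rely on a maintained JWT library and managed key storage.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(user_id: str, secret: bytes, ttl_seconds: int, now: int) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": user_id, "iat": now, "exp": now + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes, now: int) -> Optional[str]:
    """Return the user id if the signature is valid and the token is unexpired."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = hmac.new(
            secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
        ).digest()
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return None
    if header.get("alg") != "HS256" or payload.get("exp", 0) <= now:
        return None
    return payload.get("sub")
```

The "sub" claim returned by verify_token is what the authorization step would compare against the owner of the requested messages.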


4. User and Message Database:

Purpose: Stores user profiles, messages, group data, media files, and other metadata.

Responsibilities:

Maintain user account details, including phone numbers, contacts, and status updates.

Store text and media messages in an efficient, scalable manner.

Store encrypted message data for persistence and synchronization across devices.


Technology:

Cassandra for distributed storage of text messages (due to its high write throughput and horizontal scalability).

AWS S3 for media file storage (images, videos, voice messages).

SQLite or LevelDB for local storage on mobile clients.
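
In a wide-column store such as Cassandra, message tables are commonly partitioned so that no single partition grows without bound. The sketch below models one possible layout in plain Python: a partition key of (conversation, UTC day) with rows ordered by send time. The names partition_key and MessageStore and the daily bucket size are assumptions, and the in-memory dictionary only stands in for the database.

```python
import bisect
from collections import defaultdict
from datetime import datetime, timezone
from typing import DefaultDict, List, Tuple

Row = Tuple[int, str, bytes]  # (sent_at_ms, message_id, ciphertext)


def partition_key(conversation_id: str, sent_at_ms: int) -> Tuple[str, str]:
    """Bucket a conversation's messages by UTC day to bound partition size."""
    day = datetime.fromtimestamp(sent_at_ms / 1000, tz=timezone.utc).strftime("%Y%m%d")
    return conversation_id, day


class MessageStore:
    """In-memory stand-in: partition key -> rows kept sorted by send time."""

    def __init__(self) -> None:
        self._partitions: DefaultDict[Tuple[str, str], List[Row]] = defaultdict(list)

    def write(self, conversation_id: str, sent_at_ms: int,
              message_id: str, ciphertext: bytes) -> None:
        key = partition_key(conversation_id, sent_at_ms)
        # insort keeps rows ordered, like a clustering column would.
        bisect.insort(self._partitions[key], (sent_at_ms, message_id, ciphertext))

    def read_day(self, conversation_id: str, day: str) -> List[Row]:
        return list(self._partitions.get((conversation_id, day), []))
```

Only ciphertext is stored, consistent with the encrypted-persistence responsibility listed above.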



5. Message Queuing and Event Streaming:

Purpose: Ensures efficient, reliable, and fault-tolerant message delivery and synchronization.

Responsibilities:

Queue messages and events for real-time delivery and synchronization.

Handle retries and message acknowledgments for reliable delivery.


Technology:

Apache Kafka or RabbitMQ for managing real-time message streams.

Amazon SNS for sending push notifications.
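
The retry and acknowledgment behavior can be illustrated with a small in-memory queue that gives at-least-once delivery. This is a sketch under assumptions, not the Kafka or RabbitMQ API: the class AckQueue, its method names, and the dead-letter list are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set, Tuple


class AckQueue:
    """At-least-once delivery: a message stays tracked until it is acked."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.pending: Deque[str] = deque()
        self.payloads: Dict[str, bytes] = {}
        self.attempts: Dict[str, int] = {}
        self.in_flight: Set[str] = set()
        self.dead_letter: List[str] = []

    def publish(self, msg_id: str, payload: bytes) -> None:
        self.payloads[msg_id] = payload
        self.attempts[msg_id] = 0
        self.pending.append(msg_id)

    def poll(self) -> Optional[Tuple[str, bytes]]:
        """Hand out the next message and mark it as awaiting an ack."""
        if not self.pending:
            return None
        msg_id = self.pending.popleft()
        self.attempts[msg_id] += 1
        self.in_flight.add(msg_id)
        return msg_id, self.payloads[msg_id]

    def ack(self, msg_id: str) -> None:
        """Consumer confirmed processing, so the message can be forgotten."""
        if msg_id in self.in_flight:
            self.in_flight.discard(msg_id)
            del self.payloads[msg_id]
            del self.attempts[msg_id]

    def requeue_unacked(self) -> None:
        """Called on an ack timeout: retry, or dead-letter after max_attempts."""
        for msg_id in sorted(self.in_flight):
            if self.attempts[msg_id] >= self.max_attempts:
                self.dead_letter.append(msg_id)
            else:
                self.pending.append(msg_id)
        self.in_flight.clear()
```

Because a message can be handed out more than once, consumers in such a design need to be idempotent, for example by deduplicating on the message id.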


6. Message Delivery and Synchronization Service:

Purpose: Ensures messages are delivered and synchronized across devices in real-time.

Responsibilities:

Delivery confirmation mechanism (e.g., ticks: sent, delivered, read).

Sync messages between devices, maintaining state consistency.

Handling network partitions and retries.


Technology:

WebSockets or gRPC for real-time messaging.

Redis for caching delivery state and message status across devices.

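Eventual consistency by default, with quorum-based writes where stronger guarantees are needed and a deterministic rule (such as last-write-wins timestamps) for resolving conflicts.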
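
Delivery receipts can arrive out of order after retries or network partitions, so the stored status should only ever move forward. The sketch below shows one way to enforce that; the Status enum and apply_receipt function are hypothetical names, and the plain dictionary stands in for the Redis cache.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    SENT = 1       # accepted by the server
    DELIVERED = 2  # reached the recipient's device
    READ = 3       # opened by the recipient


def apply_receipt(statuses: Dict[str, Status], message_id: str,
                  incoming: Status) -> Status:
    """Record a receipt, never letting a late one regress the status."""
    current = statuses.get(message_id, Status.SENT)
    statuses[message_id] = max(current, incoming)
    return statuses[message_id]
```

With this rule, a "delivered" receipt that arrives after "read" leaves the message marked as read.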


7. End-to-End Encryption (E2EE) Service:

Purpose: Ensures that all messages and media shared between users are encrypted and can only be decrypted by the recipient.

Responsibilities:

Encrypt all messages on the client-side before sending.

Decrypt messages on the recipient side.

Secure key management using asymmetric encryption (public and private keys).


Technology:

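AES-256 for symmetric encryption of message content, and RSA or elliptic-curve key agreement (ECDH) for asymmetric key exchange. ECDSA is a signature algorithm, so it is suited to authenticating keys rather than encrypting data.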

Cryptographic libraries (e.g., OpenSSL, Libsodium) for encryption.
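
To show how two clients can arrive at the same symmetric key from public and private key pairs without the server ever seeing it, here is a toy Diffie-Hellman key agreement. It is an illustration only and not WhatsApp's protocol: the 127-bit prime is far too small for real use, the function names are hypothetical, and a real client would use a vetted library such as Libsodium with elliptic-curve keys.

```python
import hashlib
import secrets
from typing import Tuple

# Toy parameters for illustration only. Do not use in production.
P = 2**127 - 1  # a Mersenne prime, much too small to be secure
G = 3


def generate_keypair() -> Tuple[int, int]:
    """Return (private, public). Only the public value is ever shared."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def derive_shared_key(own_private: int, peer_public: int) -> bytes:
    """Both sides compute the same secret, then hash it to a 32-byte key."""
    shared = pow(peer_public, own_private, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()
```

The 32-byte result is the key size AES-256 expects; the message itself would then be encrypted on the sender's device with that key.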


8. Push Notification Service:

Purpose: Sends real-time notifications to users about new messages, calls, and app updates.

Responsibilities:

Handle push notifications for mobile and web clients.

Ensure that notifications are sent efficiently and reliably.


Technology:

Firebase Cloud Messaging (FCM) for push notifications.

Amazon SNS for sending notifications across various channels.
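
One common pattern is to use push notifications only when the recipient has no live connection. The sketch below assumes that pattern; route_event, the event fields, and the outbox list are hypothetical and do not reflect the FCM or SNS APIs. The push payload deliberately omits message content, since with end-to-end encryption the server cannot read it anyway.

```python
from typing import Callable, Dict, List


def route_event(user_id: str, event: dict,
                live_sockets: Dict[str, Callable[[dict], None]],
                push_outbox: List[dict]) -> str:
    """Deliver over the live connection if present, else queue a push."""
    send = live_sockets.get(user_id)
    if send is not None:
        send(event)
        return "socket"
    # Offline: queue a content-free wake-up hint for the push provider.
    push_outbox.append({
        "to": user_id,
        "type": event["type"],
        "conversation_id": event["conversation_id"],
    })
    return "push"
```

A separate worker would drain push_outbox and hand each entry to the push provider.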


9. Group Management Service:

Purpose: Handles group chats, allowing multiple users to chat together.

Responsibilities:

Managing group creation, membership, and user roles (admin, member).

Handling group messages and media sharing.

Synchronizing messages within groups across devices.


Technology:

Cassandra or CockroachDB for storing group data.

Redis for caching group memberships and notifications.
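
The role checks and group message fan-out described above can be sketched as follows. The Group class, create_group, and fan_out are hypothetical names, and the role model is reduced to the two roles named in the text (admin, member).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Group:
    group_id: str
    roles: Dict[str, str] = field(default_factory=dict)  # user_id -> "admin" or "member"

    def add_member(self, actor: str, user_id: str) -> None:
        """Only admins may add members; existing roles are left unchanged."""
        if self.roles.get(actor) != "admin":
            raise PermissionError(f"{actor} is not an admin of {self.group_id}")
        self.roles.setdefault(user_id, "member")


def create_group(group_id: str, creator: str) -> Group:
    """The creator becomes the first admin."""
    return Group(group_id, {creator: "admin"})


def fan_out(group: Group, sender: str, message_id: str) -> List[Tuple[str, str]]:
    """Produce one delivery task per member other than the sender."""
    if sender not in group.roles:
        raise PermissionError(f"{sender} is not in {group.group_id}")
    return [(member, message_id) for member in sorted(group.roles) if member != sender]
```

Each (member, message_id) pair would then be handed to the delivery pipeline like a one-to-one message.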


10. Media Management and CDN:

Purpose: Handles storing and delivering media files (photos, videos, voice messages).

Responsibilities:

Store media files and optimize them for delivery (e.g., resizing images, compressing videos).

Deliver media files via Content Delivery Network (CDN) for low-latency access.


Technology:

AWS S3 or Google Cloud Storage for media storage.

Cloudflare CDN or AWS CloudFront for optimized media delivery.
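
One way to name stored media objects is by the hash of their bytes, so identical uploads map to the same object and CDN URLs can be cached indefinitely. The sketch below assumes that scheme; the key layout and the host cdn.example.com are placeholders, not real endpoints.

```python
import hashlib


def media_object_key(content: bytes, extension: str) -> str:
    """Content-addressed key: the same bytes always yield the same key."""
    digest = hashlib.sha256(content).hexdigest()
    # A two-character prefix spreads objects across key ranges.
    return f"media/{digest[:2]}/{digest}.{extension}"


def cdn_url(object_key: str, cdn_host: str = "cdn.example.com") -> str:
    """URL a client would fetch; the host here is a placeholder."""
    return f"https://{cdn_host}/{object_key}"
```

If media is encrypted on the client before upload, the hash is taken over the ciphertext, so deduplication only applies to byte-identical encrypted blobs.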




11. Data Analytics and Monitoring:

Purpose: Monitors system health, user behavior, and message patterns. Provides insights into system performance and user activity.

Responsibilities:

Monitoring system performance (e.g., message delivery times, server health).

Analyzing user interactions for features like spam detection, user retention, etc.

Collecting usage statistics for optimizing the user experience.


Technology:

Prometheus for monitoring system health.

Apache Spark or Google BigQuery for data analytics.

Grafana for dashboard visualization.
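
Message delivery times are usually tracked as percentiles rather than averages, because a few slow deliveries can hide behind a healthy mean. The sketch below computes a nearest-rank percentile over raw samples; the function name and sample values are illustrative, and a monitoring system such as Prometheus would normally estimate this from histogram buckets instead.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Delivery latencies in milliseconds (made-up sample data).
latencies_ms = [120, 80, 95, 400, 110, 105, 90, 100, 85, 130]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Here the median stays near typical values while the 95th percentile exposes the single 400 ms outlier.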



Data Flow Diagram:

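+-----------------------------+       +-----------------------------+
| WhatsApp Clients            | <---> | API Gateway & Load Balancer |
+-----------------------------+       +-----------------------------+
               |                                     |
               v                                     v
+-----------------------------+       +-----------------------------+
| Authentication Service      |       | Message Queuing & Event     |
| (OAuth, JWT)                |       | Streaming Service           |
+-----------------------------+       +-----------------------------+
               |                                     |
               v                                     v
+-----------------------------+       +-----------------------------+
| User & Message DB           | <---> | Message Delivery & Sync     |
| (Cassandra, SQLite)         |       | Service (WebSockets, gRPC)  |
+-----------------------------+       +-----------------------------+
               |                                     |
               v                                     v
+-----------------------------+       +-----------------------------+
| E2EE Service                |       | Push Notification           |
| (AES, RSA)                  | <---> | Service                     |
+-----------------------------+       +-----------------------------+
               |                                     |
               v                                     v
+-----------------------------+       +-----------------------------+
| Group Management            |       | Media Management & CDN      |
| Service                     |       | (AWS S3, Cloudflare CDN)    |
+-----------------------------+       +-----------------------------+
               |
               v
+-----------------------------+
| Data Analytics & Monitoring |
| (Prometheus, Spark)         |
+-----------------------------+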

Infrastructure and Deployment:

1. Microservices: Each module (Authentication, Messaging, Media, etc.) is deployed as an independent microservice, running in Docker containers orchestrated by Kubernetes for scalability and fault tolerance.


2. Databases:

Cassandra for distributed message storage.

Redis for caching delivery state.

SQLite for local storage on mobile clients.



3. Media Storage: AWS S3 or Google Cloud Storage for storing media files.


4. CDN: Cloudflare or AWS CloudFront for global media delivery.


5. Message Queue: Kafka or RabbitMQ for event-driven messaging and delivery.

The article above was produced from the outputs of 1 human agent and 3 AI agents, an amalgamation of HGI and AI to serve technology education globally.

(Article By : Himanshu N)