Add Raft cluster peer management (GetPeers, AddPeer, RemovePeer) #663

Merged
mostafa merged 49 commits into main from feature/add-raft-cluster-peer-management on Feb 23, 2025

sinadarbouy
Collaborator

@sinadarbouy sinadarbouy commented Feb 18, 2025

Ticket(s)

#642

Description

This PR implements Raft peer management APIs to enable adding, removing, and querying peers in the Raft cluster. Key changes include:

  • Added new API endpoints for Raft peer management:
    • GetPeers: Returns information about all peers in the Raft cluster
    • AddPeer: Adds a new peer to the Raft cluster
    • RemovePeer: Removes a peer from the Raft cluster
  • Added corresponding protobuf definitions and messages
  • Added comprehensive test coverage for the new APIs
  • Updated API documentation
  • Added raft/node* to .gitignore
  • Updated Dockerfile to use version ranges for dependencies

Development Checklist

  • I have added a descriptive title to this PR.
  • I have squashed related commits together.
  • I have rebased my branch on top of the latest main branch.
  • I have performed a self-review of my own code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added docstring(s) to my code.
  • I have made corresponding changes to the documentation (docs).
  • I have updated docs using make gen-docs command.
  • I have added tests for my changes.
  • I have signed all the commits.

Legal Checklist

sinadarbouy and others added 30 commits December 24, 2024 16:21
Adds support for automatic peer discovery and cluster joining for non-bootstrap nodes.
Key changes:

- Add AddPeer RPC endpoint to allow nodes to join an existing cluster
- Implement TryConnectToCluster() to handle automatic cluster joining
- Forward AddPeer requests to leader if received by follower
- Add protobuf definitions for AddPeer request/response
- Update .gitignore to exclude raft node data files

This change allows new nodes to automatically discover and join an existing cluster
by attempting to connect to configured peers until successful. Non-leader nodes
will forward join requests to the current leader.
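
The join loop described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `TryConnectToCluster()`: the `joinFunc` type and the lowercase `tryConnectToCluster` helper are hypothetical names, and the real implementation sends an AddPeer RPC rather than calling a plain function.

```go
package main

import (
	"errors"
	"fmt"
)

// joinFunc is a hypothetical stand-in for sending an AddPeer request to one peer.
type joinFunc func(peerAddr string) error

// tryConnectToCluster asks each configured peer in turn to add this node and
// stops at the first success, returning the address that accepted the request.
func tryConnectToCluster(peers []string, join joinFunc) (string, error) {
	var errs []error
	for _, addr := range peers {
		if err := join(addr); err != nil {
			errs = append(errs, fmt.Errorf("join via %s: %w", addr, err))
			continue
		}
		return addr, nil
	}
	errs = append(errs, errors.New("no configured peer accepted the join request"))
	return "", errors.Join(errs...)
}

func main() {
	// Simulated cluster in which only the second peer is reachable.
	join := func(addr string) error {
		if addr != "node2:50051" {
			return errors.New("connection refused")
		}
		return nil
	}
	addr, err := tryConnectToCluster([]string{"node1:50051", "node2:50051"}, join)
	fmt.Println(addr, err)
}
```

Collecting the per-peer errors with `errors.Join` keeps every failed attempt visible when no peer accepts the request.
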
Add unit tests to verify the AddPeer behavior in both leader and follower nodes:
- Test successful peer addition when node is leader
…ower nodes

- Updated TestAddPeer to include checks for adding peers when the node is a leader and a follower.
- Introduced temporary directories for each node to ensure isolated testing environments.
- Added assertions to confirm that both new peers are successfully integrated into the cluster.
- Improved test reliability by implementing a loop to wait for both nodes to join the cluster before completing the test.
- Added RemovePeer RPC endpoint to the Raft service, allowing nodes to remove peers from the cluster.
- Introduced RemovePeerRequest and RemovePeerResponse message types in the protobuf definitions.
- Updated RaftNode to handle peer removal, including forwarding requests to the leader if the node is not the leader.
- Enhanced the README documentation to include details about the new RemovePeerRequest and RemovePeerResponse.
- Implemented unit tests for the RemovePeer functionality, ensuring correct behavior when removing both leader and follower nodes.
- Updated gRPC and HTTP handlers to support the new RemovePeer functionality.

This change enhances the Raft protocol's capability to manage cluster membership dynamically.
Implement Raft cluster management API endpoints to retrieve, add, and remove peers:
- Add GetPeers method to retrieve current Raft cluster peers
- Implement AddPeer and RemovePeer RPC endpoints for dynamic cluster membership
- Update API service definition to include new Raft peer management methods
- Add corresponding gRPC and HTTP handlers for peer management
- Enhance protobuf definitions with new message types for peer operations

These changes provide a comprehensive API for managing Raft cluster membership, allowing dynamic peer addition and removal.
Implement peer discovery and graceful shutdown in raft.go
Configure Tempo tracing service with explicit endpoint binding and add health checks to Docker Compose. This ensures proper tracing integration and service readiness in the observability stack.
…and openssl

Modify Dockerfile to use more flexible version constraints for alpine packages, allowing minor version updates while maintaining compatibility.
Enhance Raft node configuration to support optional TLS encryption:
- Add IsSecure, CertFile, and KeyFile fields to Raft configuration
- Implement conditional TLS server credentials based on secure mode
- Update default configuration to disable secure mode
- Modify gRPC server startup to handle secure and insecure modes
- Improve logging for gRPC server initialization

This change provides flexibility in configuring Raft node communication security while maintaining backward compatibility.
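
A minimal sketch of the conditional TLS setup, using only the standard library. The field names `IsSecure`, `CertFile`, and `KeyFile` come from the commit message; the `raftTLSOptions` struct and the `serverTLSConfig` helper are hypothetical. In a gRPC server the resulting `*tls.Config` would typically be wrapped in transport credentials, which is omitted here.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
)

// raftTLSOptions is a hypothetical struct holding the three fields named in the commit.
type raftTLSOptions struct {
	IsSecure bool
	CertFile string
	KeyFile  string
}

// serverTLSConfig returns (nil, nil) when secure mode is off, so the caller
// can start an insecure server; otherwise it loads the key pair from disk.
func serverTLSConfig(o raftTLSOptions) (*tls.Config, error) {
	if !o.IsSecure {
		return nil, nil
	}
	if o.CertFile == "" || o.KeyFile == "" {
		return nil, errors.New("secure mode requires both CertFile and KeyFile")
	}
	cert, err := tls.LoadX509KeyPair(o.CertFile, o.KeyFile)
	if err != nil {
		return nil, fmt.Errorf("load TLS key pair: %w", err)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	// Default configuration: secure mode disabled, no TLS config returned.
	cfg, err := serverTLSConfig(raftTLSOptions{IsSecure: false})
	fmt.Println(cfg == nil, err)

	// Secure mode without certificate paths is rejected.
	_, err = serverTLSConfig(raftTLSOptions{IsSecure: true})
	fmt.Println(err)
}
```
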
Modify the HealthChecker to always return NOT_SERVING status by commenting out Raft-specific health checks.
Introduce getLeaderClient method to centralize leader client retrieval logic in AddPeer and RemovePeer methods. This reduces code duplication and improves maintainability by extracting the common pattern of finding the leader's gRPC address and obtaining a client.
Implement TestSecureGRPCConfiguration to validate secure Raft node configuration:
- Add test cases for valid and invalid secure configuration scenarios
- Introduce helper function to generate self-signed certificates for testing
- Verify TLS credential handling and error conditions
- Ensure proper configuration of secure and non-secure gRPC nodes
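
A helper for generating a throwaway self-signed certificate in tests might look like the sketch below. The function names, the ECDSA P-256 key type, and the one-hour validity are assumptions for illustration; the helper actually added in the PR may differ.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// generateSelfSignedCert returns a PEM-encoded certificate and private key
// valid for "localhost", intended for tests only.
func generateSelfSignedCert() (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, fmt.Errorf("generate key: %w", err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "localhost"},
		DNSNames:              []string{"localhost"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
	}
	// Template and parent are the same certificate, which makes it self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, fmt.Errorf("create certificate: %w", err)
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, fmt.Errorf("marshal key: %w", err)
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// verifyPair checks that the certificate and key parse and belong together.
func verifyPair(certPEM, keyPEM []byte) error {
	_, err := tls.X509KeyPair(certPEM, keyPEM)
	return err
}

func main() {
	certPEM, keyPEM, err := generateSelfSignedCert()
	if err != nil {
		panic(err)
	}
	fmt.Println("pair valid:", verifyPair(certPEM, keyPEM) == nil)
}
```
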
…c allocation

Modify TestSecureGRPCConfiguration to use port 0 for dynamic port allocation, improving test reliability and preventing potential port conflicts during parallel test execution.
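
The port 0 technique relies on the operating system choosing a free port at bind time. A small standard-library sketch (the `listenOnFreePort` helper is a hypothetical name, not code from the PR):

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// listenOnFreePort binds to port 0 so the OS assigns an unused port, then
// reports which port was chosen.
func listenOnFreePort() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	tcpAddr, ok := ln.Addr().(*net.TCPAddr)
	if !ok {
		ln.Close()
		return nil, 0, errors.New("listener address is not a TCP address")
	}
	return ln, tcpAddr.Port, nil
}

func main() {
	ln, port, err := listenOnFreePort()
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println("listening on port", port)
}
```

Because each test gets its own listener, parallel tests cannot collide on a hardcoded port.
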
…oring

Integrate HashiCorp's logger adapter using zerolog. This simplifies the Raft node initialization by leveraging built-in logging mechanisms and removing redundant leadership tracking logic.
…sertions

Enhance HTTP server test by:
- Adding error handling for gRPC server startup
- Using require assertions for clearer test failures
- Implementing panic recovery for gRPC server
- Improving server startup error detection
Remove the `monitorLeadership()` method from the Raft node initialization, which was previously commented out. This simplifies the node startup process and removes unnecessary leadership tracking logic that was likely superseded by more efficient Raft cluster management mechanisms.
Implement thorough test suite for RemovePeer API method, covering:
- Successful peer removal
- Error handling for uninitialized Raft node
- Handling of non-existent peer removal
- Proper gRPC error code validation
Implement thorough test suite for GetPeers API method, covering:
- Successful peer retrieval with a leader node
- Error handling for uninitialized Raft node
- Validation of returned peers map structure
- Proper gRPC error code validation
Enhance peer management functionality by introducing gRPC address tracking:
- Update AddPeer method to include gRPC address parameter
- Modify AddPeerRequest and related protobuf definitions
- Extend peer addition logic to store gRPC address in local peers list
- Update API and RPC methods to handle new gRPC address field
- Add comprehensive test cases for AddPeer with gRPC address validation
…error handling

Enhance peer management methods by:
- Adding context with timeout for AddPeer and RemovePeer operations
- Improving error messages with more context
- Using getter methods for request fields
- Updating test cases to reflect new method signatures
- Adding more robust error handling and logging
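
As a rough sketch of the timeout pattern, the wrapper below bounds a peer operation with `context.WithTimeout` and adds context to the returned error. The names `addPeerFunc`, `addPeerWithTimeout`, and `slowAdd` are hypothetical, and the 20 ms timeout is chosen only to keep the demo fast.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// addPeerFunc is a hypothetical stand-in for the Raft node's AddPeer call.
type addPeerFunc func(ctx context.Context, peerID, raftAddr, grpcAddr string) error

// addPeerWithTimeout bounds the operation with a deadline and wraps any
// failure with the identifiers needed to diagnose it.
func addPeerWithTimeout(parent context.Context, timeout time.Duration, add addPeerFunc, peerID, raftAddr, grpcAddr string) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	if err := add(ctx, peerID, raftAddr, grpcAddr); err != nil {
		return fmt.Errorf("add peer %q (raft=%s, grpc=%s): %w", peerID, raftAddr, grpcAddr, err)
	}
	return nil
}

// slowAdd simulates a leader that never answers; it returns only when ctx ends.
func slowAdd(ctx context.Context, _, _, _ string) error {
	<-ctx.Done()
	return ctx.Err()
}

// demoTimeout reports whether a stalled AddPeer call surfaces as a deadline error.
func demoTimeout() bool {
	err := addPeerWithTimeout(context.Background(), 20*time.Millisecond, slowAdd,
		"node2", "127.0.0.1:2222", "127.0.0.1:50052")
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", demoTimeout())
}
```

Wrapping with `%w` keeps the underlying `context.DeadlineExceeded` detectable through `errors.Is`.
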
Implement TestFSMPeerOperations to validate Raft cluster peer management:
- Create a multi-node Raft cluster with bootstrap and follower nodes
- Verify peer synchronization across nodes
- Ensure consistent peer information in FSM state
- Validate leader election and consistency
- Add robust assertions for peer addition and state tracking
Improve peer management by:
- Adding CommandAddPeer and CommandRemovePeer to FSM
- Implementing peer synchronization across Raft cluster
- Adding waitForLeader method with retry mechanism
- Enhancing error handling and logging for peer operations
- Updating leader client retrieval with more reliable mechanism
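
To illustrate how add and remove commands could flow through a state machine, here is a simplified sketch. The names `CommandAddPeer` and `CommandRemovePeer` come from the commit message; their string values, the JSON encoding, and the `peerFSM` type are assumptions. A real hashicorp/raft FSM receives a log entry rather than raw bytes, which is left out to keep the example dependency-free.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// Command names come from the commit message; the string values are assumptions.
const (
	CommandAddPeer    = "ADD_PEER"
	CommandRemovePeer = "REMOVE_PEER"
)

// peerPayload is a hypothetical payload carrying a peer's addresses.
type peerPayload struct {
	ID          string `json:"id"`
	Address     string `json:"address"`
	GRPCAddress string `json:"grpcAddress"`
}

type command struct {
	Type string      `json:"type"`
	Peer peerPayload `json:"peer"`
}

// peerFSM keeps the replicated peer list; every node applies the same
// commands in the same order, so the maps converge.
type peerFSM struct {
	mu    sync.RWMutex
	peers map[string]peerPayload
}

func newPeerFSM() *peerFSM { return &peerFSM{peers: make(map[string]peerPayload)} }

// Apply decodes one command and updates the peer map accordingly.
func (f *peerFSM) Apply(data []byte) error {
	var cmd command
	if err := json.Unmarshal(data, &cmd); err != nil {
		return fmt.Errorf("decode command: %w", err)
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	switch cmd.Type {
	case CommandAddPeer:
		if cmd.Peer.ID == "" || cmd.Peer.Address == "" {
			return errors.New("add peer: id and address are required")
		}
		f.peers[cmd.Peer.ID] = cmd.Peer
	case CommandRemovePeer:
		delete(f.peers, cmd.Peer.ID)
	default:
		return fmt.Errorf("unknown command type %q", cmd.Type)
	}
	return nil
}

// PeerCount returns the number of peers currently tracked.
func (f *peerFSM) PeerCount() int {
	f.mu.RLock()
	defer f.mu.RUnlock()
	return len(f.peers)
}

func main() {
	fsm := newPeerFSM()
	add := []byte(`{"type":"ADD_PEER","peer":{"id":"node2","address":"127.0.0.1:2222","grpcAddress":"127.0.0.1:50052"}}`)
	if err := fsm.Apply(add); err != nil {
		panic(err)
	}
	fmt.Println("peers after add:", fsm.PeerCount())
	if err := fsm.Apply([]byte(`{"type":"REMOVE_PEER","peer":{"id":"node2"}}`)); err != nil {
		panic(err)
	}
	fmt.Println("peers after remove:", fsm.PeerCount())
}
```
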
Update Raft node and RPC methods to accept context parameter:
- Modify Apply method to include context for better request tracing and timeout control
- Update forwardToLeader and applyInternal methods to use context
- Adjust RPC server methods to pass context through
- Refactor test cases to provide context when calling Apply
- Improve error handling and request forwarding with context support
…ation

Implement a new GetPeerInfo RPC method to support peer synchronization across Raft cluster:
- Add GetPeerInfoRequest and GetPeerInfoResponse protobuf definitions
- Create RPC method to query peer information from other nodes
- Implement peer synchronization mechanism with periodic checks
- Add method to query and update peer information across cluster
- Enhance peer management with cross-node information retrieval
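
The periodic check could be structured as a ticker-driven goroutine that stops when its context is cancelled. In this sketch `startPeerSynchronizer` and `runDemo` are hypothetical names, the `syncPeers` callback stands in for the GetPeerInfo queries, and the 5 ms interval exists only to keep the demo short.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// startPeerSynchronizer calls syncPeers on every tick until ctx is cancelled.
// The returned channel is closed once the goroutine has exited.
func startPeerSynchronizer(ctx context.Context, interval time.Duration, syncPeers func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				syncPeers()
			}
		}
	}()
	return done
}

// runDemo runs the synchronizer briefly and returns how many rounds completed.
func runDemo() int64 {
	var rounds atomic.Int64
	ctx, cancel := context.WithCancel(context.Background())
	done := startPeerSynchronizer(ctx, 5*time.Millisecond, func() { rounds.Add(1) })
	time.Sleep(100 * time.Millisecond)
	cancel()
	<-done // wait for the goroutine to stop before reading the counter
	return rounds.Load()
}

func main() {
	fmt.Println("sync rounds:", runDemo())
}
```

Returning a done channel lets a shutdown path wait for the synchronizer to finish instead of leaking the goroutine.
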
Update testcontainers-go dependency to the latest version, which includes potential bug fixes and improvements.
Remove the placeholder DiscoverPeers method that was not implemented, keeping the codebase clean and focused on existing peer management functionality.
Improve peer synchronization and RPC method implementation:
- Remove error handling from syncPeers method
- Simplify StartPeerSynchronizer goroutine
- Update GetPeerInfo RPC method to use getter method
- Remove unnecessary logging and error checks
…guration management

Break down Raft node creation into smaller, focused functions:
- Extract node configuration initialization
- Create separate methods for FSM, stores, and transport setup
- Improve error handling and logging during node creation
- Add context cancellation for peer synchronization
- Enhance cluster configuration and bootstrapping logic
Modify package version constraints to use more flexible version matching for git, make, and openssl packages, allowing minor version updates while maintaining compatibility.
Update the protoc-gen-go version in generated protobuf files for both API and Raft services, ensuring compatibility and using the latest minor version.
Remove hardcoded ARM64 architecture setting in docker-compose-raft.yaml, allowing for more flexible deployment configurations.
Update load balancer strategies to accept a context parameter, enabling timeout and cancellation support for proxy selection. This change introduces context handling in:
- ConsistentHash
- Random
- RoundRobin
- WeightedRoundRobin

Also add a FindProxyTimeout constant in the server to provide a default timeout for proxy selection.
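
A context-aware round-robin strategy might look like the following sketch. The `FindProxyTimeout` name is taken from the commit message, but its value here, the `RoundRobin` struct layout, and the `NextProxy` signature are assumptions rather than the project's actual API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// FindProxyTimeout bounds proxy selection; the value here is an assumption.
const FindProxyTimeout = 5 * time.Second

// RoundRobin is a hypothetical strategy that cycles through a fixed proxy list.
type RoundRobin struct {
	proxies []string
	next    atomic.Uint64
}

// NextProxy returns the next proxy in rotation, or an error if the context
// has already been cancelled or has passed its deadline.
func (r *RoundRobin) NextProxy(ctx context.Context) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", fmt.Errorf("proxy selection aborted: %w", err)
	}
	if len(r.proxies) == 0 {
		return "", errors.New("no proxies available")
	}
	n := r.next.Add(1) - 1
	return r.proxies[n%uint64(len(r.proxies))], nil
}

// pick selects n proxies under a single FindProxyTimeout deadline.
func pick(r *RoundRobin, n int) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), FindProxyTimeout)
	defer cancel()
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		p, err := r.NextProxy(ctx)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
	}
	return out, nil
}

// pickAfterCancel shows that selection fails once the context is cancelled.
func pickAfterCancel(r *RoundRobin) error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := r.NextProxy(ctx)
	return err
}

func main() {
	rr := &RoundRobin{proxies: []string{"proxy-a", "proxy-b"}}
	fmt.Println(pick(rr, 3))
	fmt.Println(pickAfterCancel(rr))
}
```
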
Add graceful handling for raft cluster bootstrap when the cluster is already initialized, preventing unnecessary errors and improving startup robustness. Log an informative message when skipping bootstrap due to existing cluster configuration.
Add comments to clarify the purpose of AddPeer and GetPeerInfo gRPC request handlers, improving code readability and documentation for Raft RPC server methods.
Improve input validation and error handling for Raft RPC methods:
- Add null and empty field checks for AddPeer and RemovePeer requests
- Provide more descriptive error messages
- Refactor GetPeerInfo to handle non-existent peer cases
- Ensure consistent error response formatting
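
The nil and empty-field checks could take a shape like the sketch below. The `AddPeerRequest` struct and its field names are hypothetical stand-ins for the generated protobuf type, and a real gRPC handler would likely map these failures to an invalid-argument status, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// AddPeerRequest is a hypothetical stand-in for the generated protobuf message.
type AddPeerRequest struct {
	PeerID      string
	PeerAddress string
	GRPCAddress string
}

// validateAddPeerRequest rejects nil requests and requests with empty fields,
// returning a message that names the offending field.
func validateAddPeerRequest(req *AddPeerRequest) error {
	switch {
	case req == nil:
		return errors.New("AddPeer request is nil")
	case req.PeerID == "":
		return errors.New("AddPeer request: peer ID is empty")
	case req.PeerAddress == "":
		return errors.New("AddPeer request: peer address is empty")
	case req.GRPCAddress == "":
		return errors.New("AddPeer request: gRPC address is empty")
	}
	return nil
}

func main() {
	fmt.Println(validateAddPeerRequest(nil))
	fmt.Println(validateAddPeerRequest(&AddPeerRequest{PeerID: "node2"}))
	fmt.Println(validateAddPeerRequest(&AddPeerRequest{
		PeerID:      "node2",
		PeerAddress: "127.0.0.1:2222",
		GRPCAddress: "127.0.0.1:50052",
	}))
}
```
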
…or creation

Simplify error creation in RPC methods by using errors.New instead of fmt.Errorf, improving code consistency and removing unnecessary formatting overhead.
Enhance Raft cluster operations with:
- Robust LeaveCluster method with timeout and logging
- Comprehensive peer validation in FSM
- Metrics tracking for peer additions, updates, and removals
- Improved error handling and state checks
- Added validation for peer payload addresses
…ions

Improve API documentation for Raft peer-related methods and messages:
- Add detailed descriptions and examples for GetPeers, AddPeer, and RemovePeer RPC methods
- Include comprehensive field descriptions for PeersResponse, PeerInfo, AddPeerRequest, and related response messages
- Update Swagger/OpenAPI specifications with more informative operation and schema descriptions
- Improve README.md documentation for peer-related message fields
Implement thorough test cases for LeaveCluster method covering:
- Single node cluster
- Follower leaving multi-node cluster
- Leader leaving multi-node cluster
- Handling nil node scenarios
- Verifying cluster state after node departure

Enhance test coverage for Raft cluster management and node removal logic.
Implement thorough test suites for Raft Node methods:
- GetPeers: Test peer retrieval in various cluster configurations
- GetLeaderClient: Verify leader client retrieval in single and multi-node clusters
- Shutdown: Validate node shutdown behavior with different scenarios

Enhance test coverage for Raft node management, improving reliability and robustness of cluster operations.
Implement thorough test suites for Raft RPC server methods:
- AddPeer: Test peer addition with various input scenarios
- RemovePeer: Validate peer removal in different conditions
- GetPeerInfo: Verify peer information retrieval

Enhance test coverage for Raft RPC server operations, improving reliability and robustness of cluster management methods.

github-actions bot commented Feb 18, 2025

Overview

Image reference: ghcr.io/gatewayd-io/gatewayd:991c067 vs. gatewaydio/gatewayd:latest
- digest: 2e33dffa8b4f vs. 383013efa302
- tag: 991c067 vs. latest
- provenance: b6df86a
- vulnerabilities: critical: 0 high: 0 medium: 0 low: 0 vs. critical: 1 high: 3 medium: 6 low: 0
- platform: linux/amd64 vs. linux/amd64
- size: 20 MB vs. 18 MB (-2.7 MB)
- packages: 145 vs. 140 (-5)
Base image: alpine:3 (also known as: 3.21, 3.21.3, latest) vs. alpine:3.20 (also known as: 3, latest)
- vulnerabilities: critical: 0 high: 0 medium: 0 low: 0 vs. critical: 0 high: 1 medium: 3 low: 0
Packages and Vulnerabilities (55 package changes and 4 vulnerability changes)
  • ➕ 2 packages added
  • ➖ 6 packages removed
  • ♾️ 47 packages changed
  • 87 packages unchanged
  • ❗ 4 vulnerabilities added
Changes for packages of type apk (19 changes)
Package | Version (ghcr.io/gatewayd-io/gatewayd:991c067) | Version (gatewaydio/gatewayd:latest)
alpine-base 3.21.3-r0
♾️ alpine-baselayout 3.6.8-r1 3.6.5-r0
♾️ alpine-baselayout-data 3.6.8-r1 3.6.5-r0
♾️ alpine-keys 2.5-r0 2.4-r1
alpine-release 3.21.3-r0
♾️ apk-tools 2.14.6-r3 2.14.4-r0
♾️ busybox 1.37.0-r12 1.36.1-r29
♾️ busybox-binsh 1.37.0-r12 1.36.1-r29
ca-certificates 20241121-r1
♾️ ca-certificates-bundle 20241121-r1 20240705-r0
♾️ libcrypto3 3.3.3-r0 3.3.2-r0
♾️ libssl3 3.3.3-r0 3.3.2-r0
♾️ musl 1.2.5-r9 1.2.5-r0
♾️ musl-utils 1.2.5-r9 1.2.5-r0
critical: 0 high: 1 medium: 0 low: 0
Added vulnerabilities (1):
  • high : CVE-2025-26519
openssl 3.3.3-r0
pax-utils 1.3.8-r1
♾️ scanelf 1.3.8-r1 1.3.7-r2
♾️ ssl_client 1.37.0-r12 1.36.1-r29
♾️ zlib 1.3.1-r2 1.3.1-r1
Changes for packages of type golang (36 changes)
Package | Version (ghcr.io/gatewayd-io/gatewayd:991c067) | Version (gatewaydio/gatewayd:latest)
♾️ github.com/envoyproxy/protoc-gen-validate 1.2.1 1.1.0
♾️ github.com/gatewayd-io/gatewayd (devel) 0.0.0-20241214123014-b6df86a6fe94
♾️ github.com/gatewayd-io/gatewayd-plugin-sdk 0.4.0 0.3.5
♾️ github.com/getsentry/sentry-go 0.31.1 0.30.0
♾️ github.com/go-git/go-billy/v5 5.6.0 5.5.0
♾️ github.com/go-git/go-git/v5 5.13.0 5.12.0
critical: 1 high: 1 medium: 0 low: 0
Added vulnerabilities (2):
  • critical : CVE-2025-21613
  • high : CVE-2025-21614
♾️ github.com/google/go-github/v53 68.0.0 53.2.0
♾️ github.com/grpc-ecosystem/grpc-gateway/v2 2.26.0 2.24.0
github.com/hashicorp/go-metrics 0.5.4
♾️ github.com/hashicorp/go-plugin 1.6.3 1.6.2
♾️ github.com/hashicorp/raft 1.7.2 1.7.1
♾️ github.com/hashicorp/raft-boltdb 0.0.0-20250113192317-e8660f88bcc9 0.0.0-20241202213821-f9dd2ba30efd
♾️ github.com/invopop/jsonschema 0.13.0 0.12.0
♾️ github.com/jackc/pgx/v5 5.7.2 5.7.1
♾️ github.com/mattn/go-colorable 0.1.14 0.1.13
♾️ github.com/pganalyze/pg_query_go/v5 6.0.0 5.1.0
♾️ github.com/prometheus/common 0.62.0 0.61.0
♾️ github.com/protonmail/go-crypto 1.1.3 1.0.0
♾️ github.com/spf13/cast 1.7.1 1.7.0
♾️ github.com/wasilibs/go-pgquery 0.0.0-20241226024732-8bfaa0ac5969 0.0.0-20241011013927-817756c5aae4
♾️ github.com/wasilibs/wazero-helpers 0.0.0-20250123031827-cd30c44769bb 0.0.0-20240620070341-3dff1577cd52
♾️ go.opentelemetry.io/otel 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/metric 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/sdk 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/trace 1.34.0 1.33.0
♾️ go.opentelemetry.io/proto/otlp 1.5.0 1.4.0
♾️ golang.org/x/crypto 0.32.0 0.31.0
golang.org/x/exp 0.0.0-20241210194714-1829a127f884
♾️ golang.org/x/net 0.34.0 0.32.0
critical: 0 high: 1 medium: 0 low: 0
Added vulnerabilities (1):
  • high : CVE-2024-45338
golang.org/x/oauth2 0.24.0
♾️ golang.org/x/sys 0.29.0 0.28.0
♾️ google.golang.org/genproto/googleapis/rpc 0.0.0-20250124145028-65684f501c47 0.0.0-20241209162323-e6fa225c2576
♾️ google.golang.org/grpc 1.70.0 1.69.0
♾️ google.golang.org/protobuf 1.36.4 1.35.2
♾️ stdlib go1.23.6 1.23.4

@sinadarbouy sinadarbouy marked this pull request as ready for review February 22, 2025 17:28
@sinadarbouy sinadarbouy requested a review from mostafa February 22, 2025 17:28
Simplify Raft RPC response messages by removing the redundant error field across multiple protobuf message types:
- ForwardApplyResponse
- AddPeerResponse
- RemovePeerResponse

Update related code to handle errors without relying on the error string field, improving error handling consistency and reducing unnecessary message complexity.
Modify Raft peer metrics to support node-specific tracking:
- Convert RaftPeerRemovals, RaftPeerAdditions, and RaftPeerUpdates to labeled CounterVec
- Remove redundant RaftPeerUpdates metric
- Update metric incrementation to include node ID labels
- Simplify peer tracking logic in FSM Apply method
Move the FindProxyTimeout constant from network/server.go to config/constants.go to centralize configuration and improve code organization. Update the server implementation to use the new constant location.
Member

@mostafa mostafa left a comment

Thank you for your contribution! LGTM!

@mostafa mostafa merged commit a0d19e5 into main Feb 23, 2025
4 checks passed
@mostafa mostafa deleted the feature/add-raft-cluster-peer-management branch February 23, 2025 17:33
Labels
None yet
Projects
None yet
3 participants