Description
consul version
for both Client and Server
Client: v0.7.5
Server: v0.7.5
consul info
for both Client and Server
Client:
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 1
build:
prerelease =
revision = '21f2d5a
version = 0.7.5
consul:
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 192.168.200.101:8300
server = true
raft:
applied_index = 31190
commit_index = 31190
fsm_pending = 0
last_contact = 20.657551ms
last_log_index = 31190
last_log_term = 18
last_snapshot_index = 24964
last_snapshot_term = 18
latest_configuration = [{Suffrage:Voter ID:192.168.200.101:8300 Address:192.168.200.101:8300} {Suffrage:Voter ID:192.168.200.102:8300 Address:192.168.200.102:8300}]
latest_configuration_index = 610
num_peers = 1
protocol_version = 1
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 18
runtime:
arch = amd64
cpu_count = 2
goroutines = 59
max_procs = 2
os = linux
version = go1.7.5
serf_lan:
encrypted = false
event_queue = 0
event_time = 3
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 5
members = 2
query_queue = 0
query_time = 1
serf_wan:
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 2
members = 1
query_queue = 0
query_time = 1
Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 1
build:
prerelease =
revision = '21f2d5a
version = 0.7.5
consul:
bootstrap = true
known_datacenters = 1
leader = true
leader_addr = 192.168.200.101:8300
server = true
raft:
applied_index = 31169
commit_index = 31169
fsm_pending = 0
last_contact = 0
last_log_index = 31169
last_log_term = 18
last_snapshot_index = 24964
last_snapshot_term = 18
latest_configuration = [{Suffrage:Voter ID:192.168.200.101:8300 Address:192.168.200.101:8300} {Suffrage:Voter ID:192.168.200.102:8300 Address:192.168.200.102:8300}]
latest_configuration_index = 610
num_peers = 1
protocol_version = 1
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 18
runtime:
arch = amd64
cpu_count = 2
goroutines = 65
max_procs = 2
os = linux
version = go1.7.5
serf_lan:
encrypted = false
event_queue = 0
event_time = 3
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 5
members = 2
query_queue = 0
query_time = 1
serf_wan:
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 2
members = 1
query_queue = 0
query_time = 1
Operating system and Environment details
I use two CentOS 7.3 VMs named n1 and n2:
#uname -a
Linux n1 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
#rpm --query centos-release
centos-release-7-3.1611.el7.centos.x86_64
Description of the Issue (and unexpected/desired result)
We are developing a distributed system on top of Consul for which data consistency is critical.
We run different tests (in this case with two Consul cluster members) in which we write a lot of keys through local clients of our application while simulating different network problems. Afterwards we collect all the data and verify its consistency (unconfirmed writes, lost writes, etc.). We ran into a problem where we disconnected our two cluster nodes and got an HTTP 500 response from the Consul KV API when we tried to write a key:
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message [*3\r\n$3\r\nSET\r\n$20\r\nkey_n1_1491928360134\r\n$16\r\nn1_1491928360134\r\n]
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message as [LEADER]
2017-04-11T16:32:42.678Z [ClientHandler-92] DEBUG [ConsulKvStore] setKv failed: key [service/statestore/log/0/000000000000000018100000] value [[B@16f977c0]
com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='leadership lost while committing log')
We assumed that this write had failed and sent an error message back to our client, but after reconnecting the cluster we noticed that the key had actually been written:
#curl -s localhost:8500/v1/kv/service/statestore/log/0/000000000000000018100000 | python -m json.tool
[
{
"CreateIndex": 383,
"Flags": 0,
"Key": "service/statestore/log/0/000000000000000018100000",
"LockIndex": 0,
"ModifyIndex": 383,
"Value": "KjMNCiQzDQpTRVQNCiQyMA0Ka2V5X24xXzE0OTE5MjgzNjAxMzQNCiQxNg0KbjFfMTQ5MTkyODM2MDEzNA0K"
}
]
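For illustration, this is roughly the write-then-verify pattern we ended up with, sketched against the plain HTTP KV API instead of the ecwid client we actually use. The key and value are taken from the logs above; the class and method names are ours and only for this sketch, and anything other than a 200 response is treated as an unknown outcome rather than a guaranteed failure:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Minimal write-then-verify helper against the plain HTTP KV API.
public class KvWriteCheck {

    static final String BASE = "http://localhost:8500/v1/kv/";

    // PUT a key. A 200 means the write was committed and applied. Anything
    // else (including the 500 "leadership lost while committing log") is
    // treated as an UNKNOWN outcome, not as a guaranteed failure.
    static void put(String key, byte[] value) throws IOException {
        HttpURLConnection c = (HttpURLConnection) new URL(BASE + key).openConnection();
        c.setRequestMethod("PUT");
        c.setDoOutput(true);
        try (OutputStream out = c.getOutputStream()) {
            out.write(value);
        }
        int status = c.getResponseCode();
        if (status != 200) {
            throw new IOException("ambiguous write for " + key + ", HTTP " + status);
        }
    }

    // After the cluster is healthy again, check whether the key made it in.
    // GET /v1/kv/<key>?raw returns the stored bytes directly, or 404 if absent.
    static byte[] readRaw(String key) throws IOException {
        HttpURLConnection c = (HttpURLConnection) new URL(BASE + key + "?raw").openConnection();
        if (c.getResponseCode() == 404) {
            return null; // the write really was lost
        }
        try (InputStream in = c.getInputStream()) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        String key = "service/statestore/log/0/000000000000000018100000"; // key from the log above
        byte[] value = "n1_1491928360134".getBytes(StandardCharsets.UTF_8);
        try {
            put(key, value);
        } catch (IOException ambiguous) {
            // Outcome unknown: re-check once the cluster has a leader again.
            System.out.println(key + " present after recovery: " + (readRaw(key) != null));
        }
    }
}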
Our problem is that we do not know whether the write operation was successful when we receive an HTTP 500 status code from the Consul KV API. Is it safe to assume that the write succeeded, or is it possible that the key is not in the database after the cluster is healthy again? We might also replace one of the nodes after it has failed, and I can imagine that the key could be lost after doing that.
I would like to know whether we have any guarantee about the state of a key when we receive an HTTP 500 (with the message 'leadership lost while committing log') after sending a PUT request to the Consul KV API, and whether there is a way to avoid this situation.
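One mitigation we are considering, which relies on the assumption that every key in our workload is written exactly once: send the PUT with ?cas=0 so it only succeeds if the key does not exist yet. An ambiguous 500 can then simply be retried once the cluster has a leader again; the retry either creates the key or reports false because the earlier "failed" write was committed after all, and either way the ambiguity is resolved. A sketch of such a helper (same localhost:8500 assumption as above):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class KvPutIfAbsent {

    // PUT with check-and-set index 0: the write only succeeds if the key does
    // not exist yet. Consul answers 200 with a JSON boolean body, so "false"
    // here means the key was already present (e.g. an earlier "failed" write
    // was committed after all). A non-200 response still surfaces as an
    // IOException and stays ambiguous until retried.
    static boolean putIfAbsent(String key, byte[] value) throws IOException {
        URL url = new URL("http://localhost:8500/v1/kv/" + key + "?cas=0");
        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setRequestMethod("PUT");
        c.setDoOutput(true);
        try (OutputStream out = c.getOutputStream()) {
            out.write(value);
        }
        try (InputStream in = c.getInputStream()) {
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8).trim();
            return Boolean.parseBoolean(body);
        }
    }
}

This only helps for write-once keys like ours; for keys that are updated in place we would still need the read-back check above.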
Reproduction steps
- Create two VMs within the same network
- Set up a two-node Consul cluster
- Start sending PUT requests for unique keys on each node (using curl or something else; a driver sketch follows this list)
- Disconnect the network between the nodes
- Check whether one of the nodes reported an HTTP 500 status code with the message 'leadership lost while committing log' (otherwise restart the test)
- Repair the network
- Check whether the key that got the HTTP 500 response is in the Consul KV store (we assumed it would not be there, but at least in some cases it had been written)
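For completeness, the driver we use for these steps looks roughly like the following sketch (it reuses the put/readRaw helpers from the sketch further up; the key naming and the iteration count are arbitrary, any unique keys will do):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReproDriver {

    public static void main(String[] args) throws IOException {
        List<String> ambiguous = new ArrayList<>();

        // Keep writing unique keys while the network fault is injected.
        for (int i = 0; i < 100_000; i++) {
            String key = "test/key_n1_" + System.currentTimeMillis() + "_" + i;
            try {
                KvWriteCheck.put(key, ("n1_" + i).getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                ambiguous.add(key); // HTTP 500 or connection error: outcome unknown
            }
        }

        // After repairing the network, see which of the "failed" writes
        // actually made it into the KV store.
        for (String key : ambiguous) {
            System.out.println(key + " present=" + (KvWriteCheck.readRaw(key) != null));
        }
    }
}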
Log Fragments or Link to gist
Log from our own application
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message [*3\r\n$3\r\nSET\r\n$20\r\nkey_n1_1491928360134\r\n$16\r\nn1_1491928360134\r\n]
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message as [LEADER]
2017-04-11T16:32:42.678Z [ClientHandler-92] DEBUG [ConsulKvStore] setKv failed: key [service/statestore/log/0/000000000000000018100000] value [[B@16f977c0]
com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='leadership lost while committing log')
Consul log from syslog
Apr 11 16:32:42 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Failed to contact 192.168.200.102:8300 in 2.500219397s
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Failed to contact quorum of nodes, stepping down
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Node at 192.168.200.101:8300 [Follower] entering Follower state (Leader: "")
Apr 11 16:32:42 default-centos-7 consul[7046]: consul: cluster leadership lost
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: aborting pipeline replication to peer {Voter 192.168.200.102:8300 192.168.200.102:8300}
Apr 11 16:32:42 default-centos-7 consul[7046]: consul.kvs: Apply failed: leadership lost while committing log
Apr 11 16:32:42 default-centos-7 consul[7046]: http: Request PUT /v1/kv/service/statestore/log/0/000000000000000018100000, error: leadership lost while committing log from=127.0.0.1:50176
Apr 11 16:32:45 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:47 default-centos-7 consul[7046]: memberlist: Marking n2 as failed, suspect timeout reached (0 peer confirmations)
Apr 11 16:32:47 default-centos-7 consul[7046]: serf: EventMemberFailed: n2 192.168.200.102
Apr 11 16:32:47 default-centos-7 consul[7046]: consul: Removing LAN server n2 (Addr: tcp/192.168.200.102:8300) (DC: dc0)
Apr 11 16:32:49 default-centos-7 consul[7046]: serf: attempting reconnect to n2 192.168.200.102:8301
Apr 11 16:32:49 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:50 default-centos-7 consul[7046]: agent: coordinate update error: No cluster leader
Apr 11 16:32:50 default-centos-7 consul[7046]: raft: Failed to heartbeat to 192.168.200.102:8300: read tcp 192.168.200.101:33154->192.168.200.102:8300: i/o timeout
Apr 11 16:32:51 default-centos-7 consul[7046]: raft: Heartbeat timeout from "" reached, starting election
Apr 11 16:32:51 default-centos-7 consul[7046]: raft: Node at 192.168.200.101:8300 [Candidate] entering Candidate state in term 3
Apr 11 16:32:57 default-centos-7 consul[7046]: raft: Election timeout reached, restarting election