
Potential inconsistency after 'leadership lost while committing log' #2893

Closed
@stefan-wenger

Description

consul version for both Client and Server

Client: v0.7.5
Server: v0.7.5

consul info for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease = 
	revision = '21f2d5a
	version = 0.7.5
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 192.168.200.101:8300
	server = true
raft:
	applied_index = 31190
	commit_index = 31190
	fsm_pending = 0
	last_contact = 20.657551ms
	last_log_index = 31190
	last_log_term = 18
	last_snapshot_index = 24964
	last_snapshot_term = 18
	latest_configuration = [{Suffrage:Voter ID:192.168.200.101:8300 Address:192.168.200.101:8300} {Suffrage:Voter ID:192.168.200.102:8300 Address:192.168.200.102:8300}]
	latest_configuration_index = 610
	num_peers = 1
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 18
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 59
	max_procs = 2
	os = linux
	version = go1.7.5
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 3
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2
	members = 1
	query_queue = 0
	query_time = 1

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease = 
	revision = '21f2d5a
	version = 0.7.5
consul:
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 192.168.200.101:8300
	server = true
raft:
	applied_index = 31169
	commit_index = 31169
	fsm_pending = 0
	last_contact = 0
	last_log_index = 31169
	last_log_term = 18
	last_snapshot_index = 24964
	last_snapshot_term = 18
	latest_configuration = [{Suffrage:Voter ID:192.168.200.101:8300 Address:192.168.200.101:8300} {Suffrage:Voter ID:192.168.200.102:8300 Address:192.168.200.102:8300}]
	latest_configuration_index = 610
	num_peers = 1
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 18
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 65
	max_procs = 2
	os = linux
	version = go1.7.5
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 3
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

I use two CentOS 7.3 VMs named n1 and n2:

#uname -a
Linux n1 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
#rpm --query centos-release
centos-release-7-3.1611.el7.centos.x86_64

Description of the Issue (and unexpected/desired result)

We are developing a distributed system on top of Consul for which data consistency is critical.
We run various tests (in this case with a two-member Consul cluster) in which we write a large number of keys using local clients of our application and simulate different network problems. Afterwards we collect all the data and verify its consistency (unconfirmed writes, lost writes, etc.). We ran into a problem where we disconnected our two cluster nodes and got an HTTP 500 response from the Consul KV API when we tried to write a key:

2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message [*3\r\n$3\r\nSET\r\n$20\r\nkey_n1_1491928360134\r\n$16\r\nn1_1491928360134\r\n]
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message as [LEADER]
2017-04-11T16:32:42.678Z [ClientHandler-92] DEBUG [ConsulKvStore] setKv failed: key [service/statestore/log/0/000000000000000018100000] value [[B@16f977c0]
com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='leadership lost while committing log')

We assumed that this write had failed and sent an error message back to our client, but after reconnecting the cluster we noticed that the key had actually been written.

#curl -s localhost:8500/v1/kv/service/statestore/log/0/000000000000000018100000 | python -m json.tool
[
    {
        "CreateIndex": 383,
        "Flags": 0,
        "Key": "service/statestore/log/0/000000000000000018100000",
        "LockIndex": 0,
        "ModifyIndex": 383,
        "Value": "KjMNCiQzDQpTRVQNCiQyMA0Ka2V5X24xXzE0OTE5MjgzNjAxMzQNCiQxNg0KbjFfMTQ5MTkyODM2MDEzNA0K"
    }
]
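
For completeness, the Value field returned by the KV API is base64-encoded. A quick read-back (a sketch only, assuming the Python requests library and the local agent on 127.0.0.1:8500) decodes it and shows the same payload that received the 500:

# Sketch: fetch the key above from the local agent and decode its base64 Value.
# Assumes the `requests` library and an agent listening on 127.0.0.1:8500.
import base64
import requests

KEY = "service/statestore/log/0/000000000000000018100000"

entry = requests.get(f"http://127.0.0.1:8500/v1/kv/{KEY}").json()[0]
print(base64.b64decode(entry["Value"]))
# -> b'*3\r\n$3\r\nSET\r\n$20\r\nkey_n1_1491928360134\r\n$16\r\nn1_1491928360134\r\n'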

Our problem is that we do not know whether the write operation succeeded when we receive an HTTP 500 status code from the Consul KV API. Is it safe to assume that the write succeeded, or is it possible that the key is not in the database after the cluster is healthy again? We might also replace one of the nodes after it fails, and I can imagine that the key could be lost after doing that.
I would like to know whether we have any guarantee about the state of a key when we receive an HTTP 500 status (with the message 'leadership lost while committing log') after sending a PUT request to the Consul KV API, and whether there is a way to avoid this situation.
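
One way we could handle this on the client side (a sketch only, not something we know Consul guarantees; it assumes the Python requests library, the local agent on 127.0.0.1:8500, and that each key is written exactly once) is to treat such a 500 as an indeterminate outcome: read the key back once the cluster is healthy again, and if a retry is needed, issue it with ?cas=0 so it can never overwrite a value that did get committed:

# Sketch: treat an HTTP 500 from the KV API (e.g. "leadership lost while
# committing log") as an unknown outcome and resolve it later by reading
# the key back. Assumes `requests` and a local agent on 127.0.0.1:8500.
import base64
import requests

CONSUL = "http://127.0.0.1:8500"

def put_once(key, value):
    """PUT `value` (bytes) to `key`; return 'written', 'unknown', or raise."""
    resp = requests.put(f"{CONSUL}/v1/kv/{key}", data=value)
    if resp.status_code == 200:
        return "written"
    if resp.status_code == 500:
        return "unknown"              # leadership may have been lost mid-commit
    resp.raise_for_status()

def resolve_unknown(key, value):
    """After an 'unknown' result: check whether the write landed; if not,
    retry with cas=0 so we never clobber a value that was committed."""
    resp = requests.get(f"{CONSUL}/v1/kv/{key}")
    if resp.status_code == 200:
        stored = base64.b64decode(resp.json()[0]["Value"])
        return stored == value        # the original write did land
    # Key absent: retry, but only create it if it still does not exist.
    retry = requests.put(f"{CONSUL}/v1/kv/{key}", params={"cas": 0}, data=value)
    return retry.status_code == 200 and retry.text.strip() == "true"

This only narrows the window, though; as far as we can tell the 500 on its own tells us nothing either way, which is exactly what we would like to have confirmed.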

Reproduction steps

  • Create two VMs within the same network
  • Set up a two-node Consul cluster
  • Start sending PUT requests for unique keys on each node (using curl or something else; a simple write loop is sketched after this list)
  • Disconnect the network between the nodes
  • Check whether one of the nodes reported an HTTP 500 status code with the message 'leadership lost while committing log' (otherwise restart the test)
  • Repair the network
  • Check whether the key that received the HTTP 500 response is in the Consul KV store (we assumed it would not be there, but at least in some cases it had been written).
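
A minimal version of the write loop from step 3, as a sketch (the key naming, the 50 ms interval, and the Python requests dependency are our own choices; any curl loop that writes unique keys works just as well):

# Sketch of the write loop for step 3: keep PUTting unique keys to the local
# agent and report any request that does not come back with HTTP 200.
# Assumes the `requests` library and an agent on 127.0.0.1:8500.
import time
import requests

CONSUL = "http://127.0.0.1:8500"
NODE = "n1"                       # run a second copy with "n2" on the other VM

while True:
    stamp = int(time.time() * 1000)
    key = f"test/{NODE}/{stamp}"  # unique key per request (naming is arbitrary)
    try:
        resp = requests.put(f"{CONSUL}/v1/kv/{key}", data=f"{NODE}_{stamp}")
        if resp.status_code != 200:
            print(f"{key}: HTTP {resp.status_code}: {resp.text.strip()}")
    except requests.ConnectionError as err:
        print(f"{key}: connection error: {err}")
    time.sleep(0.05)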

Log Fragments or Link to gist

Log from our own application

2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message [*3\r\n$3\r\nSET\r\n$20\r\nkey_n1_1491928360134\r\n$16\r\nn1_1491928360134\r\n]
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message as [LEADER]
2017-04-11T16:32:42.678Z [ClientHandler-92] DEBUG [ConsulKvStore] setKv failed: key [service/statestore/log/0/000000000000000018100000] value [[B@16f977c0]
com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='leadership lost while committing log')

Consul log from syslog

Apr 11 16:32:42 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Failed to contact 192.168.200.102:8300 in 2.500219397s
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Failed to contact quorum of nodes, stepping down
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Node at 192.168.200.101:8300 [Follower] entering Follower state (Leader: "")
Apr 11 16:32:42 default-centos-7 consul[7046]: consul: cluster leadership lost
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: aborting pipeline replication to peer {Voter 192.168.200.102:8300 192.168.200.102:8300}
Apr 11 16:32:42 default-centos-7 consul[7046]: consul.kvs: Apply failed: leadership lost while committing log
Apr 11 16:32:42 default-centos-7 consul[7046]: http: Request PUT /v1/kv/service/statestore/log/0/000000000000000018100000, error: leadership lost while committing log from=127.0.0.1:50176
Apr 11 16:32:45 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:47 default-centos-7 consul[7046]: memberlist: Marking n2 as failed, suspect timeout reached (0 peer confirmations)
Apr 11 16:32:47 default-centos-7 consul[7046]: serf: EventMemberFailed: n2 192.168.200.102
Apr 11 16:32:47 default-centos-7 consul[7046]: consul: Removing LAN server n2 (Addr: tcp/192.168.200.102:8300) (DC: dc0)
Apr 11 16:32:49 default-centos-7 consul[7046]: serf: attempting reconnect to n2 192.168.200.102:8301
Apr 11 16:32:49 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:50 default-centos-7 consul[7046]: agent: coordinate update error: No cluster leader
Apr 11 16:32:50 default-centos-7 consul[7046]: raft: Failed to heartbeat to 192.168.200.102:8300: read tcp 192.168.200.101:33154->192.168.200.102:8300: i/o timeout
Apr 11 16:32:51 default-centos-7 consul[7046]: raft: Heartbeat timeout from "" reached, starting election
Apr 11 16:32:51 default-centos-7 consul[7046]: raft: Node at 192.168.200.101:8300 [Candidate] entering Candidate state in term 3
Apr 11 16:32:57 default-centos-7 consul[7046]: raft: Election timeout reached, restarting election

