
Stage buffer sometimes sticks around and doesn't ever get queued #4662

Open
@stanhu

Description

Describe the bug

For the last week I've been trying to track down what looks like a memory leak: a stage buffer doesn't get cleared out even though new data keeps arriving. In my latest attempt to isolate the problem, I noticed a jump to 8 MB in the fluentd_output_status_buffer_stage_byte_size Prometheus metric, which measures the total bytes of chunks sitting in the stage:

[Screenshot: fluentd_output_status_buffer_stage_byte_size jumping to ~8 MB and staying there]

This jump appears to persist indefinitely until I restart fluentd.
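
For context, that metric is exposed by fluent-plugin-prometheus. A minimal sketch of the monitoring sources that surface it (the bind address, port, and interval here are illustrative, not necessarily what we run) looks like:

<source>
  @type prometheus                  # serves /metrics for Prometheus to scrape
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor   # exports fluentd_output_status_buffer_stage_byte_size and related gauges
  interval 10
</source>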

To Reproduce

I'm still working on this.

Expected behavior

No memory growth over time.

Your Environment

- Fluentd version: v1.16.5
- Package version: 5.0.4-1
- Operating system: Ubuntu 20.04.6
- Kernel version: 5.15.0-1051-gcp

Your Configuration

I don't have a clear reproduction step yet. Our config looks something like this:

<source>
  @type tail
  tag postgres.postgres
  path /var/log/postgresql/postgresql.log
  pos_file /var/log/fluent/postgres.log.pos
  format /(?<time>[^G]*) GMT \[(?<pg_id>\d+), (?<xid>\d+)\]: .* user=(?<pg_user>[^,]*),db=(?<pg_db>[^,]*),app=(?<pg_application>[^,]*),client=(?<pg_client>[^ ]*) (?<pg_message>.*)/
  time_format %Y-%m-%d %H:%M:%S.%N
</source>

<filter postgres.postgres_csv>
  @type postgresql_slowlog
</filter>

<filter postgres.postgres_csv>
  @type postgresql_redactor
  max_length 200000
</filter>

<match postgres.*>
  @type copy
  <store>
    @type google_cloud
    label_map {
      "tag": "tag"
    }
    buffer_type file
    buffer_path /opt/fluent/buffers/postgres/google_cloud
    buffer_chunk_limit 8MB
    buffer_queue_limit 1000
    flush_interval 30s
    log_level info
  </store>

  <store>
    @type cloud_pubsub
    topic pubsub-postgres-inf-gprd
    project my-project
    buffer_type file
    buffer_path /opt/fluent/buffers/postgres/cloud_pubsub
    buffer_chunk_limit 8MB
    buffer_queue_limit 1000
    flush_interval 30s
  </store>
</match>

Your Error Log

The stuck 8 MB buffer seems to have coincided with an EOF error, followed by a failure to purge the buffer chunk:

  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer/file_chunk.rb:170:in `open'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/event.rb:318:in `each'
  2024-10-08 10:40:21 +0000 [error]: #0 /etc/fluent/plugin/out_cloud_pubsub.rb:62:in `write'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/compat/output.rb:131:in `write'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-10-08 10:40:21 +0000 [error]: #0 failed to purge buffer chunk chunk_id="623f4c358bd4b7cd7f63a4eb7410b459" error_class=Errno::ENOENT error=#<Errno::ENOENT: No such file or directory @ apply2files - /opt/fluent/buffers/postgres/cloud_pubsub/buffer.b623f4c358bd4b7cd7f63a4eb7410b459.log>
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer/file_chunk.rb:161:in `unlink'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer/file_chunk.rb:161:in `purge'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer.rb:601:in `block in purge_chunk'
  2024-10-08 10:40:21 +0000 [error]: #0 /opt/fluent/lib/ruby/3.2.0/monitor.rb:202:in `synchronize'
  2024-10-08 10:40:21 +0000 [error]: #0 /opt/fluent/lib/ruby/3.2.0/monitor.rb:202:in `mon_synchronize'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer.rb:592:in `purge_chunk'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1110:in `commit_write'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1229:in `try_flush'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-10-08 10:40:21.470273004 +0000 fluent.error: {"chunk_id":"623f4c358bd4b7cd7f63a4eb7410b459","error_class":"Errno::ENOENT","error":"#<Errno::ENOENT: No such file or directory @ apply2files - /opt/fluent/buffers/postgres/cloud_pubsub/buffer.b623f4c358bd4b7cd7f63a4eb7410b459.log>","message":"failed to purge buffer chunk chunk_id=\"623f4c358bd4b7cd7f63a4eb7410b459\" error_class=Errno::ENOENT error=#<Errno::ENOENT: No such file or directory @ apply2files - /opt/fluent/buffers/postgres/cloud_pubsub/buffer.b623f4c358bd4b7cd7f63a4eb7410b459.log>","tag":"fluent.error","environment":"gprd","hostname":"example.com","fqdn":"example.com","stage":"main","shard":"backup","tier":"db","type":"patroni"}

Additional context

Note that previously when log messages were up to 3 MB, I would see more of these "step" jumps in memory usage. I've altered our filters to truncate the log messages to 200K, which seems to have stopped most of these stage buffer leaks. But I'm still wondering if there is a corner case here where the file buffer got cleared but the stage buffer did not.
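
To illustrate the kind of corner case I mean, here is a purely hypothetical Ruby sketch (not fluentd's actual buffer.rb): if an exception fires after a chunk's file is removed but before the in-memory byte counter is decremented, the stage gauge stays inflated even though the data on disk is gone.

# Hypothetical illustration only -- not fluentd's buffer implementation.
# Shows how a byte gauge can leak when purging a chunk raises partway through.
require 'monitor'

class ToyBuffer
  include MonitorMixin

  def initialize
    super()
    @stage_bytes = 0   # what a metric like buffer_stage_byte_size would report
    @chunks = {}       # chunk_id => [path, bytes]
  end

  def stage(chunk_id, path, bytes)
    synchronize do
      @chunks[chunk_id] = [path, bytes]
      @stage_bytes += bytes
    end
  end

  def purge(chunk_id)
    synchronize do
      path, bytes = @chunks.fetch(chunk_id)
      File.unlink(path)        # raises Errno::ENOENT if the file is already gone
      @chunks.delete(chunk_id)
      @stage_bytes -= bytes    # skipped when unlink raises, so the gauge never drops
    end
  end
end

If something along these lines is happening when the ENOENT above fires, it would explain a file buffer that looks empty on disk while the stage byte metric never goes back down.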
