
Stage buffer sometimes sticks around and doesn't ever get queued #4662

Open
@stanhu

Description

Describe the bug

For the last week I've been trying to track down what looks like a memory leak: a stage buffer doesn't get cleared out even though new data keeps arriving. In my latest attempt to isolate the problem, I noticed a jump to 8 MB in the fluentd_output_status_buffer_stage_byte_size Prometheus metric, which measures the total bytes of chunks sitting in the stage:

[Screenshot: fluentd_output_status_buffer_stage_byte_size jumping to ~8 MB and staying there]

This jump appears to persist indefinitely until I restart fluentd.
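
For context, that metric is exposed by fluent-plugin-prometheus. A minimal sketch of the monitoring sources that surface it (the bind address, port, and interval here are illustrative, not necessarily what we run) looks like:

<source>
  @type prometheus                  # serves /metrics for Prometheus to scrape
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor   # exports fluentd_output_status_buffer_stage_byte_size and related gauges
  interval 10
</source>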

To Reproduce

I'm still working on this.

Expected behavior

No memory growth over time.

Your Environment

- Fluentd version: v1.16.5
- Package version: 5.0.4-1
- Operating system: Ubuntu 20.04.6
- Kernel version: 5.15.0-1051-gcp

Your Configuration

I don't have a clear reproduction step yet. Our config looks something like this:

<source>
  @type tail
  tag postgres.postgres
  path /var/log/postgresql/postgresql.log
  pos_file /var/log/fluent/postgres.log.pos
  format /(?<time>[^G]*) GMT \[(?<pg_id>\d+), (?<xid>\d+)\]: .* user=(?<pg_user>[^,]*),db=(?<pg_db>[^,]*),app=(?<pg_application>[^,]*),client=(?<pg_client>[^ ]*) (?<pg_message>.*)/
  time_format %Y-%m-%d %H:%M:%S.%N
</source>

<filter postgres.postgres_csv>
  @type postgresql_slowlog
</filter>

<filter postgres.postgres_csv>
  @type postgresql_redactor
  max_length 200000
</filter>

<match postgres.*>
  @type copy
  <store>
    @type google_cloud
    label_map {
      "tag": "tag"
    }
    buffer_type file
    buffer_path /opt/fluent/buffers/postgres/google_cloud
    buffer_chunk_limit 8MB
    buffer_queue_limit 1000
    flush_interval 30s
    log_level info
  </store>

  <store>
    @type cloud_pubsub
    topic pubsub-postgres-inf-gprd
    project my-project
    buffer_type file
    buffer_path /opt/fluent/buffers/postgres/cloud_pubsub
    buffer_chunk_limit 8MB
    buffer_queue_limit 1000
    flush_interval 30s
  </store>
</match>

Your Error Log

The stuck 8 MB buffer seems to have coincided with an EOF error, followed by a failure to purge the buffer chunk:

  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer/file_chunk.rb:170:in `open'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/event.rb:318:in `each'
  2024-10-08 10:40:21 +0000 [error]: #0 /etc/fluent/plugin/out_cloud_pubsub.rb:62:in `write'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/compat/output.rb:131:in `write'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-10-08 10:40:21 +0000 [error]: #0 failed to purge buffer chunk chunk_id="623f4c358bd4b7cd7f63a4eb7410b459" error_class=Errno::ENOENT error=#<Errno::ENOENT: No such file or directory @ apply2files - /opt/fluent/buffers/postgres/cloud_pubsub/buffer.b623f4c358bd4b7cd7f63a4eb7410b459.log>
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer/file_chunk.rb:161:in `unlink'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer/file_chunk.rb:161:in `purge'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer.rb:601:in `block in purge_chunk'
  2024-10-08 10:40:21 +0000 [error]: #0 /opt/fluent/lib/ruby/3.2.0/monitor.rb:202:in `synchronize'
  2024-10-08 10:40:21 +0000 [error]: #0 /opt/fluent/lib/ruby/3.2.0/monitor.rb:202:in `mon_synchronize'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer.rb:592:in `purge_chunk'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1110:in `commit_write'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1229:in `try_flush'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2024-10-08 10:40:21 +0000 [error]: #0 /var/lib/fluent/vendor/bundle/ruby/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-10-08 10:40:21.470273004 +0000 fluent.error: {"chunk_id":"623f4c358bd4b7cd7f63a4eb7410b459","error_class":"Errno::ENOENT","error":"#<Errno::ENOENT: No such file or directory @ apply2files - /opt/fluent/buffers/postgres/cloud_pubsub/buffer.b623f4c358bd4b7cd7f63a4eb7410b459.log>","message":"failed to purge buffer chunk chunk_id=\"623f4c358bd4b7cd7f63a4eb7410b459\" error_class=Errno::ENOENT error=#<Errno::ENOENT: No such file or directory @ apply2files - /opt/fluent/buffers/postgres/cloud_pubsub/buffer.b623f4c358bd4b7cd7f63a4eb7410b459.log>","tag":"fluent.error","environment":"gprd","hostname":"example.com","fqdn":"example.com","stage":"main","shard":"backup","tier":"db","type":"patroni"}

Additional context

Note that previously when log messages were up to 3 MB, I would see more of these "step" jumps in memory usage. I've altered our filters to truncate the log messages to 200K, which seems to have stopped most of these stage buffer leaks. But I'm still wondering if there is a corner case here where the file buffer got cleared but the stage buffer did not.
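
To illustrate the kind of corner case I mean, here is a purely hypothetical Ruby sketch (not fluentd's actual buffer.rb): if an exception fires after a chunk's file is removed but before the in-memory byte counter is decremented, the stage gauge stays inflated even though the data on disk is gone.

# Hypothetical illustration only -- not fluentd's buffer implementation.
# Shows how a byte gauge can leak when purging a chunk raises partway through.
require 'monitor'

class ToyBuffer
  include MonitorMixin

  def initialize
    super()
    @stage_bytes = 0   # what a metric like buffer_stage_byte_size would report
    @chunks = {}       # chunk_id => [path, bytes]
  end

  def stage(chunk_id, path, bytes)
    synchronize do
      @chunks[chunk_id] = [path, bytes]
      @stage_bytes += bytes
    end
  end

  def purge(chunk_id)
    synchronize do
      path, bytes = @chunks.fetch(chunk_id)
      File.unlink(path)        # raises Errno::ENOENT if the file is already gone
      @chunks.delete(chunk_id)
      @stage_bytes -= bytes    # skipped when unlink raises, so the gauge never drops
    end
  end
end

If something along these lines is happening when the ENOENT above fires, it would explain a file buffer that looks empty on disk while the stage byte metric never goes back down.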
