
Improve the performance of Juicefs gc #5671

Open
@SonglinLife

Description

When dealing with a large number of pending deleted files, we typically use juicefs gc to remove them. However, the juicefs gc command invokes the scanPendingFiles function, which processes each soft-deleted file sequentially.

juicefs/pkg/meta/tkv.go

Lines 2584 to 2594 in 2dd3897

	for key, value := range pairs {
		if len(key) != klen {
			return fmt.Errorf("invalid key %x", key)
		}
		ino := m.decodeInode([]byte(key)[1:9])
		size := binary.BigEndian.Uint64([]byte(key)[9:])
		ts := m.parseInt64(value)
		clean, err := scan(ino, size, ts)
		if err != nil {
			return err
		}

The issue is that when there are many files with relatively small individual sizes, deletion throughput drops significantly. Although gc provides a --threads parameter, that parameter only parallelizes deletion within a single file, not across files.

I think we can solve this in tkv.go by processing the files pending deletion in parallel. The code is provided below. Please help me review it. :)

	batchSize := 1000000

	threads := max(1, m.conf.MaxDeletes/3) // max, not min: min(1, ...) would cap the pool at one worker
	deleteFileChan := make(chan pair, threads)
	var wg sync.WaitGroup

	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pair := range deleteFileChan {
				key, value := pair.key, pair.value
				if len(key) != klen {
					logger.Errorf("invalid key %x", key)
					continue
				}
				ino := m.decodeInode([]byte(key)[1:9])
				size := binary.BigEndian.Uint64([]byte(key)[9:])
				ts := m.parseInt64(value)
				clean, err := scan(ino, size, ts)
				if err != nil {
					logger.Errorf("scan pending deleted files: %s", err)
					continue
				}
				if clean {
					m.doDeleteFileData(ino, size)
				}
			}
		}()
	}

	prefixKey := m.fmtKey("D")
	endKey := nextKey(prefixKey)
	for {
		keys, values, err := m.scan(prefixKey, endKey, batchSize, func(k, v []byte) bool {
			return len(k) == klen
		})
		if err != nil {
			close(deleteFileChan)
			wg.Wait()
			return err
		}
		if len(keys) == 0 {
			break
		}
		// advance past the last returned key so it is not scanned (and deleted) twice
		prefixKey = nextKey(keys[len(keys)-1])

		for index, key := range keys {
			deleteFileChan <- pair{key, values[index]}
		}

		if len(keys) < batchSize {
			break
		}
	}

	close(deleteFileChan)
	wg.Wait()
	return nil

In my tests, this change improves gc performance by roughly 10x in the scenario of deleting many small files.
