Description
When dealing with a large number of pending deleted files, we typically use `juicefs gc` to remove them. However, `juicefs gc` invokes the `scanPendingFiles` function, where each soft-deleted file is processed sequentially (see lines 2584 to 2594 at commit 2dd3897).
The problem is that when there are many pending files and each one is small, deletion throughput drops sharply. Although gc provides a `--threads` parameter, that parameter only controls the parallelism of deletes *within* a single file, so it does little when the backlog consists of lots of small files.
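To make the bottleneck concrete, here is a minimal, self-contained sketch (illustration only, not the actual juicefs code; `pendingFile`, `deleteFileData`, and the 1 ms delay are made-up stand-ins) of how per-file sequential cleanup scales with the number of small files:

```go
// Illustration only: why sequential per-file processing is the bottleneck
// for many small files. deleteFileData is a stand-in for the real cleanup.
package main

import (
	"fmt"
	"time"
)

// pendingFile stands in for one pending-deletion entry (inode, size).
type pendingFile struct {
	ino, size uint64
}

// deleteFileData simulates deleting one small file's objects: a single
// round trip, so parallelism inside this call cannot speed it up.
func deleteFileData(ino, size uint64) {
	time.Sleep(time.Millisecond)
}

func main() {
	pending := make([]pendingFile, 1000)
	start := time.Now()
	// Each pending file is handled before the next one starts, so the
	// total time grows linearly with the number of files.
	for _, f := range pending {
		deleteFileData(f.ino, f.size)
	}
	fmt.Println("sequential cleanup took", time.Since(start))
}
```

With 1000 files at roughly 1 ms each, this loop takes about a second regardless of how much intra-file parallelism is available; only handling multiple files concurrently, as proposed below, shortens it.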
I think we can solve this in tkv.go by processing the pending files in parallel with a pool of workers. The code is below; please help me review it. :)
```go
batchSize := 1000000
// Use at least one worker, and up to a third of MaxDeletes.
threads := max(1, m.conf.MaxDeletes/3)
deleteFileChan := make(chan pair, threads)
var wg sync.WaitGroup
for i := 0; i < threads; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Each worker decodes one pending entry and cleans it up.
		for p := range deleteFileChan {
			key, value := p.key, p.value
			if len(key) != klen {
				logger.Errorf("invalid key %x", key)
				continue
			}
			ino := m.decodeInode([]byte(key)[1:9])
			size := binary.BigEndian.Uint64([]byte(key)[9:])
			ts := m.parseInt64(value)
			clean, err := scan(ino, size, ts)
			if err != nil {
				logger.Errorf("scan pending deleted files: %s", err)
				continue
			}
			if clean {
				m.doDeleteFileData(ino, size)
			}
		}
	}()
}
prefixKey := m.fmtKey("D")
endKey := nextKey(prefixKey)
for {
	keys, values, err := m.scan(prefixKey, endKey, batchSize, func(k, v []byte) bool {
		return len(k) == klen
	})
	// Check the error before the empty-batch test, so a failed scan that
	// returns no keys is not silently ignored.
	if err != nil {
		close(deleteFileChan)
		wg.Wait()
		return err
	}
	if len(keys) == 0 {
		break
	}
	// Resume the next scan just after the last key, so the last entry of
	// a full batch is not scanned and processed twice.
	last := keys[len(keys)-1]
	prefixKey = append(append([]byte{}, last...), 0)
	for index, key := range keys {
		deleteFileChan <- pair{key, values[index]}
	}
	if len(keys) < batchSize {
		break
	}
}
close(deleteFileChan)
wg.Wait()
return nil
```
In my tests, this change improves gc performance by roughly 10x in the scenario of deleting many small files.