
[feature] More efficient handling of sparse files #41

@vassilit

Description

fuse-archive reads archive content with archive_read_data(). For sparse entries, the gaps come back filled with null bytes, so fuse-archive has no way to tell them apart from real data.
Operations on big sparse files could probably be made more efficient.
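
To make the problem concrete, here is a minimal sketch of what zero-filling amounts to. It is not libarchive's or fuse-archive's code: `zero_fill` is a hypothetical helper, and the `(offset, data)` pairs stand in for the blocks a sparse entry would yield. Once the blocks are flattened, the information about where the holes were is gone.

```python
def zero_fill(blocks, size):
    """Flatten (offset, data) blocks into `size` bytes, leaving zeros in every gap.

    Hypothetical helper for illustration; assumes every block fits within `size`.
    """
    out = bytearray(size)  # starts as all zeros, so holes need no extra work
    for offset, data in blocks:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

A consumer of the flattened bytes sees the same content either way, but it can no longer skip the holes or recreate them on disk.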

% tar xvf test/data/sparse.tar
sparse
% time cp sparse sparse.copy
cp sparse sparse.copy  0.00s user 0.00s system 64% cpu 0.002 total
% time cp sparse sparse.copy
cp sparse sparse.copy  0.00s user 0.00s system 63% cpu 0.001 total
% du sparse.copy
4	sparse.copy

% out/fuse-archive test/data/sparse.tar mnt
fuse-archive: Created mount point 'mnt'
% time cp mnt/sparse sparse.copy
cp mnt/sparse sparse.copy  0.00s user 0.36s system 47% cpu 0.735 total
% time cp mnt/sparse sparse.copy
cp mnt/sparse sparse.copy  0.00s user 0.44s system 52% cpu 0.839 total
% du sparse.copy
1048576	sparse.copy
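
The same comparison that du makes can be done programmatically. The sketch below compares a file's apparent size with the space actually allocated for it. `sparseness` is a hypothetical helper name, and the 512-byte unit for `st_blocks` is an assumption that holds on Linux but is not guaranteed by POSIX.

```python
import os

def sparseness(path):
    """Return (apparent_size, allocated_bytes) for the file at `path`."""
    st = os.stat(path)
    apparent = st.st_size
    # Assumption: st_blocks counts 512-byte units, as it does on Linux.
    allocated = st.st_blocks * 512
    return apparent, allocated
```

A file whose allocated size is much smaller than its apparent size is sparse. A copy made through the mount point would be expected to show the two values roughly equal.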

For some reason, the first invocation on the file inside the mounted archive is faster than the following ones (0.735 s vs 0.839 s above).
For the plain file, the second invocation being slightly faster is probably due to the kernel cache.
With fuse-archive -o kernel_cache, the second invocation on the mounted file is faster as well:

% out/fuse-archive -o kernel_cache test/data/sparse.tar mnt
fuse-archive: Created mount point 'mnt'
% time cp mnt/sparse sparse.copy
cp mnt/sparse sparse.copy  0.00s user 0.34s system 46% cpu 0.738 total
% time cp mnt/sparse sparse.copy
cp mnt/sparse sparse.copy  0.00s user 0.17s system 34% cpu 0.491 total

Using archive_read_data_block() directly would bring some benefits, such as:

  • support for SEEK_HOLE and SEEK_DATA (through FUSE_LSEEK)
  • more efficient reads with tools that support sparseness (coreutils, databases, VMs, etc.)
  • possibly more efficient sequential reads of big sparse files in general (though probably not: fuse-archive would have to put the zeros in memory instead of libarchive, so they would be there either way)
  • a meaningful st_blocks value
  • for some tools, output files that are sparse as well, which reduces disk usage and stays closer to the original file in the archive.
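
As a rough sketch of the bookkeeping this would need, the class below records the offset and length of each data block and answers SEEK_DATA and SEEK_HOLE style queries from that map. Everything here is hypothetical (`ExtentMap` and its methods are illustration names, not fuse-archive code), and it assumes the blocks arrive in increasing offset order without overlapping.

```python
from bisect import bisect_right

class ExtentMap:
    """Data extents of one sparse file; everything outside them is a hole."""

    def __init__(self, size):
        self.size = size   # apparent file size in bytes
        self.starts = []   # sorted start offsets of the data extents
        self.ends = []     # matching end offsets (exclusive)

    def add_block(self, offset, length):
        """Record one data block (assumed in increasing, non-overlapping order)."""
        if length == 0:
            return
        if self.ends and self.ends[-1] == offset:
            self.ends[-1] = offset + length  # merge with the adjacent extent
        else:
            self.starts.append(offset)
            self.ends.append(offset + length)

    def _extent_at(self, pos):
        """Index of the last extent starting at or before pos, or -1."""
        return bisect_right(self.starts, pos) - 1

    def seek_data(self, pos):
        """Smallest offset >= pos holding data, or None (the ENXIO case)."""
        if pos >= self.size:
            return None
        i = self._extent_at(pos)
        if i >= 0 and pos < self.ends[i]:
            return pos
        if i + 1 < len(self.starts):
            return self.starts[i + 1]
        return None

    def seek_hole(self, pos):
        """Smallest offset >= pos inside a hole; end of file counts as a hole."""
        if pos >= self.size:
            return None
        i = self._extent_at(pos)
        if i >= 0 and pos < self.ends[i]:
            return self.ends[i]
        return pos

    def allocated(self):
        """Total number of data bytes, a possible basis for st_blocks."""
        return sum(e - s for s, e in zip(self.starts, self.ends))
```

Such a map could be filled from the offset and size that archive_read_data_block() returns with each block, and `allocated()` divided by the block unit would give a st_blocks value that reflects the real data.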
