Skip to content

Potential memory leak on reloading model #72

Open
@jriegner

Description

Hey guys,

We use tfgo and notice an increase of memory usage each time our model gets reloaded. We have a running service which periodically checks whether the model got updated and reloads it. Now I wouldn't expect the memory usage to increase, since the model in memory should be replaced by the updated one.

The code to load the model is

// load model into memory
	model := tg.LoadModel(
		"path/to/our/model",
		[]string{
			"serve",
		},
		&tf.SessionOptions{},
	)

But our monitoring shows that the usage goes up every time the model gets reloaded (once per hour). I profiled the service with pprof and could not see that any of the internal components in our code has a significantly growing memory usage.

Furthermore I built tensorflow 2.9.1 with debug symbols and wrote a small go app just reloading the model. I did this to check for memory leaks with memleak-bpfcc from https://github.com/iovisor/bcc. This gave me the following stack trace, which, I believe, shows that there is memory leaked

	1770048 bytes in 9219 allocations from stack
		operator new(unsigned long)+0x19 [libstdc++.so.6.0.28]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::New(google::protobuf::Arena*)+0x1c [libtensorflow_framework.so.2]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::NewFromPrototype(tensorflow::NodeDef const*, google::protobuf::Arena*)+0x20 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Add<google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler>(google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type*)+0xc2 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::Add()+0x21 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::add_node_def()+0x20 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x334 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDef>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::FunctionDefLibrary::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x240 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDefLibrary>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDefLibrary*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::GraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x291 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::GraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::GraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::MetaGraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x325 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::MetaGraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::MetaGraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::SavedModel::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x25b [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::MergeFromCodedStream(google::protobuf::io::CodedInputStream*)+0x32 [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::ParseFromCodedStream(google::protobuf::io::CodedInputStream*)+0x3e [libtensorflow_framework.so.2]
		tensorflow::ReadBinaryProto(tensorflow::Env*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::MessageLite*)+0x141 [libtensorflow_framework.so.2]
		tensorflow::(anonymous namespace)::ReadSavedModel(absl::lts_20211102::string_view, tensorflow::SavedModel*)+0x136 [libtensorflow_framework.so.2]
		tensorflow::ReadMetaGraphDefFromSavedModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::MetaGraphDef*)+0x5d [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModelInternal(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0x41 [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModel(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0xc0 [libtensorflow_framework.so.2]
		TF_LoadSessionFromSavedModel+0x2a8 [libtensorflow.so]
		_cgo_6ae2e7a71f9a_Cfunc_TF_LoadSessionFromSavedModel+0x6e [testapp]
		runtime.asmcgocall.abi0+0x64 [testapp]
		github.com/galeone/tensorflow/tensorflow/go._Cfunc_TF_LoadSessionFromSavedModel.abi0+0x4d [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel.func2+0x14f [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel+0x2b6 [testapp]
		github.com/galeone/tfgo.LoadModel+0x6d [testapp]
		main.reloadModel+0x276 [testapp]
		main.main+0x72 [testapp]
		runtime.main+0x212 [testapp]
		runtime.goexit.abi0+0x1 [testapp]


As you can see this stacktrace shows calls to tfgo and to the underlying tensorflow library. I am not sure if I read it right, but it seems like there is a leak in tfgo or tensorflow itself.

Is there a way to explicitly release the memory of a loaded model when we reload? Could it be a problem in tfgo?
If you need more information on this, please tell me.

Thanks in advance :)

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions