Devel::MAT Memory Leak Analysis - Perl
A look at a slow memory growth case and how recently-added features in Devel::MAT
version 0.38 can help get to the bottom of it.
Slow Growth Over Time
Given a process that grows a small amount of memory over time, it can sometimes be difficult to look for the cause of this with only one heapdump file containing just one frozen moment in time. The size of the overall process could be large as compared to the small amount that it grows by. Unless the process is very old and thus has accumulated a lot of growth, it won't easily become apparent where the problem is.
Alternatively, we can find all the SVs in a heap dump that appear to be memory leaks, and see if we can identify some common pattern among them which may point towards an explanation.
The nature of a leaking SV is one that gets allocated at some time, and thereafter is never touched - it always hangs around, having fundamentally the same type at the same address. A newly-added script in the Devel-MAT
distribution looks for SVs matching this pattern. To run this, we will need a sequence of heapdump files taken from the process.
Automated Heapdumps
The case we're looking at today involves a suspected leak in bom-feed-distributor
, because the process grows in memory usage over time. This was observed by periodic inspection of the VSZ
field in the output of the ps
utility.
To get some heapdumps out of this program we start by editing its main script, adding a line to load Devel::MAT::Dumper
configured to write a heapdump file on receipt of a SIGQUIT
signal. This signal handler is designed specifically for this purpose, because because it doesn't stop the process once it's finished, but instead resumes running normally after the file is written.
use Devel::MAT::Dumper qw( -dump_at_SIGQUIT -file /tmp/bom-feed-distributor-NNN.pmat );
Whenever we send a SIGQUIT
signal we'll get a new serial-numbered heapdump file in /tmp
. We can automate this to run regularly, and so leave it running overnight to collect up some files for analysis later:
$ PID=$(pgrep -f bom-feed-distributor)
$ while sleep $PID; do kill -QUIT 30829; done
Differential Analysis
Now that we have a collection of .pmat
files we can apply the pmat-leakreport
tool to analyse them. It needs at least three heapdump files to work with, but the more it has the better its analysis will be. It works by comparing the first two files to look for every newly-allocated SV that appears in the second that wasn't in the first. These become the "leak candidates". Then for every subsequent file it looks to see if any of those candidates get released or changed in type - in either case it turns out that candidate wasn't a leak after all. Once it has finished invalidating cases by looking at the remaining files, whatever is still left as a candidate is printed in a list. These are usually good places to continue investigating.
$ pmat-leakreport -1 bom-feed-distributor-2*.pmat >leakreport.txt
We'll take the entire series of ten files numbered 20
to 29
, as they'll be sufficiently far from the startup of the process that it should be settled in nicely, so as to minimise any false positives. We direct the output to a file in case the leak report is quite long - we'll be wanting to refer back to it repeatedly. Generating it is computationally-intensive, so it's nice to keep it on disk to refer back to.
Looking at the start of this file we can see a number of SVs that might make interesting candidates to investigate:
$ head -10 leakreport-2.txt
LEAK[1] SCALAR(UV,NV,PV) at 0x40ed0d0
LEAK[1] SCALAR(UV) at 0x5ed26d8
LEAK[1] REF() at 0x5f76ec8
LEAK[1] SCALAR(PV) at 0x5ecbf88
LEAK[1] SCALAR(UV) at 0x5e6e9d8
LEAK[1] REF() at 0x5fad540
LEAK[1] REF() at 0x5e59228
LEAK[1] REF() at 0x5fac498
LEAK[1] SCALAR(UV) at 0x5fd15d8
LEAK[1] SCALAR(UV) at 0x5f18998
Lets start by picking one of these and seeing what it is. Since these SVs are ones that have remained undisturbed until the end of the sequence of .pmat
files, we can load the latest one into the pmat
shell and try to identify it:
$ pmat bom-feed-distributor-29.pmat
Perl memory dumpfile from perl 5.14.2
Heap contains 266212 objects
pmat> identify 0x5f76ec8
REF() at 0x5f76ec8 is:
└─not found
Well this is already somewhat exciting. A leaked SV that identify
isn't able to find. This may be because it is no longer referenced from anywhere but still has a nonzero refcount, or because it's referenced by something that Devel::MAT::Dumper
can't see.
Unidentifiable SVs
Lets try to work out what this reference is, to see if we can learn more about it:
pmat> show 0x5f76ec8
REF() at 0x5f76ec8 with refcount 1
size 24 bytes
RV=HASH(14)=Date::Utility at 0x5f60f88
It's a Date::Utility
instance. Lets take a look at another SV and see if we can find a pattern:
pmat> identify 0x5ed26d8
SCALAR(UV) at 0x5ed26d8 is:
└─value {month} of HASH(14)=Date::Utility at 0x5f02008, which is:
└─the referrant of REF() at 0x60c9730, which is:
└─not found
pmat> identify 0x40ed0d0
SCALAR(UV,NV,PV) at 0x40ed0d0 is:
└─value {epoch} of HASH(14)=Date::Utility at 0x5ef57b8, which is:
└─the referrant of REF() at 0x5f7f910, which is:
└─not found
A pattern begins to emerge. Most of these leaking SVs are related to Date::Utility
instances. Perhaps there are quite a few in this file - we should count the total of them. A newly-added ability in Devel::MAT
0.38 allows us to do this, by using the --count
option to the find
command. We ask it to count the number of SVs blessed into this package:
pmat> find --count blessed Date::Utility
Total: 308 SVs
That's quite a few instances of a utility object class to have hanging around a server instance. We can use the previous files to discover the history here; using the --quiet
option to the pmat
shell to reduce unrelated noise in the output:
$ for F in bom-feed-distributor-*.pmat; do echo -n "$F: "; pmat --quiet $F find --count blessed Date::Utility; done
bom-feed-distributor-00.pmat: Total: 22 SVs
bom-feed-distributor-01.pmat: Total: 45 SVs
bom-feed-distributor-02.pmat: Total: 301 SVs
bom-feed-distributor-03.pmat: Total: 301 SVs
bom-feed-distributor-04.pmat: Total: 301 SVs
bom-feed-distributor-05.pmat: Total: 301 SVs
bom-feed-distributor-06.pmat: Total: 301 SVs
bom-feed-distributor-07.pmat: Total: 301 SVs
...
bom-feed-distributor-29.pmat: Total: 308 SVs
It seems that these objects accumulate fairly early on in program startup, over the first few hours, and thereafter remain relatively stable at just over 300.
Clues In Source Code
At this point, I'll admit I got stuck trying to think of what might be going on, so I asked around to see if anyone else had ideas. Tom Molesworth observed the following in Date/Utility.pm
line 50:
my $lru = tie %popular, 'Tie::Hash::LRU', 300;
Here we see a cache implemented by the Tie::Hash::LRU
package, which keeps 300 objects around. That sounds like a very likely cause of the issue we see here. Furthermore, this package uses an XS-based implementation and so may well have Perl object references stored somewhere inside C-based code where Devel::MAT::Dumper
can't see. This would explain why identify
was unable to provide any further information about them.
Here we reach an eventual conclusion that is perhaps not quite as exciting as in previous articles. We've discovered that, although the memory requirement of this program does slowly increase initially, it does eventually reach a stable level after it has been running for some time. These Date::Utility
instances do hang around for a while, but intentionally so as part of a caching mechanism designed to reduce object churn for commonly used values. This is not actually a memory leak as such, but instead an effect of object caching.
Additionally, we've seen that there are limits to how far Devel::MAT::Dumper
can see. Specifically, that things like XS-implemented modules can confuse it by breaking visibility of the relationship between SVs that live on the heap, causing us not to see certain details.
Finally, it's worth repeating the fact that sometimes when analysing a problem and you get stuck, talking it over with co-workers and colleagues can often reveal new hints of idea that can help resolve the issue.
Note: This post has been ported from https://tech.binary.com/ (our old tech blog). Author of this post is https://metacpan.org/author/PEVANS