Notes from Gil Tene’s Talk: Understanding Java Garbage Collection and what you can do about it

This blog post has been a long time in the making. I’ve been a fan of Gil’s for some time and I’ve been meaning to go through this video and create notes, since I needed to really study this video for work. It’s been a while since I watched this and so I thought that these notes would be useful for somebody out there, not the least of whom is me. 🙂

If you’re a fan of the video, Understanding Java Garbage Collection and what you can do about it, I hope you will appreciate the notes below.

Understanding Java Garbage Collection and what you can do about it

This Talk’s Purpose / Goals

  • This talk is focussed on GC education
  • This is not a “how to use flags to tune a collector” talk
  • This is a talk about how the “GC machine” works
  • Purpose: Once you understand how it works, you can use your own brain…
  • You’ll learn just enough to be dangerous…
  • The “Azul makes the world’s greatest GC” stuff will only come at the end.

High level agenda

  • GC fundamentals and key mechanisms
  • Some GC terminology & metrics
  • Classifying currently available collectors
  • The “Application Memory Wall” problem
  • The C4 collector: What an actual solution looks like

Memory use

  • People using < 1/2GB largely for phones
  • Most people have heap sizes >= 1GB < 10GB, irrespective of application

Why should you care about GC?

Start with…

What is Garbage Collection good for?

  • Prevalent in modern languages and platforms
  • Productivity, stability
    • Programmers not responsible for freeing and destroying objects
      • GC frees up time to debug logic code, as opposed to lower level concerns like correctly calling destructors
    • Eliminates entire (common) areas of instability, delay, maintenance
  • Guaranteed interoperability
    • No “memory management contract” needed across APIs
      • Garbage collection implies a memory management contract common to all objects in the heap.
    • Uncoordinated libraries, frameworks, utilities seamlessly interoperate.
  • Facilitates practical use of large amounts of memory
    • Complex and intertwined data structures, in and across unrelated components
    • Interesting concurrent algorithms become practical…

Why should you understand (at least a little) how GC works?

The story of the good little architect

  • A good architect must, first and foremost, be able to impose their architectural choices on the project…
  • Early in Azul’s concurrent collector days, we encountered an application exhibiting 18 second pauses
    • Upon investigation, we found the collector was performing tens of millions of object finalizations per GC cycle
      • We have since made reference processing fully concurrent…
  • Every single class written in the project has a finalizer
    • The only work the finalizers did was nulling every reference field
  • The right discipline for a C++ ref-counting environment
    • The wrong discipline for a precise garbage collected environment

Most of what People seem to “know” about Garbage Collection is wrong

  • In many cases, it’s much better than you may think
    • GC is extremely efficient. Much more so than malloc()
    • Dead objects cost nothing to collect
    • GC will find all the dead objects (including cyclic graphs)
  • In many cases, it’s much worse than you may think
    • Yes, it really does stop for ~1 sec per live GB (in most JVMs)
    • No, GC does not mean you can’t have memory leaks
    • No, those pauses you eliminated from your 20 minute test are not gone
      • Need to be aware that your tuning is likely just delaying the inevitable pause that will happen – possibly making it worst than before when it does actually happen

Trying to solve GC problems in application architecture is like throwing knives

  • You probably shouldn’t do it blindfolded
  • It takes practice and understanding to get it right
  • You can get very good at it, but do you really want to?
    • Will all the code you leverage be as good as yours?
  • Examples:
    • Object pooling
    • Off heap storage
    • Distributed heaps
    • (In most cases, you end up building your own garbage collector)

Some GC Terminology (IMPORTANT SLIDES) !!!

A Basic Terminology example: What is a concurrent collector?

  • A Concurrent Collector performs garbage collection work concurrently with the application’s own execution
    • Concurrency exists relative to application
  • A Parallel Collector uses multiple CPUs to perform garbage collection
    • For a parallel collector, concurrency exists relative to itself (i.e. it has more than one thread running concurrently), but concurrency doesn’t necessarily exist relative to application (i.e. GC threads executing might require suspending ALL application threads).
    • The terms “Parallel” and “Concurrent” with respect to garbage collection are orthogonal terms.
  • A Stop-the-World collector performs garbage collection while the application is completely stopped
    • STW is opposite of concurrent
  • An Incremental collector performs a garbage collection operation or phase as a series of smaller discrete operations with (potentially long) gaps in between
    • Monolithic collection is effectively executing all incremental steps at once without letting application run in between increments.
  • Mostly means sometimes it isn’t (usually means a different fall back mechanism exists)
    • Need to read it opposite way
    • Mostly concurrent means sometimes STW.
    • Mostly incremental means sometimes Monolithic.
    • Mostly parallel means sometimes it’s not.

Precise vs. Conservative Collection

  • A Collector is Conservative if it is unaware of some object references at collection time, or is unsure about whether a field is a reference or not (or pointer vs. integer).
    • Conservative Collector means that it can’t move objects around as much because it doesn’t know size to copy.
    • Conservative collectors almost always run into fragmentation issues.
  • A collector is Precise (aka Accurate) if it can fully identify and process all object references at the time of collection
    • A collector MUST be precise in order to move objects
    • The COMPILERS need to produce a lot of information (OOP Maps)
  • All commercial server JVMs use precise collectors

Safepoints

  • A GC Safepoint is a point or range in a thread’s execution where the collector can identify all the references in that thread’s execution stack.
    • Safepoint and GC Safepoint are often used interchangeably.
    • But there are other types of safe points, including ones that require more information than a GC safe point does (e.g. de-optimization).
  • “Bringing a thread to a safe point” is the act of getting a thread to reach a safe point and not execute past it.
    • Close to, but not exactly the same as “stop at a safepoint”
      • e.g. JNI: you can keep running in, but not past the safe point
    • Safepoint opportunities are (or should be) frequent (i.e. micro-second frequent) (e.g. method boundaries and back-edges of loops).
  • In a Global Safepoint all threads are at a Safepoint
    • Global safe points are where STW pauses occur.

What’s common to all precise GC mechanisms?

  • Identify the live objects in the memory heap
  • Reclaim resources held by dead objects
  • Periodically relocate live objects
  • Examples:
    • Mark/Sweep/Compact (common for Old generations)
    • Copying collector (common for Young Generations)

Mark (aka “Trace”)

  • Start from “roots” (thread stack, statics, etc.)
  • “Paint” everything you can reach from “root” as “live”
  • At the end of a mark pass:
    • all reachable objects will be marked as “live”
    • all non-reachable objects will be marked “dead” (aka “non-live”)
  • Note: work is generally linear to “live set” (not size of heap) (i.e. if I’m only using 2GB of a 4GB heap then marking is only scans the 2GB of used objects, not the 4GB of the heap).

Sweep

  • Scan through the heap, identify “dead” objects and track them somehow
    • usually in some form of free list
  • Note: work is generally linear to heap size (not live set – unlike Mark phase).

Compact

  • Over time, heap will get “swiss cheesed”: contiguous dead space between objects may not be large enough to fit new objects (aka “fragmentation”)
    • Conservative collectors encounter this issue.
  • Compaction moves live objects together to reclaim contiguous empty space (aka “relocate”)
  • Compaction has to correct all object references to point to new object locations (aka “remap”)
    • Remapping is what makes compaction slow – because we need to change other live objects to point to new location.
    • You want to move a lot of objects as once so that you can change all references of existing live objects to point to new re-mapped objects, so that you don’t have to compact too many times.
  • Note: work is generally linear to live set

Copy

  • A copying collector moves all lives objects from a “from” space to a “to” space and reclaims “from” space.
    • Single pass operation
  • At start of copy, all objects are in “from” space and all references point to “from” space.
  • Start from “root” references, copy any reachable object to “to” space, correcting references as we go.
  • At end of copy, all objects are in “to” space, and all references point to “to” space.
  • Note: Work is generally linear to live set

Mark/Sweep/Compact, Copy, Mark/Compact

  • Copy requires twice the max size of live set in order to be reliable.
    • From space gets potentially all moved to To space without any removals.
  • Mark/Compact (no sweep) (typically) requires twice the max live set size in order to fully recover garbage in each cycle.
  • Mark/Sweep/Compact doesn’t need extra memory (just a little) because it happens in-place.
  • Copy and Mark/Compact are linear to live set size
  • Mark/Sweep/Compact is linear (in sweep) to heap size
  • Mark/Sweep/(Compact) may be able to avoid some moving work
  • Copying is typically monolithic

Generational Collection

  • Based on Weak Generational Hypothesis: Most objects die young
  • Focus collection efforts on young generation:
    • Use a moving collector: work is linear to live set
    • The live set in the young generation is a small % of the space of the heap
    • Promote objects that live long enough to older generations
  • Only collect older generations as they fill up
    • “Generational filter” reduces rate of allocation into older generations
  • Tends to be (order of magnitude) more efficient
    • Great way to keep up with high allocation rate
    • Practical necessity for keeping up with processor throughput

Generational Collection

  • Requires a “Remembered set”: a way to track all references into the young generation from the outside.
  • Remembered set is also part of “roots” for young generation collection
  • No need for twice the live set size: Can “spill over” to old generation
  • Usually want to keep surviving objects in young generation for a while before promoting them to the old generation.
    • Immediate promotion can significantly reduce generational filter efficiency.
    • Waiting too long to promote can eliminate generational benefits.

How does the remembered set work?

  • Generational collectors require a “Remembered set”: a way to track all references into the young generation from the outside.
  • Each store of a NewGen reference into and OldGen object needs to be intercepted and tracked
  • Common technique: “Card Marking”
    • A bit (or byte) indicating a word (or region) in OldGen is “suspect”.
  • Write barrier used to track references
    • Common technique (e.g. HotSpot): blind stores on reference write
    • Variants: precise vs. imprecise card marking, conditional vs. non-conditional

The typical combos in commercial server JVMs

  • Young generation usually uses a copying collector
  • Young generation is usually monolithic (STW)
  • Old generation usually uses Mark/Sweep/Compact
  • Old generation may be STW, or Concurrent, or mostly-Concurrent, or Incremental-STW, or mostly-Incremental-STW

Useful terms for discussing garbage collection

  • Mutator
    • Your program…
  • Parallel
    • Can use multiple CPUs
  • Concurrent
    • Runs concurrently with program
  • Pause
    • A time duration in which the mutator is not running any code
  • Stop-The-World (STW)
    • Something that is done in a pause
  • Monolithic Stop-The-World
    • Something that must be done in it’s entirety in a single pause
    • Note: This is usually the noticeable pause people see in apps.
  • Generational
    • Collects young objects and long lived objects separately.
  • Promotion
    • Allocation into old generation
  • Marking
    • Finding all live objects
  • Sweeping
    • Locating the dead objects
  • Compaction
    • Defragments heap
    • Moves objects in memory
    • Remaps all affected references
    • Frees contiguous memory regions

Useful metrics for discussing garbage collection

  • Heap population (aka Live set)
    • How much of your heap is alive
  • Allocation rate
    • How fast you allocate
  • Mutation rate
    • How fast your program updates references in memory
  • Heap Shape
    • The shape of the live set object graph
    • Hard to quantify as a metric…
  • Object Lifetime
    • How long objects live
  • Cycle time
    • How long it takes the collector to free up memory
  • Marking time
    • How long it takes the collector to find all live objects
  • Sweep time
    • How long it takes to locate dead objects
    • Relevant for Mark-Sweep
  • Compaction time
    • How long it takes to free up memory by relocating objects
    • Relevant for Mark-Compact

Empty memory and CPU/throughput

Heap Size vs. GC CPU

Two Intuitive limits

  • If we had exactly 1 byte of empty memory at all times, the collector would have to work “very hard”, and GC would take 100% of the CPU time
  • If we had infinite empty memory, we would never have to collect, and GC would take 0% of the CPU time
  • CPU% arising from GC cycles decays exponentially as you increase memory.

Empty memory needs (empty memory == CPU power)

  • The amount of empty memory in the heap is the dominant factor controlling the amount of GC work
  • For both Copy and Mark/Compact collectors, the amount of work per cycle is linear to live set
  • The amount of memory recovered per cycle is equal to the amount of unused memory (heap size) – (live set).
  • The collector has to perform a GC cycle when the empty memory runs out
  • A Copy or Mark/Compact collector’s efficiency doubles with every doubling of the empty memory.

What empty memory controls

  • Empty memory controls efficiency (amount of collector work needed per amount of application work performed).
  • Empty memory controls the frequency of pauses (if the collector performs any GTW operations).
  • Empty memory DOES NOT control pause times (only their frequency).
  • In Mark/Sweep/Compact collectors that pause for sweeping, more empty memory means less frequent but LARGER pauses.
  • This is bad for web services that need to be highly available.

Concurrent Marking

  • Mark all reachable objects as “live”, but object graph is “mutating” under us.
  • Classic concurrent marking race: mutator may move reference that has not yet been seen by the marker into an object that has already been visited.
    • If not intercepted or prevented in some way, will corrupt the heap.
  • Example technique: track mutations, multi-pass marking
    • Track reference mutations during mark (e.g. in card table)
    • Revisit all mutated references (and track new mutations)
    • When set is “small enough”, do a STW catch up (mostly concurrent)
  • Note: work grows with mutation rate, may fail to finish

Incremental Compaction

  • Track cross-region remembered sets (which region points to which)
  • To compact a single region, only need to scan regions that point into it to remap all potential references
  • Identify region sets that fit in limited time
    • Each such set of regions is a STW increment
    • Safe to run application between (but not within) increments
    • Incremental compaction is sensitive to the popular objects (i.e. if a region contains a popular object, referenced by many regions, we won’t be able to find many regions that are only pointed to by a small number of other regions).
  • Note: work can grow quadratically relative to heap size.
    • The number of regions pointing into a single region is generally linear to the heap size (the number of regions in the heap)

Delaying the inevitable

  • Some form of copying/compaction is inevitable in practice
    • And compacting anything requires scanning/fixing all references to it
  • Delay tactics focus on getting “easy empty space” first
    • This is the focus for the vast majority of GC tuning
  • Most objects die young (Generational)
    • So collect young objects only, as much as possible. Hope for short STW pause that occurs infrequently.
    • But eventually, some old dead objects must be reclaimed.
  • Most old dead space can be reclaimed without moving it
    • Track dead space in lists, and reuse it in place (e.g. CMS)
    • But eventually, space gets fragmented and needs to be moved/defragged
  • Much of the heap is not “popular” (e.g. G1, “Balanced”)
    • A non-popular region will only be pointed to from a small percentage of the heap.
    • So compact non-popular regions occur in short STW pauses
    • But eventually popular objects and regions need to be compacted.

Classifying Common Collectors

The typical combos in commercial server JVMs

  • Young generation usually uses a copying collector
    • Young generation GC is usually monolithic, STW GC.
  • Old generation usually uses a Mark/Sweep/Compact collector.
    • Old generation may be STW or Concurrent or mostly-Concurrent or Incremental STW or mostly-Incremental STW

Hotspot ParallelGC Collector mechanism classification

  • Monolithic STW copying NewGen
  • Monolithic STW Mark/Sweep/Compact OldGen

HotSpot ConcMarkSweepGC (aka CMS) Collector mechanism classification

  • Monolithic STW copying NewGen (ParNew)
  • Mostly Concurrent, non-compacting OldGen (CMS)
    • Mostly concurrent marking
      • Mark concurrently while mutator is running (previously discussed multi-pass technique)
      • Track mutations in card marks
      • Revisit mutated cards (repeat as needed)
      • STW to catch up on mutations, ref processing, etc.
    • Concurrent Sweeping
    • Does not Compact (maintains free list, does not move objects)
  • Fallback to Full Collection (Monolithic STW)
    • Used for compaction, etc.
    • When you see promotion failure message or “concurrent mode failure” it means that CMS NewGen collector can’t push the object from NewGen to OldGen because OldGen doesn’t have a Swiss Cheese sized hole big enough to accommodate this new object so it falls back to full STW pause to clean up OldGen.

HotSpot G1GC (aka “Garbage First”) Collector mechanism classification

  • Monolithic STW copying NewGen
  • Mostly concurrently, OldGen marker (slightly different than CMS)
    • Mostly concurrent marking
      • STW to catch up on mutations, reference processing, etc.
    • Tracks inter-region relationships in remembered sets
  • STW mostly incremental compacting OldGen (instead of trying to be concurrent like CMS)
    • Objective: “Avoid, as much as possible, having a Full GC…”
    • Compact sets of regions that can be scanned in limited time
    • Delay compaction of popular objects, popular regions
  • Fallback to Full Collection (Monolithic STW)
    • Used for compacting popular objects, popular regions, etc.

The “Application Memory Wall”

Memory Use

  • Why are we all using between 1GB and 4GB?

Reality check: servers in 2012

  • Retail prices, major web server store (USD$, Oct. 2012)
    • 24 vCore, 128GB server ~= $5k (most of us use less than half of this server)
    • 24 vCore, 256GB server ~= $8k
    • 32 vCore, 384GB server ~= $14k
    • 48 vCore, 512GB server ~= $19k
    • 64 vCore, 1TB server ~= $36k
    • Cheap (< $1/GB/month), and roughly linear ~1TB
  • 10s to 100s of GB/sec of memory bandwidth
  • If no apps need that much memory, nobody will build the servers.

The Application Memory Wall A simple observation:

  • Application instances appear to be unable to make effective use of modern server memory capacities. * The size of application instances as a % of a server’s capacity is rapidly dropping.
  • This is happening at approx. Moore’s Law rates.

How much memory do applications need?

  • Famous quote (that Bill G. didn’t actually say) “640KB ought to be enough for anybody”.
    • WRONG!
  • So what’s the right number?
    • 6.4MB?
    • 64MB?
    • 640MB?
    • 6.4GB?
    • 64GB?
  • There is no right number
  • Target moves at 50x-100x per decade

“Tiny” application history

  • “Tiny” means that it’s silly to break it into pieces Tiny Application History

What is causing the Application Memory Wall?

  • Garbage Collection is a clear and dominant cause
  • There seems to be practical heap size limits for applications with responsiveness requirements.
    • We build apps that respond to users, we can’t wait for a 3min long STW wait time.
    • Certainly with ms SLA web services, we need to keep heap size small.
  • [Virtually] All current commercial JVMs will exhibit a multi-second pause on a normally utilized 2-6GB heap.
    • It’s a question of “When” and “How often”, not “If”.
    • GC tuning only moves the “when” and the “how often” around.
  • Root cause: The link between scale and responsiveness.

What quality of GC is responsible for the Application Memory Wall?

  • It is NOT about overhead or efficiency:
    • CPU utilization, bottlenecks, memory consumption and utilization
  • It is NOT about speed
    • Average speeds, 90%, 95% speeds, are all perfectly fine
  • It is NOT about minor GC events (right now)
    • GC events in the 10s of msec are usually tolerable for most apps
  • It is NOT about the frequency of very large pauses
  • It is ALL about the worst observable pause behavior
    • People avoid building/deploying visibly broken systems

GC Problems

Framing the discussion: Garbage Collection at modern server scales

  • Modern servers have hundreds of GB of memory
  • Each modern x86 core (when actually used) produces garbage at a rate of 0.25 to 0.5+ GB/sec
  • That’s many GB/sec of allocation in a server
  • Monolithic STW operations are the cause of the current Application Memory Wall
    • Even if they are done “only a few times a day”

One way to deal with Monolithic-STW GC

  • Stick your head in the sand.

Another way to cope: Use “Creative Language”

  • “Guarantee a worst case of X msec, 99% of the time”
  • “Mostly” Concurrent, “Mostly” Incremental
    • Translation: “Will at times exhibit long monolithic STW pauses”
  • “Fairly Consistent”
    • Translation: “Will sometimes show results well outside this range”
  • “Typical pauses in the tens of msecs”
    • Translation: “Some pauses are much longer than tens of msecs”

Actually measuring things: jHiccup

How can we break through the Application Memory Wall?

We need to solve the right problems

  • Focus on the causes of the Application Memory Wall
    • Scale is artificially limited by responsiveness
  • Responsiveness must be unlinked from scale:
    • Heap size, Live Set size, Allocations rate, Mutation rate
    • Responsiveness must be continually sustainable
    • Can’t ignore “rare” events
  • Eliminate all STW Fallback
    • At modern server scales, any STW fall back is a failure.

The things that seem “hard” to do in GC

  • Robust concurrent marking
    • References keep changing
    • Multi-pass marking is sensitive to mutation rate
    • Weak, soft, final references are “hard” to deal with concurrently.
  • [Concurrent] Compaction…
    • It’s not the moving of the objects…
    • It’s the fixing of all those references that point to them
    • How do you deal with a mutator looking at a stale reference?
    • If you can’t, then remapping is a [monolithic] STW operation
  • Young Generation collection at scale
    • Young Generation collection is generally monolithic STW
    • Young Generation pauses are only small because heaps are tiny
    • A 100GB heap will regularly have GB of live young stuff

The problems that need solving (areas where the state of the art needs improvement)

  • Robust Concurrent Collecting
    • In the presence of high mutation and allocation rates
    • Cover modern runtime semantics (e.g. weak refs, lock deflation)
  • Compaction that is not monolithic STW
    • e.g. stay responsive while compacting 0.25TB heaps
    • Must be robust: not just a tactic to delay STW compaction
    • Current “incremental STW” attempts fall short on robustness
  • Young-Geo that is not monolithic STW
    • Stay responsive while promoting multi-GB data spikes
    • Concurrent or “incremental STW” may both be ok
    • Surprisingly little work done in this specific area

Azul’s “C4” Collector Continuously Concurrent Compacting Collector

  • Concurrent guaranteed-single-pass marker
    • Oblivious to mutation rate
    • Concurrent ref (weak, soft, finalizer) processing
  • Concurrent Compactor
    • Objects moved without stopping mutator
    • References remapped without stopping mutator
    • Can relocate entire generation (New, Old) in every GC cycle
  • Concurrent compacting old generation
  • Concurrent compacting new generation
  • No STW fallback
    • Always compacts and does so concurrently

C4 Algorithm Highlights

  • Same core mechanism used for both generations
    • Concurrent Mark-Compact
  • A loaded value barrier (LVB) is central to the algorithm
    • Every heap reference is verified as “sane” when loaded
    • “Non-sane” refs are caught and fixed in a self-healing barrier
  • Refs that have not yet been “marked through” are caught
    • Guaranteed single pass concurrent marker
  • Refs that point to relocated objects are caught
    • Lazily (and concurrently) remap refs, no hurry
    • Relocation and remapping are both concurrent
  • Uses “quick release” to recycle memory
    • Forwarding information is kept outside of object pages
    • Physical memory released immediately upon relocation
    • “Hand-over hand” compaction without requiring empty memory

Sample responsiveness behavior

  • Not seeing efficiency, just response time
  • You want a nice flat plateau Sample Responsiveness Behavior
  • SpecJBB + Slow churning 2GB LRU Cache
  • Live set is ~2.5GB across all measurements
  • Allocation rate is ~1.2GB/sec across all measurements

Zing

  • A JVM for Linux/x86 servers
  • ELIMINATES Garbage Collection as a concern for enterprise applications
  • Decouples scale metrics from response time concerns
    • Transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.
  • Leverages elastic memory for resilient operation

Sustainable Throughput: The throughput achieved while safely maintaining service levels

  • Key Takeaway: Because Zing/C4 is truly concurrent and compacting, the cap on heap size (i.e. application memory wall) large crumbles, enabling us to make heap size a greater ratio of total memory available for a host.

Instance capacity test: “Fat Portal” – HotSpot CMS: Peaks at ~3GB / 45 concurrent users

Fat Portal Hotspot

Instance capacity test: “Fat Portal” C4: still smooth @800 concurrent users

  • Used 50GB heap, no need to way for multi-second STW pauses Fat Portal Zing
  • Zing is just far faster

Java GC tuning is “hard”

  • also prone to degrading performance as future versions take over previous versions

The complete guide to Zing GC tuning

  • java -Xmx40g

What you can expect (from Zing) in the low latency world

  • Assuming individual transaction work is “short” (on the order of 1msec), and assuming you don’t have hundreds of runnable threads competing for 10 cores…
  • “Easily” get your application to <10msec worst case
  • With some tuning, 2-3 msec worst case
  • Can go to below 1msec worst case…
    • May require heavy tuning/tweaking
    • Mileage WILL vary

Tuned Oracle HotSpot (Highly Tuned to use only NewGen collector restarting JVM every night so that there is no major GC) vs. Zing

Hotspot vs. Zing

Azul initiative to Support Open Source Community

  • Free access to Zing JVM for Open Source developers and projects
  • For use in development, qualification, and testing
  • For x86 servers running Red Hat Enterprise Linux, SUSE Linux Enterprise Server, CentOS and Ubuntu Linux