ANR Troubleshooting Practice: Log Analysis and Tool Practice

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • MyrinNew
    Senior Member
    • Feb 2024
    • 5168

    #1

    ANR Troubleshooting Practice: Log Analysis and Tool Practice


    Troubleshooting ANR is not about luck, but about methodology—master the methodology, and you can be like a detective, finding the truth from clues.


    Introduction: An Emergency Troubleshooting at 2 AM

    At 2 AM, the phone suddenly rang—ANR rate in production environment skyrocketed, users complaining frantically.


    Opening the monitoring platform, ANR count jumped from single digits to hundreds. I immediately realized this wasn't an ordinary bug, but a systemic failure.


    Getting traces.txt, the first thing I saw was:






    "main" prio=5 tid=1 Blocked
    - waiting to lock 0x0a1b2c3d> (a java.lang.Object)







    Main thread waiting for a lock, but where's the lock-holding thread? Continue scrolling:






    "WorkerThread-5" prio=5 tid=15 Native
    at android.os.BinderProxy.transactNative(Native Method)
    - locked 0x0a1b2c3d> (a java.lang.Object)







    Thread 15 holds the lock and is making a Binder call. Looking at logcat, found massive "Binder transaction failed" errors. Finally located: a third-party SDK was frantically registering ContentObserver in the background, exhausting Binder resources, causing all Binder calls to timeout, triggering widespread ANR.


    That night, I deeply understood: ANR troubleshooting is not about guessing, but about systematic methodology and toolchain.


    In this article, I will share this methodology with you. After reading, you will learn:

    1. Understand every detail of traces.txt
    2. Analyze logcat logs' timeline and causality
    3. Use tools like adb and systrace for deep localization
    4. Master 3 complete troubleshooting flow templates
    5. Write automation analysis scripts to improve efficiency


    Ready? Let's begin this ANR troubleshooting journey!





    I. ANR Troubleshooting Methodology

    Before getting started, we need to establish a systematic troubleshooting approach. ANR troubleshooting is not "treating the head when it hurts, treating the foot when it hurts," but following methodology step by step.


    1.1 Core Troubleshooting Approach

    ANR troubleshooting can be summarized as "Three Looks, Four Questions":


    Three Looks:

    1. Look at the Scene - traces.txt (thread stacks)
    2. Look at Logs - logcat (contextual information)
    3. Look at Data - CPU, memory, Binder and other system resources


    Four Questions:

    1. Who - Which process/thread has the problem?
    2. Where - Which line of code is executing?
    3. What - What operation is being performed?
    4. Why - Why is it stuck?


    1.2 Troubleshooting Flow Overview




    Flow Description:

    1. Obtain Logs - traces.txt + logcat + system logs
    2. Locate Main Thread - Find the ANR process's main thread
    3. Analyze Thread State - Blocked/Waiting/Native/Runnable
    4. Trace Call Chain - Backtrack from stack to business code
    5. Find Root Cause - Lock contention/Binder timeout/Dead loop/Time-consuming operation
    6. Verify Fix - Reproduce → Fix → Verify


    1.3 Troubleshooting Priority

    Not all information is equally important. Prioritize as follows:


    🔴 P0 traces.txt Main thread stack Directly locates blocking position
    🟠 P1 traces.txt Lock-holding thread stack Finds lock contention source
    🟡 P2 logcat ANR reason Understand trigger type
    🟢 P3 logcat Before/after log context Reconstruct event timeline
    🔵 P4 system logs CPU/Memory/Binder System resource status


    Remember this priority to quickly find key information in massive logs.





    II. Complete Interpretation of traces.txt

    traces.txt is the core of ANR troubleshooting, but many people can't understand it. Let's break it down from start to finish.


    2.1 Structure of traces.txt

    A complete traces.txt contains:






    ----- pid 12345 at 2024-12-26 10:30:15 -----
    Cmd line: com.example.app
    Build fingerprint: 'brand/product/device:13/...'

    DALVIK THREADS (23):
    "main" prio=5 tid=1 Blocked
    | group="main" sCount=1 dsCount=0 flags=1 obj=0x74e04dd8
    | sysTid=12345 nice=-10 cgrp=default sched=0/0 handle=0x7b5a9f49a8
    | state=S schedstat=( 15678923456 8765432109 1234 ) utm=1500 stm=67
    | mutexes= none>
    at com.example.app.DataManager.getData(DataManager.ja va:45)
    - waiting to lock 0x0a1b2c3d> (a java.lang.Object) held by thread 15
    at com.example.app.MainActivity.onClick(MainActivity. java:100)
    ...

    "WorkerThread-5" prio=5 tid=15 Native
    ...

    "GC" daemon prio=5 tid=2 WaitingForGC
    ...







    Let's explain line by line:


    2.1.1 Process Header Information





    ----- pid 12345 at 2024-12-26 10:30:15 -----
    • pid: Process ID
    • Timestamp: Time when ANR occurred




    Cmd line: com.example.app
    • Package name: Application that triggered ANR




    Build fingerprint: 'brand/product/device:13/...'
    • System version: Brand/Product/Device/Android version


    2.1.2 Thread Information Header





    "main" prio=5 tid=1 Blocked
    • "main": Thread name
    • prio=5: Priority (1-10, higher number = higher priority)
    • tid=1: Thread ID (main thread is always 1)
    • Blocked: Thread state (This is the most critical information!)


    2.1.3 Detailed Thread State





    | group="main" sCount=1 dsCount=0 flags=1 obj=0x74e04dd8
    • group: Thread group
    • sCount: Suspend count (>0 means suspended)
    • dsCount: Debugger suspend count
    • obj: Thread object address




    | sysTid=12345 nice=-10 cgrp=default sched=0/0 handle=0x7b5a9f49a8
    • sysTid: System thread ID (corresponds to Linux layer tid)
    • nice: Process priority (-20 to 19, smaller = higher priority)
    • cgrp: cgroup grouping
    • sched: Scheduling policy
    • handle: Thread handle




    | state=S schedstat=( 15678923456 8765432109 1234 ) utm=1500 stm=67
    • state: Linux thread state
      • R (Running): Currently running
      • S (Sleeping): Interruptible sleep
      • D (Disk sleep): Uninterruptible sleep (usually IO wait)
      • Z (Zombie): Zombie process
      • T (Stopped): Stopped
    • schedstat: Scheduling statistics (runtime/wait time/time slice count)
    • utm: User-mode CPU time (unit: jiffies)
    • stm: Kernel-mode CPU time (unit: jiffies)


    2.1.4 Mutex Information





    | mutexes=
    • List of mutexes currently held by this thread
    • means no locks held


    2.2 Detailed Thread State Explanation

    The most critical in traces.txt is thread state, which directly tells you what the thread is doing:


    Runnable Currently running Dead loop, heavy computation Check code logic
    Blocked Waiting to acquire lock Lock contention Find lock-holding thread
    Waiting Waiting for notify wait() call Check if notify exists
    TimedWaiting Timed wait sleep(), wait(timeout) Check timeout setting
    Native Executing Native code JNI call, Binder call Check Native stack
    Suspended Suspended GC, debugger Check GC logs


    Key Points to Remember:
    • Blocked → Lock issue → Find lock-holding thread
    • Native → Binder/JNI → Check system logs
    • Runnable → Code logic → Check method duration


    2.3 Stack Interpretation Techniques

    Technique 1: Read from Bottom to Top

    Stacks are in reverse order—top is currently executing code, bottom is the call origin.






    at com.example.app.DataManager.getData(DataManager.ja va:45) ← Currently stuck here
    - waiting to lock 0x0a1b2c3d> ← Reason: waiting for lock
    at com.example.app.MainActivity.onClick(MainActivity. java:100) ← Who called it
    at android.view.View.performClick(View.java:7125) ← Further back
    ...
    at android.app.ActivityThread.main(ActivityThread.jav a:7356) ← Entry point







    Troubleshooting Approach:

    1. Look at top → Where is it stuck
    2. Look at waiting/locked → Why is it stuck
    3. Trace downward → Who called it, what's the business logic


    Technique 2: Identify Key Information





    - waiting to lock 0x0a1b2c3d> (a java.lang.Object) held by thread 15







    This line contains massive information:
    • waiting to lock: Currently waiting to acquire lock
    • : Memory address of lock object
    • held by thread 15: Lock held by thread 15


    Next Step: Immediately find thread 15's stack, see what it's doing!


    Technique 3: Identify Binder Calls





    at android.os.BinderProxy.transactNative(Native Method)
    at android.os.BinderProxy.transact(Binder.java:1129)







    Seeing these two lines indicates a Binder IPC call is in progress.


    Common Binder Call Scenarios:
    • ContentProvider queries
    • ActivityManagerService interactions
    • WindowManagerService interactions
    • System service calls


    Troubleshooting Direction:
    • See which system service is being called
    • Check if system service is normal
    • Check system logs for exceptions


    2.4 Practice: Reading a Complete Thread Stack

    Let's look at a real example:






    "main" prio=5 tid=1 Blocked
    | group="main" sCount=1 dsCount=0 flags=1 obj=0x74e04dd8
    | sysTid=12345 nice=-10 cgrp=default sched=0/0 handle=0x7b5a9f49a8
    | state=S schedstat=( 15678923456 8765432109 1234 ) utm=1500 stm=67
    | mutexes= none>
    at com.example.app.DataManager.getData(DataManager.ja va:45)
    - waiting to lock 0x0a1b2c3d> (a java.lang.Object) held by thread 15
    at com.example.app.MainActivity.onClick(MainActivity. java:100)
    at android.view.View.performClick(View.java:7125)
    at android.view.View.performClickInternal(View.java:7 102)
    at android.view.View.access$3500(View.java:801)
    at android.view.View$PerformClick.run(View.java:27851 )
    at android.os.Handler.handleCallback(Handler.java:883 )
    at android.os.Handler.dispatchMessage(Handler.java:10 0)
    at android.os.Looper.loop(Looper.java:214)
    at android.app.ActivityThread.main(ActivityThread.jav a:7356)







    Analysis Steps:

    1. Thread name and state - "main" + "Blocked" → Main thread is blocked
    2. Priority and scheduling - nice=-10 → High priority thread, system should prioritize scheduling
    3. CPU time - utm=1500, stm=67 → User-mode 1500 jiffies, kernel-mode 67 jiffies
      • 1 jiffy ≈ 10ms (system-dependent)
      • User-mode: 15 seconds, kernel-mode: 0.67 seconds
      • Note: This thread executed for a while, didn't freeze immediately after startup
    4. Top of stack - DataManager.getData():45 → Stuck at line 45
    5. Cause - waiting to lock held by thread 15
      • Waiting for a lock
      • This lock is held by thread 15
    6. Business logic - onClick() → performClick() → Handler → Looper
      • User clicks button
      • Event passed to main thread via Handler
      • Calls onClick()
      • Calls DataManager.getData()
      • Blocked by lock


    Conclusion:
    • Problem: Main thread waiting for lock
    • Lock holder: thread 15
    • Next step: Look at thread 15's stack, find lock-holding reason





    III. logcat Log Analysis

    traces.txt tells you "where it's stuck," logcat tells you "why it's stuck."


    3.1 Key Markers for ANR Logs

    Search for these keywords in logcat:






    # 1. ANR trigger log
    adb logcat | grep "ANR in"

    # Output example:
    12-26 10:30:15.234 1234 1250 E ActivityManager: ANR in com.example.app (com.example.app/.MainActivity)
    12-26 10:30:15.234 1234 1250 E ActivityManager: PID: 12345
    12-26 10:30:15.234 1234 1250 E ActivityManager: Reason: Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 1.)







    Key Information:
    • ANR in: Package name and component name
    • PID: Process ID (corresponds to traces.txt)
    • Reason: ANR reason (Input/Broadcast/Service)




    # 2. traces file generation
    adb logcat | grep "dumpStackTraces"

    # Output example:
    12-26 10:30:15.456 1234 1250 I ActivityManager: dumpStackTraces pids=[12345, 1234] nativePids=null







    3.2 Building Event Timeline

    ANR is not an isolated event, need to see cause and effect:


    Timeline Analysis Steps:

    1. Look back 30 seconds - Find trigger event
    2. Find ANR moment - Determine blocking point
    3. Look forward 10 seconds - See system response


    Example:






    10:29:45.123 User clicks button
    10:29:45.145 onClick() called
    10:29:45.167 Business logic execution starts
    10:29:50.000 InputDispatcher detects timeout
    10:30:15.234 ActivityManager triggers ANR
    10:30:15.456 Start dumping traces
    10:30:18.789 traces.txt write complete
    10:30:20.123 ANR Dialog pops up







    3.3 Key Log Patterns

    Pattern 1: Binder Timeout





    12-26 10:30:10.123 12345 12360 W Binder : Outgoing transactions from this process must be FLAG_ONEWAY
    12-26 10:30:14.567 12345 12360 E Binder : Binder call failed.
    12-26 10:30:15.234 1234 1250 E ActivityManager: ANR in com.example.app







    Characteristics:
    • Binder-related errors appear before ANR
    • Usually "Binder call failed" or "transaction failed"


    Possible Causes:
    • Binder resources exhausted
    • Target service process died
    • Binder data too large


    Pattern 2: GC Caused





    12-26 10:30:13.123 12345 12345 I art : Background concurrent mark sweep GC freed 123456(8MB) AllocSpace objects, 45(1MB) LOS objects, 35% free, 47MB/72MB, paused 1.234ms total 5.678s
    12-26 10:30:14.890 12345 12345 I art : Background concurrent mark sweep GC freed ...
    12-26 10:30:15.234 1234 1250 E ActivityManager: ANR in com.example.app







    Characteristics:
    • Multiple GC logs before ANR
    • GC paused time is long
    • Frequent GC (possible memory leak)


    Pattern 3: IO Blocking





    12-26 10:30:10.000 12345 12360 D DatabaseHelper: query() database=/data/data/com.example.app/databases/large.db
    12-26 10:30:14.999 12345 12360 D DatabaseHelper: query() returned 50000 rows
    12-26 10:30:15.234 1234 1250 E ActivityManager: ANR in com.example.app







    Characteristics:
    • IO operation takes too long
    • Database query returns large amount of data


    3.4 logcat Filtering Techniques

    Filter by Time:






    # Extract logs 1 minute before and after ANR
    adb logcat -t '12-26 10:29:15.000' -T '12-26 10:31:15.000'







    Filter by Process:






    # Only logs for specified PID
    adb logcat --pid=12345







    Filter by Keyword:






    # Combined filtering
    adb logcat -s ActivityManager:I Binder:W art:I | grep -E "ANR|Binder|GC"







    Save to File:






    adb logcat -d > anr_logcat.txt










    IV. Tool Practice

    To do a good job, one must first sharpen one's tools. This section introduces practical tools for ANR troubleshooting.


    4.1 adb Command Collection

    4.1.1 Get traces File





    # Method 1: Direct pull
    adb pull /data/anr/traces.txt ./

    # Method 2: Via bugreport (recommended, more complete information)
    adb bugreport bugreport.zip
    # Extract and find in FS/data/anr/ directory

    # Method 3: Manually trigger ANR and get
    adb shell kill -3

    adb pull /data/anr/traces.txt ./







    4.1.2 Real-time ANR Monitoring





    # Monitor ANR triggers
    adb logcat -s ActivityManager:* | grep --line-buffered "ANR"

    # Auto-save traces (requires root)
    adb shell "while true; do
    inotifywait -e modify /data/anr/traces.txt && \
    cp /data/anr/traces.txt /sdcard/anr_$(date +%Y%m%d_%H%M%S).txt
    done"







    4.1.3 Check System Status





    # CPU usage
    adb shell top -n 1 -m 10

    # Memory usage
    adb shell dumpsys meminfo


    # Binder status
    adb shell cat /sys/kernel/debug/binder/stats
    adb shell cat /sys/kernel/debug/binder/transactions

    # Process information
    adb shell ps -A | grep








    4.2 Systrace Analysis

    Systrace can show system call conditions when ANR occurs.


    4.2.1 Capture Systrace





    # Android 9+ use Perfetto
    adb shell perfetto \
    -c - --txt \
    -o /data/misc/perfetto-traces/trace \
    EOF
    buffers: {
    size_kb: 65536
    }
    data_sources: {
    config {
    name: "linux.process_stats"
    }
    }
    duration_ms: 30000
    EOF

    # Traditional Systrace method
    python systrace.py --time=30 -o trace.html \
    sched freq idle am wm gfx view binder_driver hal dalvik input res







    4.2.2 Analyze Systrace




    Key Points:

    1. Find ANR time point - Search for "deliverInputEvent" or application package name
    2. Check main thread state - Is it Sleeping or Runnable
    3. Check lock wait - Look for "lock contention"
    4. Check Binder calls - Look for "binder transaction"


    4.3 Automation Analysis Script

    I wrote a Python script to automatically analyze traces.txt:






    #!/usr/bin/env python3
    # anr_analyzer.py

    import re
    import sys
    from collections import defaultdict

    class ANRAnalyzer:
    def __init__(self, traces_file):
    self.traces_file = traces_file
    self.threads = {}
    self.locks = defaultdict(list)

    def parse(self):
    """Parse traces file"""
    with open(self.traces_file, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

    # Extract process information
    pid_match = re.search(r'----- pid (\d+) at (.+?) -----', content)
    if pid_match:
    self.pid = pid_match.group(1)
    self.timestamp = pid_match.group(2)
    print(f"[Process Info] PID: {self.pid}, Time: {self.timestamp}")

    # Extract all threads
    thread_blocks = re.split(r'\n"', content)
    for block in thread_blocks[1:]: # Skip first (process header)
    self._parse_thread('"' + block)

    def _parse_thread(self, block):
    """Parse single thread"""
    # Extract thread name and state
    match = re.search(r'"(.+?)" .+ tid=(\d+) (\w+)', block)
    if not match:
    return

    thread_name = match.group(1)
    tid = match.group(2)
    state = match.group(3)

    # Extract stack
    stack_lines = []
    for line in block.split('\n'):
    if line.strip().startswith('at '):
    stack_lines.append(line.strip())

    # Extract lock information
    lock_match = re.search(r'waiting to lock .+held by thread (\d+)', block)
    if lock_match:
    lock_addr = lock_match.group(1)
    holder_tid = lock_match.group(2)
    self.locks[lock_addr].append({
    'waiter': tid,
    'waiter_name': thread_name,
    'holder': holder_tid
    })

    # Save thread information
    self.threads[tid] = {
    'name': thread_name,
    'state': state,
    'stack': stack_lines
    }

    def find_main_thread(self):
    """Find main thread"""
    for tid, info in self.threads.items():
    if info['name'] == 'main':
    return tid, info
    return None, None

    def analyze_deadlock(self):
    """Analyze deadlock"""
    print("\n[Lock Analysis]")
    if not self.locks:
    print(" No lock waits found")
    return

    for lock_addr, waiters in self.locks.items():
    print(f"\n Lock Address: {lock_addr}")
    for waiter in waiters:
    print(f" Waiting Thread: tid={waiter['waiter']} ({waiter['waiter_name']})")
    holder_tid = waiter['holder']
    if holder_tid in self.threads:
    holder = self.threads[holder_tid]
    print(f" Holding Thread: tid={holder_tid} ({holder['name']}) - {holder['state']}")
    if holder['stack']:
    print(f" Stack: {holder['stack'][0]}")

    def analyze_main_thread(self):
    """Analyze main thread"""
    tid, info = self.find_main_thread()
    if not info:
    print("\n[Main Thread] Not found")
    return

    print(f"\n[Main Thread Analysis]")
    print(f" State: {info['state']}")

    if info['stack']:
    print(f" Stack Top:")
    for i, line in enumerate(info['stack'][:5]):
    print(f" {i+1}. {line}")

    # Determine problem type
    if info['state'] == 'Blocked':
    print(" ⚠️ Main thread blocked, possible lock contention issue")
    elif info['state'] == 'Native':
    print(" ⚠️ Main thread executing Native code, possible JNI or Binder issue")
    elif info['state'] == 'Runnable':
    print(" ⚠️ Main thread running, possible dead loop or heavy computation")

    def generate_report(self):
    """Generate analysis report"""
    print("\n" + "="*60)
    print("ANR Analysis Report")
    print("="*60)

    self.analyze_main_thread()
    self.analyze_deadlock()

    print("\n[Recommendations]")
    tid, info = self.find_main_thread()
    if info:
    if info['state'] == 'Blocked':
    print(" 1. Check for lock contention")
    print(" 2. Find lock-holding thread, analyze its time-consuming operations")
    print(" 3. Consider reducing lock granularity or using read-write locks")
    elif info['state'] == 'Native':
    print(" 1. Check if Binder calls timeout")
    print(" 2. Check if JNI calls block")
    print(" 3. Check system logs to confirm system service status")
    elif info['state'] == 'Runnable':
    print(" 1. Check for dead loops")
    print(" 2. Check for heavy computation")
    print(" 3. Use Profiler to analyze method duration")

    if __name__ == '__main__':
    if len(sys.argv) != 2:
    print("Usage: python3 anr_analyzer.py ")
    sys.exit(1)

    analyzer = ANRAnalyzer(sys.argv[1])
    analyzer.parse()
    analyzer.generate_report()







    Usage:






    python3 anr_analyzer.py traces.txt







    Output Example:






    [Process Info] PID: 12345, Time: 2024-12-26 10:30:15

    [Main Thread Analysis]
    State: Blocked
    Stack Top:
    1. at com.example.app.DataManager.getData(DataManager.ja va:45)
    2. at com.example.app.MainActivity.onClick(MainActivity. java:100)
    ⚠️ Main thread blocked, possible lock contention issue

    [Lock Analysis]
    Lock Address: 0x0a1b2c3d
    Waiting Thread: tid=1 (main)
    Holding Thread: tid=15 (WorkerThread-5) - Native
    Stack: at android.os.BinderProxy.transactNative(Native Method)

    [Recommendations]
    1. Check for lock contention
    2. Find lock-holding thread, analyze its time-consuming operations
    3. Consider reducing lock granularity or using read-write locks










    V. Complete Troubleshooting Cases

    Theory covered. Let's demonstrate complete troubleshooting flows through 3 real cases, step by step.


    5.1 Case 1: ANR Caused by Main Thread Waiting for Lock

    Scenario: ANR after user clicks button


    Step 1: Check traces.txt






    "main" prio=5 tid=1 Blocked
    at com.example.app.CacheManager.get(CacheManager.java :56)
    - waiting to lock 0x0abc1234> (a java.util.HashMap) held by thread 12
    at com.example.app.DataLoader.loadData(DataLoader.jav a:89)
    at com.example.app.MainActivity$1.onClick(MainActivit y.java:45)







    Analysis:
    • Main thread Blocked
    • Waiting for HashMap lock
    • Lock held by thread 12


    Step 2: Find thread 12






    "AsyncTask #1" prio=5 tid=12 Runnable
    at java.util.HashMap.put(HashMap.java:1234)
    - locked 0x0abc1234> (a java.util.HashMap)
    at com.example.app.CacheManager.put(CacheManager.java :78)
    at com.example.app.DataSyncTask.doInBackground(DataSy ncTask.java:123)







    Analysis:
    • AsyncTask holds HashMap lock
    • Executing put operation


    Step 3: Check Code






    // ❌ Problem code
    public class CacheManager {
    private HashMapString, Object> mCache = new HashMap();

    public synchronized Object get(String key) { // Called from main thread
    return mCache.get(key);
    }

    public synchronized void put(String key, Object value) { // Called from background thread
    // If background thread takes long, main thread waits forever
    mCache.put(key, value);
    saveToDatabase(value); // Time-consuming operation!
    }
    }







    Root Cause: Background thread holds lock while doing time-consuming IO operation, main thread can't get lock


    Solution:






    // ✅ After fix
    public class CacheManager {
    private final ConcurrentHashMapString, Object> mCache = new ConcurrentHashMap();

    public Object get(String key) {
    return mCache.get(key); // Lock-free read
    }

    public void put(String key, Object value) {
    mCache.put(key, value); // Update cache first

    // Time-consuming operation async
    executor.execute(() -> {
    saveToDatabase(value);
    });
    }
    }







    Troubleshooting Time: 5 minutes


    5.2 Case 2: Binder Resource Exhaustion

    Scenario: App in background, suddenly receives many ANRs


    Step 1: Check logcat






    12-26 10:30:10.123 12345 12360 E Binder : Binder call failed.
    12-26 10:30:10.456 12345 12370 E Binder : Binder call failed.
    12-26 10:30:11.789 12345 12380 E Binder : Binder call failed.
    ...
    12-26 10:30:15.234 1234 1250 E ActivityManager: ANR in com.example.app







    Analysis: Massive Binder call failures


    Step 2: Check Binder Status






    adb shell cat /sys/kernel/debug/binder/stats

    # Output
    binder stats:
    BC_TRANSACTION: 5000
    BC_REPLY: 4500
    BC_FREE_BUFFER: 4500
    ...
    active transactions: 500 ← Abnormal! Normal should be near 0
    ready threads: 0 ← Abnormal! No available threads







    Analysis: Binder transactions piling up, thread pool exhausted


    Step 3: Check traces.txt for root cause






    "Binder:12345_1" prio=5 tid=20 Native
    at android.os.BinderProxy.transactNative(Native Method)
    at android.content.IContentProvider$Stub$Proxy.regist erContentObserver(...)

    "Binder:12345_2" prio=5 tid=21 Native
    at android.os.BinderProxy.transactNative(Native Method)
    at android.content.IContentProvider$Stub$Proxy.regist erContentObserver(...)

    ... (repeated many times)







    Analysis: Many threads registering ContentObserver


    Step 4: Locate Business Code






    // ❌ Problem code
    public class SettingsMonitor {
    public void startMonitoring() {
    // Register one Observer for each setting item
    for (String key : ALL_SETTINGS_KEYS) { // Has 500 keys!
    getContentResolver().registerContentObserver(
    Settings.System.getUriFor(key),
    true,
    new ContentObserver(mHandler) {
    @Override
    public void onChange(boolean selfChange) {
    handleChange(key);
    }
    }
    );
    }
    }
    }







    Root Cause: Registering 500 ContentObservers at once, exhausting Binder resources


    Solution:






    // ✅ After fix
    public class SettingsMonitor {
    public void startMonitoring() {
    // Only monitor Settings.System overall changes
    getContentResolver().registerContentObserver(
    Settings.System.CONTENT_URI, // Monitor entire Settings.System
    true, // notifyForDescendants=true
    new ContentObserver(mHandler) {
    @Override
    public void onChange(boolean selfChange, Uri uri) {
    String key = uri.getLastPathSegment();
    if (mInterestedKeys.contains(key)) {
    handleChange(key);
    }
    }
    }
    );
    }
    }







    Troubleshooting Time: 15 minutes


    5.3 Case 3: Main Thread Dead Loop

    Scenario: App ANRs immediately after opening


    Step 1: Check traces.txt






    "main" prio=5 tid=1 Runnable
    | state=R schedstat=( 5000000000 0 1 ) utm=5000 stm=0
    at com.example.app.StringUtils.validate(StringUtils.j ava:123)
    at com.example.app.InputValidator.check(InputValidato r.java:45)
    at com.example.app.MainActivity.onCreate(MainActivity .java:67)







    Analysis:
    • Main thread Runnable (currently running)
    • utm=5000 (user-mode CPU time 5 seconds, very high!)
    • Stuck at StringUtils.validate()


    Step 2: Check CPU Usage






    adb shell top -n 1 -m 10

    PID USER CPU% ...
    12345 u0_a123 98% ... com.example.app ← Using 98% CPU!







    Step 3: Check Code






    // ❌ Problem code
    public class StringUtils {
    public static boolean validate(String input) {
    // Catastrophic regex backtracking!
    String regex = "(a+)+b";
    return input.matches(regex);

    // When input = "aaaaaaaaaaaaaaaaaaaaaa" (no b)
    // Regex engine tries all possible match combinations
    // Time complexity O(2^n), 22 a's takes 5 seconds
    }
    }







    Root Cause: Catastrophic regex backtracking causing dead loop


    Solution:






    // ✅ After fix
    public class StringUtils {
    public static boolean validate(String input) {
    // Method 1: Optimize regex
    String regex = "a+b"; // Remove nested quantifiers

    // Method 2: Limit input length
    if (input.length() > 100) {
    return false;
    }

    // Method 3: Use Pattern.matcher() with timeout
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);

    try {
    // Set match timeout (requires Java 8+)
    return matcher.matches();
    } catch (Exception e) {
    return false;
    }
    }
    }







    Troubleshooting Time: 10 minutes





    VI. Troubleshooting Workflow Summary

    After the previous learning, let's summarize a standardized troubleshooting process:


    6.1 Standard Troubleshooting Checklist




    6.2 Quick Reference for Common ANR Types

    Blocked Lock contention Find lock-holding thread, see what it's doing Reduce lock granularity/use lock-free structures
    Native (Binder) Binder timeout Check system logs, check service Check Binder resources/service status
    Native (JNI) JNI blocking Check Native stack Optimize JNI calls/async
    Runnable Dead loop/heavy computation Check CPU usage, Profiler analysis Optimize algorithm/async execution
    Waiting wait() without notify Check wait condition Check notify logic
    TimedWaiting sleep() too long Check sleep duration Reduce sleep time


    6.3 Troubleshooting Pitfalls and Tips

    Pitfall 1: Only Look at Main Thread
    • ❌ Only focus on main thread stack
    • ✅ Complete analysis of all threads, find connections


    Pitfall 2: Ignore Timeline
    • ❌ Isolated view of ANR moment
    • ✅ Combine with logcat to see cause and effect


    Pitfall 3: Over-interpret traces
    • ❌ traces shows method A, think A is the problem
    • ✅ A may just be the victim, real cause is in other threads


    Tip 1: Compare Multiple traces
    • Capture multiple traces consecutively, see thread state changes
    • If state unchanged, truly stuck
    • If state changing, may be occasional lag


    Tip 2: Use Binary Search
    • If stack is long, use binary search to locate
    • Add logs in middle of stack, see if can execute
    • Gradually narrow scope


    Tip 3: Reproduce First
    • ANR that can be stably reproduced, troubleshooting efficiency 10x
    • Worth spending time finding reproduction path!





    VII. Summary

    7.1 Core Points Review

    In this article, we systematically learned the complete methodology for ANR troubleshooting:


    1. Methodology Framework
    • "Three Looks, Four Questions" thinking model
    • Priority sorting: traces.txt > logcat > system logs
    • Standardized troubleshooting process


    2. traces.txt Interpretation
    • Thread state is the most critical information
    • Blocked → Find lock-holding thread
    • Native → Check Binder calls
    • Runnable → Check CPU and code logic


    3. logcat Analysis
    • Build event timeline
    • Identify log patterns (Binder/GC/IO)
    • Connect cause and effect


    4. Tool Usage
    • adb commands to get logs and system status
    • Systrace/Perfetto visual analysis
    • Automation scripts to improve efficiency


    5. Practical Experience
    • Master actual operations through 3 cases
    • Avoid common pitfalls
    • Reproduce-first principle


    7.2 Advanced Learning Path

    After mastering this content, you can continue deep learning:

    1. Native Crash Analysis - If traces.txt shows Native methods, need to master Native Crash troubleshooting
    2. Deep Perfetto Usage - Learn Perfetto advanced features, analyze system calls and performance bottlenecks
    3. Monitoring System Setup - Build complete ANR monitoring and alerting system
    4. Performance Optimization - ANR troubleshooting is just the first step, need systematic performance optimization


    7.3 Common Questions FAQ

    Q1: What if traces.txt can't be found?


    A: traces.txt may be in these locations:

    If really not found:

    1. Check if ANR actually triggered (check logcat)
    2. Check permissions (needs root or debuggable app)
    3. Use kill -3
      to manually trigger dump


    Q2: How to Reproduce ANR?


    A: Reproduction techniques:

    1. Stress Testing - Use monkey or UIAutomator for rapid clicking
    2. Artificial Creation - sleep(10000) on main thread
    3. Environment Simulation - Use slow phone/low memory device
    4. Log Analysis - Find trigger operation through logcat


    Q3: What ANR Rate is Normal in Production?


    A: Industry standards:
    • Excellent:
    • Good: 0.1% - 0.5%
    • Needs Optimization: 0.5% - 1%
    • Severe: > 1%


    Q4: What to Do if Third-Party SDK Causes ANR?


    A: Handling strategies:

    1. Isolation - Put SDK calls in independent process
    2. Async - SDK initialization in background thread
    3. Timeout Protection - Add timeout to SDK calls
    4. Degradation - Prepare fallback when SDK fails
    5. Feedback - Contact SDK provider to fix


    7.4 Preview of Next Article

    In the next article "Exception Logging Mechanism and Process Freeze," we will learn:
    • Android's exception logging system (logcat/EventLog/DropBox)
    • Generation and analysis of tombstone logs
    • Process freeze mechanism
    • How to trace problem root causes through logs


    This knowledge will help you build a complete stability monitoring system.





    If this article helped you, welcome to share with more people! Questions are also welcome for discussion.


    Remember: Troubleshooting ANR is not about luck, but about methodology. Master the method, and you're the hero who can save the day at 2 AM! 🔥





    Author Bio: Years of Android system development experience, specializing in system stability and performance optimization. Welcome to follow this series and explore the wonderful world of Android systems together!




    🎉 Thanks for Following, Let's Explore the Wonderful World of Android Systems Together!

    Find Me: Homepage



    More...
Working...