srs/docs/perf/proxy-whep.md
winlin a217ff9a4e Proxy: Enable pprof endpoints and add WHEP performance analysis guide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 18:18:46 -04:00

150 lines
5.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# How to Analyze WHEP Performance for the Proxy Server
This guide walks through profiling the Go proxy under a WHEP (WebRTC play) load.
The workload of interest is **one RTMP publisher + N WHEP players**, where N is
large enough to stress the proxy's UDP forwarding path (typically 300+).
When analyzing WHEP performance for the proxy, you should:
1. Set up the topology: proxy + SRS origin + publisher + WHEP load
2. Enable Go pprof on the proxy
3. Run the load and let it warm up
4. Collect CPU, allocation, heap, goroutine, and trace profiles
5. Read the profiles and identify hot spots
6. Save profiles to compare before and after a change
## Step 1: Build and Start the Proxy with pprof
The proxy reads `GO_PPROF` from the environment and, when set, exposes
`net/http/pprof` endpoints at that address. Use the same standard ports SRS
uses by default so the publisher and player commands stay unchanged.
```bash
cd ~/git/srs
make && env GO_PPROF=:6060 \
PROXY_RTMP_SERVER=1935 PROXY_HTTP_SERVER=8080 \
PROXY_HTTP_API=1985 PROXY_WEBRTC_SERVER=8000 PROXY_SRT_SERVER=10080 \
PROXY_SYSTEM_API=12025 PROXY_LOAD_BALANCER_TYPE=memory \
./bin/srs-proxy
```
> The pprof endpoints live under `http://localhost:6060/debug/pprof/`. The
> proxy registers them only because `internal/debug/pprof.go` blank-imports
> `net/http/pprof`. Without that import the endpoints return 404.
## Step 2: Start the SRS Origin on Alt Ports
`origin1-for-proxy.conf` runs SRS on non-standard ports (RTMP 19351, HTTP 8081,
API 19851, RTC 8001/udp, SRT 10081) so the proxy can sit on the defaults. SRS
auto-registers with the proxy's system API on startup.
Set `CANDIDATE` to a LAN-reachable IP so the SDP answer the proxy returns
points clients at an address they can route to. The proxy only rewrites the
candidate **port**; the IP comes from the origin's SDP.
```bash
ulimit -n 10000 && bash -c "cd ~/git/srs/trunk && \
CANDIDATE=192.168.3.187 ./objs/srs -c conf/origin1-for-proxy.conf"
```
## Step 3: Run the WHEP Workload
In separate terminals, start the publisher and the WHEP load generator.
**Publisher (RTMP):**
```bash
cd ~/git/srs/trunk
ffmpeg -stream_loop -1 -re -i doc/source.200kbps.768x320.flv \
-c copy -f flv -y rtmp://localhost/live/livestream
```
**WHEP players (use the LAN IP that matches `CANDIDATE`):**
```bash
cd ~/git/srs/trunk/3rdparty/srs-bench
./objs/srs_bench -sr webrtc://192.168.3.187/live/livestream -nn 300
```
Let the workload run for at least 30 seconds before sampling. Connection
setup churn dominates the first few seconds and will skew profiles taken
too early.
> Sanity-check with `-nn 1` first. If a single WHEP session does not play,
> the 300-player run is testing something other than steady-state forwarding.
## Step 4: Collect Profiles
Profiles must be collected **while the workload is steady**, not before or
after. The CPU profile is the single most useful starting point.
```bash
# CPU profile (30s sample) — interactive web UI on :8123
# Use :8123 (or any free port) because :8080 is the proxy's HTTP-FLV/HLS port.
go tool pprof -http=:8123 'http://localhost:6060/debug/pprof/profile?seconds=30'
# Allocation profile — GC pressure / per-packet allocations
go tool pprof -http=:8124 http://localhost:6060/debug/pprof/allocs
# Heap (live memory snapshot)
go tool pprof -http=:8125 http://localhost:6060/debug/pprof/heap
# Goroutine count + stack dump — look for goroutine explosion under load
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -50
# Runtime trace (10s) — GC pauses, scheduler latency, syscall behavior
curl -s -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=10'
go tool trace trace.out
```
The web UI requires Graphviz for the Flame Graph and Graph views:
```bash
brew install graphviz # macOS
```
If you cannot install Graphviz, the **Top** view in the web UI is HTML-only
and works without it. The CLI form is also unaffected:
```bash
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'
(pprof) top20
(pprof) top20 -cum
(pprof) list <FunctionName>
```
## Step 5: Read the Profiles
Open the web UI and use the views in this order:
1. **Flame Graph** — visual hot path. Wide bars near the top are where time
is spent. For 300-player WHEP the path should be dominated by
`webRTCProxyServer.Run` and its UDP read/write children.
2. **Top** — sorted list by `flat` (self time) and `cum` (cumulative). The
top 510 functions usually tell the whole story.
3. **Graph** — call graph with edge weights. Good for tracing "who calls this
hot function".
4. **Source** — line-level cost inside a single function. Use after Top has
pointed you at a function worth dissecting.
## Step 6: Save Profiles for Before/After Comparison
When you change code to fix a hot spot, comparing profiles is the only
reliable way to confirm the fix moved the needle (and didn't just shift cost
elsewhere).
```bash
# Save the raw profile from a baseline run
curl -s -o cpu-before.pb.gz 'http://localhost:6060/debug/pprof/profile?seconds=30'
# After the code change, sample again under the same workload
curl -s -o cpu-after.pb.gz 'http://localhost:6060/debug/pprof/profile?seconds=30'
# Diff the two
go tool pprof -http=:8123 -base cpu-before.pb.gz cpu-after.pb.gz
```
In the diff view, red bars are functions that got more expensive, green
bars are functions that got cheaper. The total should shrink overall if
the change is a net win.