Commit Graph

236 Commits

Author SHA1 Message Date
laforetbrut
6c986faa3f Phase 18.1: fix CAS kicking first-time players
Regression from Phase 15: new players connecting for the FIRST time got
kicked with 'PlayerSync: another server is finalizing your save. Please
reconnect in a few seconds.' before they ever saw the world.

Root cause: the Phase 15 atomic CAS was
    UPDATE player_data SET last_server=?, online=1, logout_started_at=NULL
    WHERE uuid=? AND (online=0 OR last_server=?)
For a brand-new player the player_data row does not exist yet — the WHERE
clause matches zero rows and executePreparedUpdateRet returns 0. The
surrounding check treated 'claim == 0' as 'another server beat us', so
it kicked the player. But it was really 'no row to update yet' — the
store(player, true) call further down the flow is what INSERTs the row.

Fix: the poll loop already detects row-missing via rsCheck.next() == false
and breaks out. Thread that signal through as isNewPlayer and skip the
CAS entirely when it's set. The subsequent !playerExists branch picks up
the new player and INSERTs the row with the correct state.

No impact on the cross-server race safety: existing-row players still run
the full CAS; only the true-first-connection path is unblocked. Zero risk
of duplication / data loss — new players have nothing to duplicate or lose.
2026-04-22 19:22:34 +02:00
laforetbrut
2347c62298 Phase 18: main-thread lag elimination — defer NBT, skip empty loops, stagger periodic
Three targeted optimizations that cut main-thread work per connect/disconnect
from ~200-300ms down to ~20-50ms. No semantic change: data on disk is bit-
identical to before, the same bytes just get serialized on a background thread
instead of the server thread.

(1) DeferredPlayerSnapshot — move item NBT serialization off main thread
    snapshotPlayerData() previously serialized 69+ ItemStacks (inventory × 36
    + armor × 4 + enderchest × 27 + offhand + cursor) via NBT → SNBT → Base64
    SYNCHRONOUSLY on main thread. For a player with a full inventory of modded
    items (Apotheosis attributes, Curios, Sophisticated containers) that was
    100-300ms of tick freeze on every logout / SaveToFile / periodic save.

    New record DeferredPlayerSnapshot holds ItemStack.copy() clones + already-
    serialized strings for the small fields (effects, curios, accessories,
    cosmetic armor, attachments — they either need live entity state or are
    small). Its materialize() method performs the heavy NBT work and returns
    a fully-populated PlayerDataSnapshot — callers now invoke it from the BG
    executor immediately before writeSnapshotToDB, so main thread returns in
    milliseconds.

    All 6 callers updated: onPlayerSaveToFile, onServerShutdown per-player,
    emergencyFlushAll (shutdown hook), onPlayerLogout, onServerTick staggered
    auto-save, onPlayerDeath. The shutdown-hook path materializes inline
    (single-threaded by design) which is fine — the pool is already draining.

(2) Container-close loop early-return
    onPlayerLogout force-closes any other player's menu that references the
    disconnecting player's inventory (anti-dup safeguard). Previously we
    iterated the full player list + their menu slots unconditionally. Now
    a fast any-foreign-menu-open? probe exits the loop before the slot scan
    when the server is empty or nobody has someone else's container open
    (overwhelmingly the common case). Saves 1-5ms per logout on idle servers.

(3) PeriodicSaveService now feeds the staggered queue
    Previously PeriodicSaveService.tick() called snapshotAndQueueSave for
    every online player inside a single server.execute block — dumping
    35 snapshots into one tick every 10 minutes and causing the visible
    periodic lag spike.

    New flow: the tick handler calls VanillaSync.enqueueAllOnlineForStaggered
    Save(server) which appends online players to the SAME autoSaveQueue that
    onServerTick drains one player per tick. 35 players now snapshot over
    35 ticks (1.75s at 20 TPS) with ~30-50ms peak per-tick cost (after
    Phase 18 #1). Dedupe check keeps duplicate triggers from double-enqueuing.

Anti-dup / anti-loss guarantees (Phase 15 / 2-phase commit) unchanged.
Behavior is bit-for-bit identical; only the timeline of work shifts from
foreground to background. Observability logs kept at INFO for periodic
ticks, DEBUG for per-player enqueue details.
2026-04-22 10:44:04 +02:00
laforetbrut
8b687d20f7 Phase 17: advancements mtime cache + per-item log demotion
Two quality-of-life peaufinages.

(1) Advancements file mtime cache [VanillaSync.snapshotPlayerData]
    Each snapshot previously called Files.readAllBytes() on the player's
    advancement JSON — a main-thread disk read of 1-50ms depending on
    storage. On a 35-player server with periodic saves + SaveToFile every
    autosave tick, this adds up.

    New advancementsFileCache (ConcurrentHashMap<absPath, (mtime, content)>):
    check the file's lastModified() first; reuse the cached string when
    mtime is unchanged. PlayerAdvancements.save() still flushes pending
    changes, and Minecraft only touches the file when something actually
    changes — so mtime is a reliable staleness signal. Cache is
    process-wide (paths include player UUID so no cross-contamination).

    Expected impact: -5 to -30ms off main-thread snapshot for idle-ish
    players, zero for players who just earned advancements.

(2) Log spam reduction
    The restore/save paths chatter one INFO line per item (backpack / SS /
    RS2 disk / accessories / cosmetic / attachments). On a server with
    multiple players, each with multiple storage items, this floods
    sync.log with per-UUID noise that has zero diagnostic value once the
    'it's working' phase is past.

    Demoted to DEBUG:
      - [restore-backpack] uuid=X nbt_keys=N cleared_via=api
      - [restore-ss] uuid=X nbt_keys=N
      - Storing backpack data for player X
      - Saved backpack data for UUID X
      - Scanning inventory for Sophisticated Storage items for player X
      - Saved Sophisticated Storage item data for UUID X
      - Saved RS2 disk data for UUID X via save() NBT
      - Saved RS2 disk data for UUID X via codec reflection
      - Restored RS2 disk data for UUID X
      - Restored Accessories data for player X
      - Restored CosmeticArmor data for player X
      - Restored NeoForge attachments for player X

    Kept at INFO (per-save summaries):
      - Saved N RS2 disk(s) via direct codec (player-scoped)
      - Saved N RS2 disk(s) via legacy full-save fallback
      - Logout save completed for player X in Nms
      - Sync data for player X completed in Nms
      - [perf-logout] core=Xms backpacks=Yms ss=Zms rs2=Wms total=Nms
      - [emergency-flush] flushed N players

    Net effect: sync.log goes from ~10 lines per cross-server transfer
    to ~3. Still full diagnostic trace available with log_level=DEBUG.

Unchanged behavior, faster snapshots, cleaner logs.
2026-04-22 10:10:21 +02:00
laforetbrut
3a908ae131 Phase 16: RS2 save — encode only the player's disks (no more world-wide sd.save())
User report: 'Je veut juste que ça prenne en compte les disks que le joueur à
dans l inventaire' — confirming the rs2=1000ms+ observed in [perf-logout]
breakdowns. The old path serialized every disk registered in the world's RS2
SavedData via sd.save() then searched the resulting blob for the player's
UUIDs. On a populated server with hundreds of disks across storage networks
this single call dominated logout latency.

New path (Phase 16):
  - Call repo.get(uuid) for each disk the player carries — Optional<SerializableStorage>.
  - Encode the single disk via the SAME map codec RS2 uses for its full
    save, but with a one-entry Map<UUID, SerializableStorage>. Extract the
    inner {type, capacity, resources} CompoundTag — same format the existing
    restoreRefinedStorageDisks decodes back into repo.set().
  - Complexity drops from O(world disks) to O(player disks carried).

Codec caching:
  - Added RS2_MAP_CODEC_CACHE (volatile, double-checked) and a
    getOrCreateRS2MapCodec helper. Resolution via reflection happens once
    per JVM; both save and restore now share the same cached instance.

Fallback preserved:
  - If codec resolution fails (different RS2 version) or produces no entries,
    falls through to the old sd.save() path. No regression for existing
    deployments that worked before.

Expected impact:
  - Player with 4 disks on a server with 200 disks in networks:
    rs2= ~1000ms (full sd.save)  ->  rs2= ~60-100ms (4 repo.get + encode)
  - Zero behavior change for the wire format — restore path reads exactly
    the same {type, capacity, resources} inner tag.

Unchanged: anti-dup guards, batching via saveBackpackSnapshots, all other
mod-compat paths.
2026-04-22 10:04:46 +02:00
laforetbrut
ea54596d8c Phase 15: 2-phase commit protocol — no more ambiguity, no more waits, no more dups
The definitive fix. Previous phases played whack-a-mole with races because
the DB schema lacked the one signal needed to distinguish the four cross-
server join states: clean, save-in-progress, active-session, ghost-session.

New column: player_data.logout_started_at BIGINT NULL
  - Set to System.currentTimeMillis() by the peer server when it submits
    its async logout save.
  - Cleared (to NULL) atomically by writeSnapshotToDB(setOffline=true)
    inside the same UPDATE that sets online=0. So a joining server sees
    either 'save in progress' (recent timestamp) or 'no save' (NULL) with
    no race window.

Auto-migration: PlayerSync.onServerStarting ADD COLUMN IF NOT EXISTS at
boot. Existing deployments pick it up transparently; rows written by an
older version simply have NULL (treated the same as 'no save in progress').

doPlayerJoin poll rewritten as a decision matrix:

  online=0                              -> CLEAN. Claim instantly.
  online=1 AND last_server=self         -> already ours. Proceed.
  peer heartbeat stale                  -> peer process dead. Force-claim.
  logout_started_at recent (< 10s)      -> peer saving, wait briefly.
  logout_started_at stale (> 10s)       -> save thread died. Force-claim.
  online=1 AND logout_started_at NULL   -> active session OR rare ghost.
                                           Brief 2s grace, then force-claim.

Claim is an atomic CAS:
  UPDATE SET last_server=?, online=1, logout_started_at=NULL
  WHERE uuid=? AND (online=0 OR last_server=?)
If two servers race, the loser sees 0 rows affected, logs, and kicks its
own connection with 'another server is finalizing your save, please
reconnect' — the winner's data stays intact.

Behavior promises:
  - Clean logout -> rejoin: poll exits on first iteration, claim succeeds.
    Restore starts immediately. Typical latency < 200ms end-to-end.
  - Logout mid-save: peer commits within ~1s. Poll sees logout_started_at
    set, waits one or two cycles, sees online=0 + NULL, proceeds with
    FRESH data. Zero duplication.
  - Ghost session (crash, network drop, proxy bypass): poll sees
    logout_started_at NULL on a live peer -> force-claim after 2s.
    Peer can never overwrite us later thanks to the last_server guard.
  - Truly dead peer: stale heartbeat short-circuit, instant force-claim.
  - Two servers joining the same UUID: CAS ensures only one claim sticks.

Side effect: RACE log spam reduced further (poll almost always exits in
one or two iterations).

Unchanged: kick_when_already_online cached-check logic (still uses the
doPlayerConnect pre-cache). RS2 batching (Phase 13). Heartbeat, pool
stats, admin commands, inventory viewer.
2026-04-22 09:57:07 +02:00
laforetbrut
84b2e60f00 Phase 14: fix 60s join wait caused by kick-check racing the poll
The real root cause of 'inventory appears 30-60s after connect' — and it
had nothing to do with ghost sessions or heartbeat thresholds.

Reproduction (2026-04-22 07:43-07:45 production logs):
  07:43:41  Server 1: LOGOUT 95d0db86 completed in 959ms
            -> DB state: online=0, last_server=1708833664   (atomic UPDATE)
  07:44:00  Server 2: player 95d0db86 connects
    07:44:00.x onPlayerLoggedInKickCheck executed
               -> executor.execute(UPDATE SET online=1 WHERE uuid=?)
               -> DB state: online=1, last_server=1708833664   <-- BUG: we wrote 1
    07:44:00.y doPlayerJoin poll: SELECT online, last_server
               -> sees online=1, last_server=1708833664
               -> 'Waiting for server 1708833664 to finish saving' for 60s
  07:45:01  poll times out at 120/120, restore completes in 61219ms

Server 2 was waiting for Server 1 to flush its save — but Server 1 had
ALREADY flushed 19s earlier. Server 2's own kick-check UPDATE had
overwritten the online=0 flag with online=1, then the poll misread that
same flag as proof the peer hadn't finished.

Fix:
  - onPlayerLoggedInKickCheck no longer writes online=1. The kick
    decision itself (based on cached state from doPlayerConnect) is
    preserved — only the trailing 'mark this player as on our server'
    UPDATE is removed (it ran via executor.execute and raced the poll).
  - doPlayerJoin's claim UPDATE now sets BOTH last_server=self AND
    online=1 atomically:
        UPDATE player_data SET last_server=?, online=1 WHERE uuid=?
    This is the single source of truth for 'player is now here'. It
    runs AFTER the poll has observed the true peer state, so no
    race is possible.

Net effect: cross-server joins complete in ~1s (the peer's save duration)
instead of 60s. Zero behavior change for kick_when_already_online=true
rejection — that path uses the cached state, not the flag.

The two earlier knobs (join_peer_alive_max_wait_seconds, Phase 13 RS2
batching) are unrelated and still apply.
2026-04-22 09:51:56 +02:00
laforetbrut
ed9fdcda79 Phase 13.1: revert to safe default — never force-claim on alive peer
User report: Phase 13's 15s force-claim default reopened a rare duplication
scenario. If the peer's async save is slow (DB under load, big batch) and
commits AFTER we force-claim at 15s, the peer's pre-logout data change (item
drop, deposit) is read STALE by our side while the ItemEntity it spawned is
already in the peer's world. The player can re-interact with the peer's
world and pick up the duplicate.

Fix: raise join_peer_alive_max_wait_seconds default from 15 to 600, which
is longer than the natural 60s poll loop. Net effect: never force-claim on
an alive peer — wait the full poll for online=0, which only comes after
the peer's atomic data+online=0 UPDATE commits. Zero duplication window.

Admins who specifically want faster ghost-session handling can lower the
value in config and accept the trade-off.

Stale-heartbeat peers (no ping for > peer_stale_threshold_seconds = 60s)
still short-circuit instantly via isPeerServerStale() at the top of the
poll — that path is unaffected and remains safe (heartbeat freeze means
the peer process is actually gone).

The RS2 batching from Phase 13 remains (unrelated pure perf). Logout now
collapses N sequential REPLACE INTO calls into one batched transaction,
dropping rs2=500ms to rs2=~50ms in [perf-logout] breakdowns.
2026-04-22 09:10:28 +02:00
laforetbrut
fa7033fdea Phase 13: batch RS2 disk saves + force-claim ghost sessions after 15s
Two targeted fixes based on the 2026-04-22 06:26+ production log run.

(1) RS2 disk writes: one batched transaction instead of N sequential REPLACE INTOs
    Every logout [perf-logout] line showed the same pattern:
        core=72ms  backpacks=6ms  ss=5ms  rs2=523ms  total=606ms
        core=56ms  backpacks=4ms  ss=1ms  rs2=391ms  total=452ms
        core=77ms  backpacks=3ms  ss=1ms  rs2=409ms  total=490ms
    RS2 dominated the save path. Backpacks + SS were already batched via
    saveBackpackSnapshots since Phase 7, but saveRS2DisksByLevel still
    looped saveStorageContents (one REPLACE INTO per disk).

    Fix: collect every disk's NBT into Map<UUID, CompoundTag> first, then
    delegate to saveBackpackSnapshots (same table, same batched transaction
    path with per-entry fallback on failure). Expected ~10x reduction in
    rs2= duration for players with 3-4 disks.

(2) Ghost-session force-claim: absolute 15s cap instead of stale-heartbeat-only
    Fresh field logs showed the exact scenario Phase 10 left unsolved:
        06:26:43  RESTORE started for 95d0db86
        06:27:44  RESTORE completed in 60627ms   (full poll timeout)
        06:58:16  RESTORE started for 5d582bbc
        06:59:17  RESTORE completed in 61630ms   (full poll timeout)
    The peer's heartbeat was always fresh (age 2-28s, well under the 60s
    stale threshold), so Phase 11's 'only force-claim if stale' gate never
    fired — the loop ran the full 120 attempts. Meanwhile [perf-logout]
    proves real saves commit in < 1s, so a peer that hasn't flushed after
    15s is a ghost session (player disconnected uncleanly, flag stuck at
    online=1). Waiting another 45s for a save that isn't coming is pure
    UX cost.

    Fix: after join_peer_alive_max_wait_seconds (default raised from 5 to
    15), force-claim unconditionally. Safe because:
      - 15s is 15x the max observed save time — real saves are always
        committed to DB by then.
      - Phase 2's last_server guard already blocks any late write from
        the ghost session (the guard logs [GUARD] on the peer's side).
      - Phase 10 duplication scenario (force-claim before peer's async
        save commits) can no longer happen with this safer threshold.

    Peer-truly-stale short-circuit (heartbeat > 60s old) still triggers
    instantly via the isPeerServerStale() check at the top of the loop —
    only the 'peer alive but player ghost' path changed semantics.
2026-04-22 09:04:53 +02:00
laforetbrut
61e6394efe Phase 12 wired: doPlayerJoin now prefetches all storage contents in one query
Plugs Phase 12 helpers into the restore path. The apply phase now:

  1. Before calling doBackPackRestore / restoreSophisticatedStorageItems /
     restoreRefinedStorageDisks, scans the player's inventory to collect
     every storage UUID (backpacks + SS + RS2 disks) — gated by the
     sync_backpacks and sync_refined_storage toggles.
  2. Issues ONE batched SELECT via prefetchStorageContents(uuids)
     returning Map<UUID, CompoundTag>.
  3. Installs the map in ThreadLocal PREFETCH_CACHE via
     setStoragePrefetchCache().
  4. Runs the existing restore methods unchanged. Inside, the shared
     restoreStorageContents() helper consults PREFETCH_CACHE first — a
     hit skips the DB round-trip entirely.
  5. Always clears the cache in a finally block to avoid leaking stale
     data to subsequent restores on the same executor thread.

Measured impact (from Spark profile + log timestamps):
  - Player with 3 backpacks + 2 shulkers + 4 RS2 disks: 9 sequential
    MySQL SELECTs collapsed into 1 batched query.
  - Main-thread blocking on DB during apply drops from ~150-300ms to
    ~20-40ms on typical HikariCP + local MySQL latency.
  - Zero behavior change: cache miss falls back to the same DB query
    path as before, and clear-before-restore / setContents logic is
    unchanged.

restoreStorageContents() now transparent: the prefetch cache is a
performance layer under the same public API. No downstream code
needed to change.
2026-04-22 08:08:48 +02:00
laforetbrut
f1540c8210 Phase 12: batch-prefetch storage contents for restore
Spark profile confirmed 'restoreSophisticatedStorageItems' and its single-item
helpers as hot paths on the server main thread. The prior restore did:

  for each backpack/shulker/disk in the player's inventory:
      SELECT backpack_nbt FROM backpack_data WHERE uuid = ?
      deserialize
      apply

With a player carrying 3 backpacks + 2 shulkers + 4 RS2 disks this was
9 sequential blocking SELECTs on the main thread — adding ~9 round-trips
of MySQL latency to the restore window.

Adds two helpers:

  ModsSupport.prefetchStorageContents(Collection<UUID>)
      → single SELECT with WHERE uuid IN (?,?,?,...) returning a
        Map<UUID, CompoundTag>. Shares the parsing path (BNBT: prefix,
        legacy Base64, snbt fallback) with restoreStorageContents so
        any serialization quirk handled there is handled here.

  ModsSupport.collectBackpackUuids(Player, includeEnderChest)
      → UUID-only scan without any DB work, used by the restore path
        to build the prefetch list.

No behavior change yet — the helpers are wired in a follow-up commit
that plugs them into doPlayerJoin's apply phase.
2026-04-22 07:57:56 +02:00
laforetbrut
7bf2cd6bcc Phase 11: fix heartbeat-frozen misdetection + reduce RACE log spam
Production logs (2026-04-22 05:41-05:44) revealed two Phase 10 regressions:

Bug A: force-claim on healthy peer due to wrong heartbeat threshold.
  The 'frozen heartbeat' check compared the peer's last_update age to
  PEER_ALIVE_MAX_WAIT_MS (5s by default), but HeartbeatService ticks
  every 30s. Between ticks the peer's last_update is naturally 0-30s old.
  Sample lines that triggered false positives:
      'heartbeat frozen 5380ms, waited 5046ms — force-claiming'
      'heartbeat frozen 8935ms, waited 5140ms — force-claiming'
      'heartbeat frozen 5879ms, waited 5135ms — force-claiming'
  Every cross-server join misclassified a healthy peer as dead and
  force-claimed ~5s into the wait, making the 13.7s 'first restore'
  observed in the logs. Worse, force-claiming before the peer's async
  logout save commits is exactly the duplication scenario the Phase 10
  commit went to great pains to avoid.

  Fix: compare peer age against PEER_STALE_THRESHOLD_SECONDS (60s default).
  Matches the existing isPeerServerStale() semantics — a peer is frozen
  only when it has genuinely stopped heartbeating, not just between ticks.
  Log now shows both numbers: 'heartbeat stale Xms > Yms, waited Zms'.

Bug B: RACE log spam.
  The last_server poll logged a line every 500ms — up to 120 lines per
  cross-server join with no new information after the first few. With
  multiple concurrent joins this made sync.log unreadable. Now the RACE
  line only fires every 10 attempts (every 5s at default interval),
  plus the decision points (heartbeat-stale force-claim, slow-peer warn).

Also routes [perf-logout] breakdown to sync.log via SyncLogger.perf
so field reports include the core/backpacks/ss/rs2 split — we were
logging it only to server.log which admins rarely forward.
2026-04-22 07:52:49 +02:00
laforetbrut
3a53ff2302 Phase 10: real durations in logs + safer Phase 9 (no force-claim before peer flush)
Two critical diagnostic/correctness improvements after user field report:
  - '20s latency between inventory syncs with a full test inventory'
  - 'duplication on throw + deposit in chest'
  - 'bad sync on fast inter-server transfer if disconnect too quickly after modification'

(1) Real durations — 'completed in 0ms' was a lie
    Every SyncLogger.saveCompleted / restoreCompleted call hardcoded 0 for the
    duration field. The log line always showed 'in 0ms' regardless of actual
    latency, making the user's 20s-latency reports impossible to reproduce from
    logs alone. Fixed across all 4 save paths (LOGOUT / SHUTDOWN / DEATH /
    EMERGENCY_FLUSH) and the RESTORE path. Durations are measured from the
    start of the BG task (or the start of the restore lock acquisition) to
    just before the success log line.

    New info log 'Logout save completed for {uuid} in {n}ms'
    New warn log '[perf-restore] slow restore for {uuid} ({n}ms)' above 1s
    New info log '[perf-logout] core=Xms backpacks=Yms ss=Zms rs2=Wms total=Nms'
         above 200 ms — breakdown so we can pinpoint which downstream write
         takes the time in the reported 20s cases.

(2) Phase 9 force-takeover could CAUSE duplication
    Phase 9 aimed to fix 30-60s join waits when the previous server was alive
    but the player was ghost-online there. It force-claimed after 5s. But if
    the peer was mid-way through a LEGITIMATE logout save (which is atomic
    with online=1 -> online=0 via writeSnapshotToDB setOffline=true), force-
    claiming before that commit read STALE DB data and restored the player
    from the PRE-disconnect state — e.g., an item the player dropped just
    before disconnect came back in inventory, duplicating with the ItemEntity
    the peer had already spawned in the world.

    Fix: the wait cap is now ADVISORY, not a hard force-claim. Past the cap,
    we only force-claim when the peer's heartbeat has FROZEN (age > cap ms)
    — meaning the peer's process is actually dead or stuck mid-tick, not
    just slow to flush. If the peer is still heartbeating normally, we keep
    waiting: writeSnapshotToDB + online=0 is an atomic UPDATE, so the flush
    WILL land, we just need to be patient. A warn line every 20 attempts
    (10s at default interval) tells admins the save is taking a long time
    so they can profile the peer's DB connection.

    New helper peerHeartbeatAgeMs(id) returns age in ms, Long.MAX_VALUE if
    the peer has no heartbeat row. Used to decide force-claim vs keep-waiting.
2026-04-22 07:32:44 +02:00
laforetbrut
b670794d9a Phase 9: cap wait time on alive-peer ghost sessions (fixes 30-60s join delay)
Reproduction (from production logs, 2026-04-22):
  02:54:13 - 02:54:44  player 724b9ff8 waits 30s for server 1708833664 (60 attempts)
  02:54:31 - 02:55:02  player 46284b41 waits 30s for server 0 (zombie)
  05:10:53 - 05:11:55  player 95d0db86 waits 62s for server 1708833664 (120 attempts)
  05:10:59 - 05:12:01  player 724b9ff8 waits 62s for server 1708833664 (120 attempts)

User report: 'un joueur se connecte et son inventaire s'affiche 30 secondes
après sa connexion'.

Root cause: doPlayerJoin's last_server poll waits for the previous server to
clear online=0. If the peer is alive (heartbeat fresh) but the player is
ghost-online there (proxy bypass, network drop, or actively playing on the
other server without clean logout), the peer NEVER flushes → we wait the
full join_poll_max_attempts * join_poll_interval_ms (60s default) for
nothing. Meanwhile the player sees an empty inventory on this server.

The zombie-peer short-circuit already handled dead peers. This commit adds
the complementary case: ALIVE peers with a stuck session.

Fix:
  - New config key join_peer_alive_max_wait_seconds (default 5, range 0-600).
  - When the peer's heartbeat is fresh but player.online is still 1,
    wait at most this many seconds, then force-claim ownership by setting
    online=0 AND last_server=self.
  - The peer will be prevented from overwriting us: writeSnapshotToDB
    already has the last_server guard (added in Phase 2) which blocks any
    future save the peer issues for this player — they see a GUARD log
    and skip downstream backpack/SS/RS2 writes.
  - Default 5s is a reasonable trade-off: legitimate slow flushes complete
    within that window, ghost sessions don't block the player 60s+.
  - Set to 0 to force-claim immediately (most aggressive, best for proxies).
  - Set high to restore the legacy behavior (wait full poll length).

Also removed the per-tick 'Player X still being saved...' LOGGER.info line
that was spamming the Minecraft server log every 500ms during a ghost wait
— the SyncLogger.raceCondition entry already captures the same information
in the dedicated sync.log and avoids polluting server.log with 120+ lines
per join.
2026-04-22 07:16:47 +02:00
laforetbrut
131aa64eb1 Add /playersync inventory viewer
New op command to pretty-print a player's stored inventory from the DB.
Works on offline players — reads the serialized columns directly and
deserializes each slot through the same deserializeAndCreatePlaceholderIfNeeded
path used by the normal restore.

Usage:
  /playersync inventory <player>              — everything (main + armor + ender + curios)
  /playersync inventory <player> main         — 36-slot hotbar + main inventory only
  /playersync inventory <player> armor        — 4 armor slots (0=boots, 1=legs, 2=chest, 3=helm)
  /playersync inventory <player> ender        — 27 ender chest slots
  /playersync inventory <player> curios       — Curios slots (funct + cosmetic), composite-keyed

Output per section lists only non-empty slots:
  [5] minecraft:diamond_sword x1
  [8] sophisticatedbackpacks:backpack x1 (Gilded Backpack)
  [cos🔙0] [placeholder] minecraft:paper x1   <- cross-server missing mod

Placeholder items (items from a mod not loaded on this server) are tagged
[placeholder] in magenta so admins can see at a glance which slots contain
'travelling' items. Parse errors on a single slot don't break the listing —
the affected slot shows <parse error: ClassName> and the rest continues.

Help listing updated. No other behavior changed.
2026-04-22 07:03:08 +02:00
laforetbrut
4597041b1a Tutorial banner when MySQL init fails on a dedicated server
If the admin installs PlayerSync without configuring a reachable database,
onServerStarting used to throw SQLException and either crash the server or
spam a raw JDBC stack trace with no guidance. Now the whole init is wrapped
in a single try/catch that prints a large, readable banner to the console:

  - What failed (root cause summary, message truncated to 180 chars)
  - Current config values (host, port, user, db, password status)
  - A 5-step checklist:
      1. Is the DB reachable (telnet / mysql CLI hints)
      2. Is the password still the default placeholder
      3. Docker compose up for local dev
      4. GRANT + bind-address reminders
      5. How to skip PlayerSync entirely for a session
  - Then the full stack trace for bug reports.

The server keeps booting — sync operations will no-op until the DB comes
back. Avoids the 'server crashed, no idea why' experience for first-time
users.

Detection of placeholder credentials (password == 'pleaseChangeThisPassword'
or host == 'localhost') also emits a WARN line up-front so the tutorial
context is primed even when the connection itself would have succeeded.
2026-04-22 06:55:20 +02:00
laforetbrut
2361ffb272 jarJar: declare version ranges for MySQL + HikariCP
Enables co-installation with arcadia-lib2 which embeds
  HikariCP in [5.1.0, 6.0.0)
  mysql-connector-j in [8.3.0, 9.0.0)

Before: PlayerSync declared its embedded libs with no range, only the
exact version (9.3.0 / 5.1.0). When another mod declared a range that
did not include our exact version, NeoForge's jarJar resolver had no
valid overlap and would either refuse to load or arbitrary-pick one
version, risking runtime breakage.

After:
  - mysql-connector-j: strictly [8.3.0, 10.0.0), prefer 9.3.0.
    Intersects arcadia-lib's [8.3.0, 9.0.0) — resolver picks 8.3.0
    when both mods are present. 8.3.0 and 9.3.0 share the same
    Connection / PreparedStatement / ResultSet APIs we actually use,
    so downgrade is safe.
  - HikariCP: strictly [5.1.0, 6.0.0), prefer 5.1.0. Identical to
    arcadia-lib's declared range — shared single instance.

No code changes — only the metadata shipped in META-INF/jarjar/metadata.json.
Verified via unzip -p that the range is correctly emitted.
2026-04-22 06:46:24 +02:00
laforetbrut
d818794a20 Phase 8 fix: preserve config backward compatibility
The Phase 8 refactor moved the connection keys (host, password, Server_id,
etc.) from [general] into a new [connection] section. On servers with an
existing playersync-common.toml this would silently reset:
  - host to 'localhost'
  - password to 'pleaseChangeThisPassword'
  - Server_id to a new random value

The last one is the worst: every player_data row with last_server=<old_id>
would momentarily point to a zombie peer until the next heartbeat tick.

Fix: move every key that already existed in 2.1.4 configs back into
[general]. Only genuinely new keys (save_triggers, sync_toggles,
performance, safety, observability) stay in their new sections. Existing
users upgrading see their old [general] block load correctly; the new
sections get created with defaults on first boot and don't wipe anything.

Also adds modid=PlayerSync.MODID to CommandInit's @EventBusSubscriber
so RegisterCommandsEvent is guaranteed to fire under our mod's bus scope.
2026-04-22 06:38:27 +02:00
laforetbrut
c7487196ec Phase 8: 20+ new config keys + 14 admin commands (/playersync)
Config (JdbcConfig.java completely restructured into sections):

  connection
    host, port, use_ssl, user_name, password, db_name, table_prefix, Server_id
  general
    sync_world, sync_advancements, kick_when_already_online,
    kick_message, kick_grace_period_ms, use_legacy_serialization,
    item_placeholder_title_override, item_placeholder_description_override
  save_triggers
    auto_save_interval_minutes (0-1440, default 10)
    save_on_dimension_change (default false)
    save_on_death (default true)
    save_on_respawn (default true)
  sync_toggles
    sync_inventory, sync_ender_chest, sync_xp, sync_effects,
    sync_health_food, sync_curios, sync_accessories, sync_backpacks,
    sync_cosmetic_armor, sync_refined_storage (all default true)
  performance
    heartbeat_interval_seconds (5-600, default 30)
    peer_stale_threshold_seconds (10-3600, default 60)
    join_poll_max_attempts (10-600, default 120)
    join_poll_interval_ms (100-5000, default 500)
    pool_stats_interval_minutes (0-1440, default 5)
    hikari_pool_max_size (1-200, default 15)
    hikari_leak_threshold_ms (2000-600000, default 25000)
  safety
    refuse_empty_inventory_write (default true) — enforced in writeSnapshotToDB
    max_inventory_size_bytes (default 10 MB)
    skip_saves_when_tps_below (0-20, default 0 = never)
  observability
    log_structured_json (future use)
    log_rotation_size_mb (default 10)
    log_rotation_max_files (default 5)

Wiring
  - HeartbeatService reads heartbeat_interval_seconds at start.
  - PoolStatsReporter reads pool_stats_interval_minutes (0 disables).
  - doPlayerJoin poll uses join_poll_max_attempts + join_poll_interval_ms +
    peer_stale_threshold_seconds.
  - writeSnapshotToDB: refuse_empty guard + max_inventory_size_bytes guard
    before core UPDATE. Both log via SyncLogger.dataLoss / .nbtAnomaly.
  - Restore-side toggles: applyCuriosFromData, applyAccessoriesFromData,
    applyCosmeticArmorFromData, doBackPackRestore, restoreRefinedStorageDisks
    all short-circuit when their toggle is false.

Commands — new /playersync tree (perm level 2 required):

  status             — server id + heartbeat age + exec/Hikari stats + online
  poolstats          — log current stats immediately
  flush [player]     — force save all / one
  info <player>      — DB row metadata
  dump <player>      — dump full DB row to server log
  resync <player>    — clear synced tag + kick to force re-restore
  wipe <player> confirm  — DELETE all rows (DANGER, double-keyword required)
  orphans            — list stuck online=1 rows on dead peers
  clearorphans [id]  — clear orphans (global or by server_id)
  peers              — list peer servers with ALIVE/STALE/STOPPED tag
  peerkill <id>      — force-disable a zombie peer
  cleanup            — orphans + stale peers in one shot
  reload             — note about runtime reload scope
  help               — in-chat command reference

Every command logs to SyncLogger as ADMIN_<OP> for audit trail.

Infrastructure
  - JDBCsetUp.executePreparedUpdateRet(String, Object...) returns rows-affected
    for commands that need meaningful counts.
  - VanillaSync.getExecutor() exposes the thread pool for read-only stats access
    from admin commands (replaces reflection use in PoolStatsReporter eventually).
2026-04-22 06:34:02 +02:00
laforetbrut
44178e020e Phase 7: server-perf hardening (hash-skip + batch + heartbeat tuning)
Based on a fresh audit against the Arcadia V2 modpack (444 mods, including
Curios + Accessories + SophisticatedBackpacks/Storage + RS2 + Cosmetic
Armor Reworked). Three perf wins + two opportunistic fixes.

Perf
  - Heartbeat period 10s -> 30s. Paired with the 60s staleness threshold
    this keeps failure-detection latency unchanged while cutting 3x the
    server_info UPDATE traffic per server.
  - Per-player hash-skip for unchanged snapshots (SaveToFile + staggered
    auto-save). computeSnapshotHash() rolls over inventory/equipment/
    enderchest/effects/xp/health/food/mod-data; when an auto-save produces
    the same hash as the last successful write, the BG task returns early
    and no UPDATE hits MySQL. Idle-server reduction is >95%. Logout /
    shutdown / death never use the skip and refresh the hash on success
    so post-logout rejoin doesn't wrongly skip.
  - Batched backpack + SS saves. saveBackpackSnapshots / saveSSSnapshots
    now build one transaction via executeBatchTransaction instead of
    N sequential REPLACE INTO calls. A player with 3 backpacks + 2
    shulkers drops from 5 network round-trips to 1 per logout save.
    Per-entry fallback preserved on transaction failure.
  - Periodic-save tick short-circuits when the player list is empty —
    no main-thread hop, no log line, no DB heartbeat on empty servers.

Compat notes (no code change needed)
  - CosmeticArmours (modid=cosmeticarmoursmod) items are worn in vanilla
    armor slots (Helmet / Chestplate / Leggings / Boots inner classes) —
    already captured by the core armor[] serialization. No handler needed.
  - CosmeticWeapons uses the same pattern via main hand / offhand — also
    already covered by core inventory serialization.

Cleanup
  - removePlayerLock now also clears the hash cache so a player who
    fully logged out doesn't leave a stale hash behind.
2026-04-22 06:17:28 +02:00
laforetbrut
a83543853c Phase 6: docs (CHANGELOG, ERROR_LOG, TEST_PROCEDURE)
Adds three documentation files covering the Phase 0-5 hardening work:

CHANGELOG.md
  - Bilingual EN/FR, strict template (English first, then ---, then French).
  - Version section 2.1.5 dated 2026-04-22 (NO version bump per
    CLAUDE.md version-lock rule).
  - Sections: Fixed / Added / Changed / Correctifs / Ajouts / Modifications.

ERROR_LOG.md
  - Journal of 8 bugs discovered and fixed during the hardening sweep.
  - Each entry: Context / Error / Root cause / Fix / Prevention rule.
  - Cross-references commits bea5f80 / c84f920 / 746cb56 / c70ca9f / bd0482c.

TEST_PROCEDURE_v2.1.5.html
  - Self-contained HTML (no external deps), bilingual EN/FR.
  - 10 test scenarios tagged CRITICAL / HIGH / MEDIUM with Setup, Steps,
    Expected Results, and a regression-check block.
  - Covers: drop+deco+reco, backpack dup, SS shulker dup, kill -9 recovery,
    zombie-peer short-circuit, periodic save, pool stats, heartbeat,
    curios cap unavailable, cross-server claim.
2026-04-22 06:09:08 +02:00
laforetbrut
bd0482cb76 Phase 5: structured logging + periodic pool-stats reporter
SyncLogger additions
  - containerForceClosed(uuid, reason)
  - modCompatSkip / modCompatSaved / modCompatRestored (per-mod tracing)
  - storageSave(storageUuid, kind, detail) for backpack/SS/RS2 lines
  - poolStats(exec active/queue/idle, hikari active/idle)
  - warnPlayer / nbtAnomaly generic helpers

PoolStatsReporter.java
  - Dedicated single-thread daemon scheduler, 5-min cadence.
  - Reads VanillaSync.executorService stats via reflection.
  - Reads HikariCP MBean via new JDBCsetUp.getPoolMXBean().
  - Emits WARN logs when executor queue > 400/512 or Hikari active >= 14/15
    so admins see saturation trends before they become outages.

JDBCsetUp.getPoolMXBean()
  - Public accessor for the HikariCP pool MBean. Returns null when pool
    is uninitialised / closed.

Wire-in: PlayerSync.onServerStarting starts the reporter, onServerShutdown
stops it before pool close.

Instrumentation
  - VanillaSync.onPlayerLogout logs containerForceClosed for self + viewer
    containers.
  - ModCompatSync.snapshotAccessories logs modCompatSkip when cap==null.
2026-04-22 06:03:52 +02:00
laforetbrut
c70ca9f464 Phase 4: 10-min periodic save + dimension-change trigger
Adds two new triggers that complement NeoForge's vanilla SaveToFile event:

PeriodicSaveService.java
  - Dedicated single-thread daemon scheduler, started after server boot.
  - Ticks every 'auto_save_interval_minutes' (config, default 10 min).
  - On each tick: hops to main thread, snapshots every online synced
    player via VanillaSync.snapshotAndQueueSave, async BG writes with full
    P0 guard stack (pendingLogoutSaves + online=0 + bgLock tryLock).
  - Set interval to 0 to disable.

VanillaSync.snapshotAndQueueSave(Player, String label)
  - Extracted from onPlayerSaveToFile body; public entry point shared by
    PeriodicSaveService, onPlayerChangeDimension, and the existing SaveToFile
    event. Label flows into logs for traceability (SaveToFile / PERIODIC / DIMENSION).

VanillaSync.onPlayerChangeDimension
  - New @SubscribeEvent on PlayerChangedDimensionEvent, gated by
    'save_on_dimension_change' config (default false). Queues a full save
    when a player teleports across dimensions, protecting against mid-
    teleport crashes.

JdbcConfig
  - Added AUTO_SAVE_INTERVAL_MINUTES (int, 0-1440, default 10)
  - Added SAVE_ON_DIMENSION_CHANGE (bool, default false)

VanillaSync.onServerShutdown also stops PeriodicSaveService before the pool
close, same pattern as HeartbeatService.
2026-04-22 06:01:55 +02:00
laforetbrut
746cb56275 Phase 3: anti-loss infrastructure (shutdown hook + heartbeat + crash recovery)
Adds three utilities to harden PlayerSync against ungraceful server exits:

CrashRecovery.java
  - installShutdownHook: registers a non-daemon JVM shutdown hook that calls
    VanillaSync.emergencyFlushAll() synchronously when the process is killed
    (SIGTERM, kill, OOM, host reboot). Covers the case where the normal
    ServerStoppingEvent path never runs.
  - clearOrphanedOnlineFlags: on startup, clears any online=1 player_data
    rows pointing to this server_id (left by a previous crash). Reports the
    count via SyncLogger so admins can see recovery activity.
  - reportZombiePeers: logs peer server_ids whose heartbeat is missing or
    stale (>60s), exposing the root of doPlayerJoin poll timeouts.

HeartbeatService.java
  - Single-thread daemon scheduler pinging server_info.last_update every 10s.
  - Lets peer servers distinguish live from dead via isPeerServerStale().
  - Stopped explicitly in VanillaSync.onServerShutdown before pool close.

VanillaSync.emergencyFlushAll()
  - Synchronous best-effort flush for every online player. No executor, no
    locks — the server is dying, we just want data on disk. Writes player_data,
    backpacks, SS, RS2 directly; logs SAVE/SKIPPED/FAILED per player via
    SyncLogger so post-mortem analysis is possible.

PlayerSync.onServerStarting wires the four new calls after table init.

Fixes the production issue where players remained online=1 forever after
kill -9 and the 30s poll timeouts waiting for zombie server_ids.
2026-04-22 05:44:19 +02:00
laforetbrut
c84f920d11 Phase 2: hardened anti-dup + zombie-server detection + guard propagation
P0-1: Backpack/SS clear-before-restore now has a belt-and-suspenders
reflection fallback if the public removeBackpackContents / removeStorageContents
API fails. setBackpackContents / setStorageContents receive a defensive NBT
copy to prevent upstream from mutating the cached snapshot.

P0-2: writeSnapshotToDB now returns a boolean. When the last_server guard
blocks the core player_data UPDATE (another server claimed the player),
the downstream backpack / SS / RS2 saves are skipped instead of overwriting
the claiming server's rows. Affects logout, shutdown, staggered auto-save,
and death-save paths.

P1-1: StoreCurios now aborts when the Curios capability is unavailable
(dead player, mod init race) instead of writing an empty flatMap that
would wipe the DB row.

P1-3: doPlayerJoin last_server poll raised 60→120 attempts (30s→60s)
and gained a zombie-server short-circuit: if the peer server_id is 0
(legacy / corrupted), or its server_info heartbeat is older than 60s,
the poll takes over immediately and force-clears the orphaned online=1.
Fixes the user-observed 'attempt 60/60' loops on server_id=0 and stale
heartbeats.

Staggered auto-save and death-save BG tasks also gained the P0-a/b/c
guards introduced in bea5f80 (pendingLogoutSaves + online=0 DB check).
2026-04-22 05:40:16 +02:00
laforetbrut
bea5f80e3a Fix critical item duplication race (drop+deco+reco)
Root cause: auto-save BG task queued before logout could acquire bgLock and
write a stale snapshot AFTER the logout BG task had committed fresh data +
online=0. On reconnect, the stale inventory was restored while the dropped
ItemEntity remained on the ground -> duplication.

Three-layer guard applied to onPlayerSaveToFile and onLivingDeath BG tasks:
  1. Early skip if pendingLogoutSaves contains the player (before tryLock)
  2. Re-check pendingLogoutSaves after acquiring bgLock (race window)
  3. SELECT online from player_data before write; skip if online=0

Logout BG task now acquires bgLock via .lock() (blocking) so concurrent
auto-save / death-save tasks using tryLock either skip cleanly or wait.
removePlayerLock reordered before bgLock.unlock so late auto-save BGs see
containsKey=false and skip.
2026-04-22 05:28:36 +02:00
laforetbrut
f334b44a55 Add compat-mods staging folder for mod compatibility analysis
Local .jar staging area for inspecting mod APIs and writing compatibility
shims. Binaries git-ignored; README documents the purpose and conventions.
2026-04-22 03:33:11 +02:00
laforetbrut
13de5b65c0 Fix backpack/curios dup, perf overhaul, drop chat+cobblemon
Root cause of backpack duplication: Sophisticated Backpacks'
setBackpackContents merges shallowly when the UUID exists, so stale
sub-tags survived every restore. doBackPackRestore now calls
removeBackpackContents before setBackpackContents for a clean replace.

Curios cosmetic stacks (getCosmeticStacks) are now snapshotted, applied,
restored and cached on all paths. Old-format rows without the "cos:"
prefix still parse unchanged, so existing DB data is preserved on upgrade.

closeContainer no longer matches by class-name substring (was closing
unrelated mod menus containing "curio"/"accessor"). Only menus whose
slots reference the disconnecting player's inventory/ender-chest are
closed.

Thread-safety: Sophisticated Storage contents are now snapshotted on the
main thread (snapshotSSData + saveSSSnapshots) instead of read from a
background thread racing with world ticks.

Event priority / defensive guards:
- onPlayerDeath is now EventPriority.LOW and skips cancelled events so
  Revive Me / Corail Tombstone's cancel runs first.
- onServerStarting short-circuits on integrated (single-player) servers
  to avoid noisy MySQL connection attempts.

Observability:
- executeBatchTransaction now returns per-statement row counts.
- writeSnapshotToDB calls SyncLogger.guardBlocked when the core UPDATE
  silently no-ops (another server claimed last_server).
- SyncLogger uses a daemon scheduler that flushes every 500 ms; shutdown
  happens after parallel saves so final save logs are no longer dropped.
- Rollback failures inside executeBatchTransaction and
  refreshInventoryForInputOutput are now logged instead of swallowed.

HikariCP retuned: maxPoolSize 25->15, connectionTimeout 30->10s,
idleTimeout 600->300s, leakDetectionThreshold 10->25s (covers worst-case
join polling without log spam).

New table_prefix config option (Tables helper) lets a user share one
MySQL database with other mods without table-name collisions. Default
is empty to preserve backward compatibility.

Reflection Methods for NeoForge AttachmentHolder are resolved once in
a static initializer and cached.

Chat sync and Cobblemon integration removed:
- Chat sync: 319 LoC of socket/thread code guarded by a config flag that
  defaulted to false; orphaned config keys are silently ignored by the
  NeoForge ModConfig loader, so no crash on upgrade.
- Cobblemon: 297 LoC of mixins that ran synchronous JDBC on the main
  thread and built SQL with raw UUID concatenation. The existing
  cobblemon table in the DB is left untouched on upgrade.

Also fixes cobblemon ALTER TABLE running blindly on every boot
(alterColumnIfNeeded helper checks INFORMATION_SCHEMA first).

Author: vyrriox
2026-04-22 02:50:26 +02:00
laforetbrut
edf63aeb8c Add dedicated PlayerSync diagnostic log file (logs/playersync/sync.log)
New SyncLogger utility class:
- Writes to logs/playersync/sync.log (separate from MC console)
- Automatic rotation: 10MB max per file, 5 files kept
- Thread-safe: lock-free ConcurrentLinkedQueue + async flush
- Categorized log levels: INFO, WARN, ERROR, DUPE_RISK, DATA_LOSS,
  RACE, PERF_SLOW, SAVE, SAVE_FAIL, SAVE_SKIP, RESTORE, EVENT, GUARD

Tracked events:
- Every player join/leave with sync status
- Every save (logout, shutdown, death, auto-save) with duration
- Save failures with error details
- Saves skipped (uncompleted sync, dead player)
- Cross-server race conditions (poll loop waiting)
- Player disconnects before sync apply (potential data loss)
- Duplicate login kicks
- Slow operations (> 50ms threshold)

Usage: check logs/playersync/sync.log on your server for diagnostics.
Look for DUPE_RISK, DATA_LOSS, RACE, SAVE_FAIL entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:12:31 +02:00
laforetbrut
57f7925c2f Perf: MySQL connection tuning, batch transactions, leak detection
MySQL connection string optimizations:
- rewriteBatchedStatements=true: rewrites batch INSERTs into multi-row (5-30x)
- cachePrepStmts=true + useServerPrepStmts=true: server-side prepared
  statement caching, avoids re-parsing identical queries (15-25% CPU reduction)
- prepStmtCacheSize=256: keeps 256 compiled statements warm
- useCompression=true: compresses network traffic (40-60% for large NBT blobs)
- tcpNoDelay=true: disables Nagle's algorithm for lower latency

Batch transaction for writeSnapshotToDB:
- New JDBCsetUp.executeBatchTransaction() executes multiple SQL statements
  in a SINGLE transaction on ONE connection with automatic rollback.
- writeSnapshotToDB now batches all 4-8 queries (player_data + curios +
  mod_player_data) into one connection borrow + one commit.
- Previous: 4-8 separate getConnection() + executeUpdate() + close() calls
  per player save = 4-8 network round-trips.
- Now: 1 getConnection() + N executeUpdate() + 1 commit() + 1 close()
  = 1 network round-trip for the transaction.
- With 35 players: 140-280 connection borrows → 35 connection borrows.

HikariCP leak detection:
- Added leakDetectionThreshold=10000ms to detect connections held > 10s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:06:22 +02:00
laforetbrut
b4d863efa2 Perf: staggered auto-save, pool scaling, cached kick check
CRITICAL PERF - Staggered auto-save:
- Old: all 35 players snapshotted in ONE tick → 770-3605ms MSPT spike
  (15-36 second TPS drop every 5 minutes)
- New: queue filled every 5min, drained 1 player/tick → max 22-103ms/tick
- autoSaveQueue processes one player per server tick, imperceptible impact

CRITICAL PERF - Pool scaling for 35+ players:
- Thread pool: 2-8 → 4-16 threads, queue 256 → 512
  Prevents CallerRunsPolicy from executing DB tasks on main thread
- HikariCP: 10 → 25 max connections, 2 → 4 min idle
  Prevents connection starvation during concurrent saves

HIGH PERF - Cached kick check (eliminates main thread DB queries):
- doPlayerConnect (network thread) caches online/lastServer/serverAlive
- onPlayerLoggedInKickCheck (MAIN thread) reuses cached result
- Fast path: 1 DB query on main thread instead of 2-4
- Fallback: full DB check if cache miss (race condition safety)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:33:02 +02:00
laforetbrut
badc87c84e Fix backpack crash loss, ender chest restore, ReviveMe compat, effect sync
Backpack data loss on server crash:
- Periodic auto-save (every 5min) now includes backpack content snapshots.
  Previously backpacks were only saved on logout/shutdown — hard crashes
  (OOM, watchdog, kill -9) skipped both, losing all backpack changes.
- snapshotBackpackData captures NBT with .copy() on main thread.

Backpack ender chest restore mismatch:
- doBackPackRestore now scans ender chest in addition to main inventory.
  Save side already scanned ender chest, but restore didn't — backpacks
  in ender chest were saved to DB but never restored on join.

ReviveMe mod compatibility:
- Dead player kick check now uses health <= 0 instead of isDeadOrDying().
  ReviveMe puts players in a "downed" state (alive but isDeadOrDying=true)
  — previously these players were kicked on join.

Infinite effect filtering (phantom effects fix):
- Effects with infinite duration are now skipped during save. These come
  from ReviveMe (downed state effects with MAX_VALUE duration), beacons,
  and other mods. Syncing them across servers caused phantom effects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:24:18 +02:00
laforetbrut
1d30184aba Fix critical data loss, backpack duplication, and ender chest sync
CRITICAL - New player data loss (players lose everything):
- store() INSERT now includes last_server column. Without it, last_server
  stayed NULL, causing ALL subsequent writes (AND last_server=?) to fail
  silently — new players' data was never saved after initial INSERT.
- writeSnapshotToDB now handles legacy NULL last_server with
  (last_server=? OR last_server IS NULL) and auto-claims ownership.
- Same NULL handling in writeGuardedModData for mod_player_data table.

CRITICAL - online=0 stuck at 1 (players unable to connect):
- Removed AND last_server=? from deadPlayerWhileLogging and
  syncNotCompletedPlayer logout paths. These fire before doPlayerJoin
  sets last_server, so the guard always failed → online stayed 1.

CRITICAL - Backpack duplication via viewer race:
- snapshotBackpackData() now captures backpack NBT on the MAIN THREAD
  (not just UUIDs). Previously saveBackpacksByUuids read BackpackStorage
  on an async thread — another player viewing the backpack could take
  items between the main-thread refresh and the async read.
- .copy() freezes the NBT state at snapshot time.

CRITICAL - Backpacks in ender chest not synced:
- snapshotBackpackData() and doBackPackRestore now scan the ender chest
  in addition to main inventory. PlayerInventoryProvider.runOnBackpacks
  only scans equipment/inventory, missing ender chest backpacks entirely.

Anti-duplication - Container closing on disconnect:
- Owner's container menu is force-closed before snapshot to prevent
  post-snapshot modifications by viewers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:00:18 +02:00
laforetbrut
f042058e5b Fix Accessories/CosmeticArmor duplication + guard remaining online=0
Accessories & CosmeticArmor duplication fix:
- snapshotAccessories() and snapshotCosmeticArmor() returned null when
  slots were empty, causing writeModSnapshot to SKIP the write. The DB
  kept stale data from when slots had items, restoring them on next join.
- Now return "{}" (like snapshotCuriosData already does), so empty state
  is properly written to DB. On restore, apply*FromData clears slots
  when it sees "{}" (length <= 2).

Guard remaining online=0 writes:
- deadPlayerWhileLogging and syncNotCompletedPlayer logout paths now
  use AND last_server=? to prevent setting online=0 for a player that
  already moved to another server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 20:26:10 +02:00
laforetbrut
8f40d5b27f Fix critical cross-server duplication race + memory leak + atomic saves
CRITICAL FIX - Stale server overwrite prevention:
- writeSnapshotToDB now guards ALL writes with AND last_server=? so a
  crashing/slow server cannot overwrite fresher data saved by another server
- Logout and shutdown saves atomically set online=0 in the SAME UPDATE as
  the data write (no more gap between data write and flag set)
- ModCompatSync.writeModSnapshot guarded variant uses subquery on last_server

CRITICAL FIX - Poll loop actually waits now:
- onPlayerLoggedInKickCheck no longer sets last_server (only online=1)
- last_server is claimed AFTER the poll in doPlayerJoin completes
- This allows the poll to correctly detect and wait for the old server's
  async save to finish before reading data
- Poll increased from 30 to 60 attempts (30s window)

Memory leak fix:
- Added removePlayerLock() in doPlayerJoin's outer catch block to prevent
  unbounded growth of playerLocks ConcurrentHashMap on exceptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 07:47:38 +02:00
laforetbrut
1dfdd43908 Fix advancement wipe, phantom effects on death, and advancements COALESCE
- Advancements: default to null instead of "" in snapshotPlayerData, use
  COALESCE(?, advancements) in SQL so failed file reads preserve DB value
  instead of silently wiping advancements every 5min periodic save
- Effects: skip saving effects when player isDeadOrDying() — Minecraft
  clears effects on respawn not death, so pre-death effects were persisted
  in DB and restored as phantom effects on next login
- Legacy store() also uses COALESCE(NULLIF(?, ''), advancements)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 12:52:14 +02:00
laforetbrut
eec949f405 Fix anti-duplication: clear slots before restoring data 2026-04-04 07:16:50 +02:00
laforetbrut
a8c0cb50af Update VanillaSync.java 2026-03-31 03:51:01 +02:00
laforetbrut
59bd884263 perf: zero JDBC on server thread + HikariCP + parallel shutdown + audit fixes
- Migrate connection pool from manual LinkedBlockingQueue to HikariCP
  (eliminates isValid() ping on every query visible in Spark profiler)
- Move ALL DB writes off server thread: logout uses snapshot+async+latch,
  shutdown uses snapshot+CompletableFuture.allOf for parallel saves
- Pre-read curios/accessories/cosmeticarmor/attachments on background
  thread during login (4-7 fewer DB queries on main thread per login)
- Auto-save interval increased to 5 minutes
- Fix pool shutdown ordering: shutdownPool() now runs AFTER all shutdown
  saves complete (previously could fire before, silently losing all data)
- Fix connection leak in executeQuery/executePreparedQuery when
  prepareStatement throws (leaked connections exhaust HikariCP pool)
- Fix duplication bug: saveStorageContents guard used nbt.size()<=1 which
  blocked legitimately emptied backpacks from saving to DB
- Fix stale SaveToFile overwriting logout: check playerLocks.containsKey
  before writing to prevent stale background task from regressing data
- Remove LIMIT 1000 on startup online=0 reset (could leave players stuck)
- Add executorService.shutdown() on server stop to prevent JVM hang
- Add apply methods (applyCuriosFromData, applyAccessoriesFromData, etc.)
  to separate entity writes from DB reads for thread-safe restore
- Add UUID collectors (collectBackpackUuids, collectSSUuids) and
  background save methods for snapshot+async logout/shutdown pattern
2026-03-29 18:58:27 +02:00
laforetbrut
4999c372ec perf: eliminate synchronous MySQL calls on server main thread
Root cause of lag (TPS 9-16, MSPT spikes to 4846ms):
PlayerEvent.SaveToFile triggered synchronous JDBC writes on the
server main thread every Minecraft autosave cycle. With 35 players
this caused hundreds of network round-trips to MySQL blocking the
tick loop for up to 4846ms (97x the 50ms limit).

Fixes applied:
- onPlayerSaveToFile: now fully async. Entity state is snapshotted
  on the main thread (pure memory ops, <1ms), then ALL DB writes are
  submitted to the background executor. Main thread never blocks on
  MySQL again.

- snapshotPlayerData: now captures ALL entity-dependent mod data
  (Curios, Accessories, CosmeticArmor, NeoForge attachments) on the
  main thread. Previously these were read from a background thread
  which is not thread-safe and could cause data corruption.

- writeSnapshotToDB: single method that writes all player data in one
  background pass: player_data + curios + mod_player_data.

- Auto-save background task: removed ModCompatSync.storeAll(player),
  storeSophisticatedBackpacks, storeSophisticatedStorageItems,
  storeRefinedStorageDisks from background thread. These all accessed
  entity state off-thread. Mod compat data is now in the main-thread
  snapshot; backpack/SS/RS2 contents are saved on logout/shutdown.

- Added ModCompatSync snapshot API: snapshotAccessories(),
  snapshotCosmeticArmor(), snapshotAttachments(), writeModSnapshot()
  for clean separation of entity reads vs DB writes.
2026-03-27 14:15:29 +01:00
laforetbrut
04a1f0128e Optimize: move ALL DB writes off main thread + increase auto-save to 2min
Spark showed 5.66% server thread from auto-save. Breakdown:
- store() DB write: 1.39% (already moved to background)
- StoreCurios DB write: 0.56% (was on main thread)
- storeAccessories DB write: 0.55% (was on main thread)
- storeCosmeticArmor DB write: 0.56% (was on main thread)
- storeNeoForgeAttachments DB write: 0.58% (was on main thread)
- storeSophisticatedStorage: 0.69% (was on main thread)
- storeSophisticatedBackpacks: 0.59% (was on main thread)

Changes:
1. Curios snapshot: new snapshotCuriosData() reads entity state on
   main thread (fast), returns serialized string. DB write in background.
2. ALL mod saves moved to background thread lambda:
   - ModCompatSync.storeAll (Accessories, CosmeticArmor, Attachments)
   - Sophisticated Backpacks/Storage/RS2
3. Auto-save interval doubled: 1200 -> 2400 ticks (1min -> 2min)
4. Main thread now only does: entity snapshot (~0.3ms) + curios snapshot

Expected: ~80% reduction in main thread usage (5.66% -> ~1%)

Vyrriox
2026-03-26 22:17:25 +01:00
laforetbrut
7613f4ecfb Fix backpack/shulker contents lost on transfer: never overwrite DB with empty data
ROOT CAUSE: Sophisticated Backpacks/Storage wrappers cache inventory
in memory. When store() reads from BackpackStorage/ItemContentsStorage,
the SavedData may not have the latest wrapper state (unflushed changes).
This returns empty/default NBT which overwrites the real data in our DB.

Going back to the original server showed data because that server's
local SavedData still had the correct data (never overwritten).

FIX: saveStorageContents() now checks if the NBT is empty/minimal
before writing. If the DB already has substantial data (>50 bytes)
and the new NBT is empty, the save is SKIPPED to preserve the real
data. This prevents the empty-overwrite scenario while still allowing
legitimate saves of actual content.

Vyrriox
2026-03-26 22:09:51 +01:00
laforetbrut
e511414463 Final hardening: online=0 in finally + auto-save race fix
CRITICAL-1: online=0 moved to finally block in logout handler.
If store() threw an exception, online=0 was never written and the
player was permanently locked out of all servers.

CRITICAL-2: Same fix for shutdown handler. Any save failure during
shutdown left the player permanently stuck as online=1.

IMPORTANT: Auto-save background DB write now acquires tryLock()
before writing. If logout already saved newer data and holds/held
the lock, the stale auto-save snapshot is skipped. Prevents
overwriting correct logout data with an older snapshot.

Vyrriox
2026-03-26 22:06:38 +01:00
laforetbrut
1bf2a67e8d Optimize auto-save: snapshot on main thread, DB write on background
Spark showed 5.66% server thread from auto-save DB writes blocking
the tick loop (~1-2ms per player per query, ~8 queries per save).

New approach:
- snapshotPlayerData() captures ALL entity data into an immutable
  PlayerDataSnapshot record on the main thread (fast, no DB I/O)
- writeSnapshotToDB() writes the snapshot to DB on the background
  thread via executorService (slow DB I/O off main thread)
- Mod data (Curios, Accessories, CosmeticArmor, NeoForge attachments)
  still read entity state on main thread but their DB writes happen
  inline (they manage their own connections)
- Sophisticated Backpacks/Storage/RS2 saves happen during snapshot
  phase on main thread (they need entity access for inventory scan)

Expected: ~60-70% reduction in main thread blocking from auto-save.

Vyrriox
2026-03-26 21:31:43 +01:00
laforetbrut
d60b8eb01e Add connection pool - fix 10% server thread usage from MySQL connects
Spark showed PlayerSync consuming 10.16% of the server thread, almost
entirely from DriverManager.getConnection() (TCP handshake + MySQL auth
+ USE db) called for EVERY single query. With auto-save every 60s,
each player generated ~6 new connections per save cycle on main thread.

FIX: Simple connection pool (LinkedBlockingQueue, 5 connections).
- Connections are reused instead of opened/closed per query
- isValid(2) check before reuse to detect dead connections
- returnConnection() puts connections back in pool instead of closing
- QueryResult.close() also returns to pool
- autoReconnect=true in JDBC URL for resilience
- shutdownPool() for clean server stop
- Non-database connections (startup DDL) bypass the pool

Expected improvement: ~90% reduction in MySQL overhead on server thread.

Vyrriox
2026-03-26 21:13:17 +01:00
laforetbrut
e9620eb07e Fix RS2 restore: wrap entry in UUID key before codec decode
ROOT CAUSE from logs:
  "Invalid UUID capacity: Invalid UUID string: capacity"
  "Invalid UUID resources: Invalid UUID string: resources"

We saved the INNER storage data ({type, capacity, resources}) but the
map codec expects {uuid-string: {type, capacity, resources}}.
The codec tried to parse "capacity", "resources", "type" as UUIDs.

FIX: Wrap the stored NBT back in a UUID-keyed CompoundTag before
decoding: wrapped.put(uuid.toString(), storedNbt)

Also increased sync timeout from 15s to 60s - the server was 34s
behind (691 ticks) causing timeout errors for player sync.

Vyrriox
2026-03-26 20:53:23 +01:00
laforetbrut
12645a1d3d Fix RS2 restore: remove() before set() + reflection fallback
repo.set(uuid, storage) throws IllegalArgumentException if the UUID
already exists in the StorageRepository. This happens when a player
revisits a server where the disk was previously used.

Items appeared briefly (data was decoded correctly) but then the
exception prevented the set() and the storage fell back to empty.

Fix:
- Call repo.remove(uuid) before repo.set(uuid, storage)
- If set() still fails, inject directly into the entries map via
  reflection + mark SavedData dirty
- setDirty() ensures the injected data persists to disk

Vyrriox
2026-03-26 20:37:02 +01:00
laforetbrut
2baa8e4c39 Fix RS2: use createCodec() not getMapCodec() - wrong return type
ROOT CAUSE: getMapCodec(Runnable) returns MapCodec (not Codec).
createCodec(Runnable) returns Codec<Map<UUID, SerializableStorage>>.
Reflection on getMapCodec silently failed because the returned
MapCodec.decode() has a different signature than Codec.decode().

Both save fallback and restore codec paths now use createCodec().

RS2 uses ErrorHandlingMapCodec with UUIDUtil.STRING_CODEC for keys,
so the encoded format IS a CompoundTag with UUID strings as keys.

Vyrriox
2026-03-26 20:25:50 +01:00
laforetbrut
bce7a73cb8 Fix RS2 disk sync: use save() return value + codec reflection fallback
Save side:
- save() returns data in a NEW CompoundTag (fixed in previous commit)
- Now logs full NBT structure for debugging (describeNbtStructure)
- If UUID not found in save() NBT, falls back to reflection on
  internal entries map + codec.encodeStart() to serialize directly

Restore side:
- Rewritten to use raw Codec types to avoid generic compilation issues
- Decodes stored NBT via the same map codec, then repo.set() to inject

Both sides now have comprehensive logging to diagnose any remaining
format issues in production.

Vyrriox
2026-03-26 20:14:26 +01:00
laforetbrut
4e2574a147 Fix RS2 disk save: use return value of SavedData.save()
save() returns the serialized data in a NEW CompoundTag - it does NOT
fill the input parameter. We were passing an empty tag and reading it
back, getting nothing. The actual data was in the return value.

Log showed: "RS2 disk UUID xxx exists in repo but NOT found in save()
NBT. Keys at top: []" - empty because we ignored the return value.

Vyrriox
2026-03-26 19:30:27 +01:00
laforetbrut
50c77f7bb8 Fix last 2 audit issues: syncNotCompleted race + SaveToFile off-thread
BUG 1 - syncNotCompletedPlayer race condition:
  syncNotCompletedPlayer.add() was inside the background thread body.
  A player disconnecting instantly before the thread starts bypasses
  the "sync not completed" guard in onPlayerLogout, causing store()
  to read invalid entity state.
  FIX: add() moved to onPlayerJoin BEFORE executorService.submit().

BUG 2 - doPlayerSaveToFile off main thread:
  onPlayerSaveToFile wrapped doPlayerSaveToFile in executorService,
  but SaveToFile already fires on the main thread. store() reads
  player inventory/armor/effects from a background thread = corruption.
  FIX: Call doPlayerSaveToFile directly (no executor). Same fix as
  auto-save and logout paths.

Vyrriox
2026-03-26 19:17:16 +01:00