Match messages in logs (every line must be present in the log output; copy from the "Messages before crash" column below — see the matching sketch after this form): | |
Match messages in full crash (every line must be present in the crash log output; copy from the "Full Crash" column below): | |
Limit to a test (copy from the "Failing Test" column below): | |
Delete these reports as invalid (e.g. a real bug in a patch under review, or similar) | |
Bug or comment: | |
Extra info: |
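The matching rule above is "every listed line must appear somewhere in the captured output". A minimal sketch of that rule is below, assuming plain substring matching against the console log; the script, its file arguments, and the helper name are hypothetical and not part of the triage tool itself.

```python
#!/usr/bin/env python3
"""Sketch of the matching rule: a report matches only if every listed line
is present (as a substring) in the captured log output.
Hypothetical helper; file names are illustrative only."""

import sys


def report_matches(required_lines, log_text):
    """Return True only when every non-empty required line occurs in log_text."""
    return all(line.strip() in log_text for line in required_lines if line.strip())


if __name__ == "__main__":
    # Hypothetical usage: match_check.py required_lines.txt console.log
    required_path, log_path = sys.argv[1], sys.argv[2]
    with open(required_path) as f:
        required = f.readlines()
    with open(log_path) as f:
        log_text = f.read()
    print("match" if report_matches(required, log_text) else "no match")
```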
Failing Test | Full Crash | Messages before crash | Comment |
---|---|---|---|
sanity-sec test 19: test nodemap trusted_admin fileops | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) dm_mod zfs(POE) spl(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev i2c_piix4 virtio_balloon pcspkr ext4 mbcache jbd2 ata_generic crc32c_intel ata_piix virtio_net libata serio_raw virtio_blk net_failover failover [last unloaded: obdecho] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.58.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb3360074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 00000000905cf867 RBX: ffffeac9024173c0 RCX: 0000000000000200 RDX: 7fffffff6fa30798 RSI: ffff90a6505cf000 RDI: ffff90a6643cf000 RBP: 000055f43a9cf000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 0000000000000000 R12: ffff90a6012d1e78 R13: ffff90a613268000 R14: ffffeac90290f3c0 R15: ffff90a5c5dca740 FS: 0000000000000000(0000) GS:ffff90a67fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff9aba24000 CR3: 0000000051810006 CR4: 00000000000606f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.58.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.*.identity_upcall=NONE Lustre: 258578:0:(mdt_lproc.c:311:identity_upcall_store()) lustre-MDT0001: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS" Lustre: 258578:0:(mdt_lproc.c:311:identity_upcall_store()) Skipped 1 previous similar message Lustre: 210773:0:(nodemap_handler.c:3055:nodemap_create()) adding nodemap 'c0' to config without default nodemap Lustre: 210773:0:(nodemap_handler.c:3055:nodemap_create()) Skipped 26 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c1.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c1.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c1.admin_nodemap | Link to test |
sanity-quota test 16b: lfs quota should skip the nonexistent MDT/OST | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 joydev pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata serio_raw virtio_net net_failover failover virtio_blk [last unloaded: dm_flakey] CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffc07300753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000003ca86867 RBX: ffffec46c0f2a180 RCX: 0000000000000200 RDX: 7fffffffc3579798 RSI: ffff9e41fca86000 RDI: ffff9e420c5f7000 RBP: 000055cbc1bf7000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffee R12: ffff9e41f557afb8 R13: ffff9e41f6655000 R14: ffffec46c1317dc0 R15: ffff9e41c5d7ae80 FS: 0000000000000000(0000) GS:ffff9e427fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f843eecc48 CR3: 000000000e810004 CR4: 00000000001706e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1 Lustre: lustre-MDT0000: Not available for connect from 10.240.30.80@tcp (stopping) Lustre: Skipped 11 previous similar messages Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping) Lustre: server umount lustre-MDT0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.30.80@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. 
LustreError: Skipped 18 previous similar messages LustreError: 107302:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) ldlm_cancel from 10.240.28.51@tcp arrived at 1752068514 with bad export cookie 8652668417692940817 LustreError: 107302:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) Skipped 4 previous similar messages LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.30.80@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 11 previous similar messages Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3 LustreError: 107302:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1752068521 with bad export cookie 8652668417692940075 LustreError: 166-1: MGC10.240.28.44@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail Lustre: lustre-MDT0002: Not available for connect from 10.240.28.51@tcp (stopping) Lustre: Skipped 5 previous similar messages Lustre: server umount lustre-MDT0002 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --index=0 --reformat /dev/mapper/mds1_flakey LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=remount-ro Autotest: Test running for 210 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=onyx-99vm1@tcp --fsname=lustre --mdt --index=2 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --index=100 --reformat /dev/ma LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1 Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1 Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1 LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=remount-ro LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc Lustre: Setting parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000 Lustre: Skipped 3 previous similar messages Lustre: ctl-lustre-MDT0000: No data found on store. 
Initialize space: rc = -61 Lustre: lustre-MDT0000: new disk, initializing Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180 Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: sync; sleep 1; sync Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null Lustre: Modifying parameter lustre-MDT0001.mdt.identity_upcall in log lustre-MDT0001 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm8.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-99vm8.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds3 Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds3_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds3_flakey 2>&1 Lustre: DEBUG MARKER: test -b /dev/mapper/mds3_flakey Lustre: DEBUG MARKER: e2label /dev/mapper/mds3_flakey Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds3; mount -t lustre -o localrecov /dev/mapper/mds3_flakey /mnt/lustre-mds3 LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=remount-ro LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc Lustre: Modifying parameter lustre-MDT0064.mdt.identity_upcall in log lustre-MDT0064 Lustre: 706434:0:(mgc_request.c:1926:mgc_llog_local_copy()) MGC10.240.28.44@tcp: no remote llog for lustre-sptlrpc, check MGS config Lustre: srv-lustre-MDT0064: No data found on store. 
Initialize space: rc = -61 Lustre: Skipped 1 previous similar message Lustre: lustre-MDT0064: new disk, initializing Lustre: lustre-MDT0064: Imperative Recovery not enabled, recovery window 60-180 Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000280000400-0x00000002c0000400]:64:mdt Lustre: Skipped 1 previous similar message Lustre: cli-ctl-lustre-MDT0064: Allocated super-sequence [0x0000000280000400-0x00000002c0000400]:64:mdt] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: e2label /dev/mapper/mds3_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: sync; sleep 1; sync Lustre: DEBUG MARKER: e2label /dev/mapper/mds3_flakey 2>/dev/null Lustre: lustre-OST0000-osc-MDT0000: update sequence from 0x100000000 to 0x2c0000401 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0064-osc-MDT0000: update sequence from 0x100640000 to 0x300000401 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_* Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osp.*.destroys_in_flight Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_* Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_* Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_* Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_* Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_* Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_* Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osp.*.destroys_in_flight Lustre: DEBUG MARKER: lctl set_param -n os[cd]*.*MDT*.force_sync=1 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: lfs --list-commands Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: lfs --list-commands Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' 
' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1 Lustre: lustre-MDT0000: Not available for connect from 10.240.28.51@tcp (stopping) Lustre: Skipped 16 previous similar messages Lustre: server umount lustre-MDT0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.30.80@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 33 previous similar messages Lustre: DEBUG MARKER: modprobe dm-flakey; LustreError: 704436:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) ldlm_cancel from 10.240.28.51@tcp arrived at 1752069008 with bad export cookie 8652668417695125167 LustreError: 704436:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) Skipped 8 previous similar messages Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3 Lustre: 13559:0:(client.c:2355:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1752069008/real 1752069008] req@000000000bc8652e x1837164635120448/t0(0) o400->MGC10.240.28.44@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1752069015 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker.0' LustreError: 166-1: MGC10.240.28.44@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Autotest: Test running for 215 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-145vm9.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: onyx-145vm9.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm8.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: onyx-99vm8.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: [ -e /dev/mapper/mds1_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --reformat /dev/mapper/mds1_flakey LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. 
Opts: errors=remount-ro Autotest: Test running for 220 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) Lustre: DEBUG MARKER: [ -e /dev/mapper/mds3_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=onyx-99vm1@tcp --fsname=lustre --mdt --index=2 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --reformat /dev/mapper/mds3_fl Autotest: Test running for 225 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) INFO: task mke2fs:713312 blocked for more than 120 seconds. Tainted: G OE -------- - - 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:mke2fs state:D stack:0 pid:713312 ppid:713311 flags:0x00004080 Call Trace: __schedule+0x2d1/0x870 ? blk_flush_plug_list+0xd7/0x100 ? wbt_exit+0x30/0x30 ? __wbt_done+0x40/0x40 schedule+0x55/0xf0 io_schedule+0x12/0x40 rq_qos_wait+0xb3/0x130 ? karma_partition+0x1f0/0x1f0 ? wbt_exit+0x30/0x30 wbt_wait+0x96/0xc0 __rq_qos_throttle+0x23/0x40 blk_mq_make_request+0x131/0x5c0 generic_make_request_no_check+0xe1/0x330 submit_bio+0x3c/0x160 blk_next_bio+0x33/0x40 __blkdev_issue_zero_pages+0x90/0x190 blkdev_issue_zeroout+0xef/0x222 blkdev_fallocate+0x13f/0x1a0 vfs_fallocate+0x140/0x280 ksys_fallocate+0x3c/0x80 __x64_sys_fallocate+0x1a/0x30 do_syscall_64+0x5b/0x1a0 entry_SYSCALL_64_after_hwframe+0x66/0xcb RIP: 0033:0x7f18c8bfa62b Code: Unable to access opcode bytes at RIP 0x7f18c8bfa601. RSP: 002b:00007ffda4b24558 EFLAGS: 00000246 ORIG_RAX: 000000000000011d RAX: ffffffffffffffda RBX: 00005614afb868e0 RCX: 00007f18c8bfa62b RDX: 000000004f512000 RSI: 0000000000000010 RDI: 0000000000000003 RBP: 000000004f512000 R08: 00007ffda4b246bc R09: 0000000000000000 R10: 00000000116b8000 R11: 0000000000000246 R12: 00000000116b8000 R13: 0000000000000003 R14: 00000000000116b8 R15: 00005614afb86760 Autotest: Test running for 230 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=remount-ro Autotest: Test running for 240 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) Autotest: Test running for 245 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) Autotest: Test running for 250 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28) | Link to test |
sanityn test complete, duration 6169 sec | watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache i2c_piix4 virtio_balloon intel_rapl_msr intel_rapl_common joydev pcspkr crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_blk serio_raw virtio_net net_failover failover [last unloaded: libcfs] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.53.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffae2280753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 00000000132b0845 RBX: ffffdc49804cac00 RCX: 0000000000000200 RDX: 7fffffffecd4f7ba RSI: ffff8fd3932b0000 RDI: ffff8fd3f6cdb000 RBP: 0000565121adb000 R08: 00000000000396d8 R09: 00000000000396d0 R10: 0000000000000007 R11: 00000000fffffff9 R12: ffff8fd383cd56d8 R13: ffff8fd39c04a800 R14: ffffdc4981db36c0 R15: ffff8fd3b533a828 FS: 0000000000000000(0000) GS:ffff8fd43fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f260b90a3c8 CR3: 000000001a610006 CR4: 00000000000606f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 ? mutex_lock+0xe/0x30 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.53.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark === sanityn: start cleanup 19:21:11 \(1750620071\) === Lustre: DEBUG MARKER: === sanityn: start cleanup 19:21:11 (1750620071) === Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts); LustreError: 757794:0:(lov_obd.c:784:lov_cleanup()) lustre-clilov-ffff8fd382c68800: lov tgt 0 not cleaned! 
deathrow=0, lovrc=1 LustreError: 757794:0:(obd_class.h:481:obd_check_dev()) Device 28 not setup Lustre: Unmounted lustre-client Lustre: DEBUG MARKER: /usr/sbin/lctl mark === sanityn: finish cleanup 19:21:58 \(1750620118\) === Lustre: DEBUG MARKER: === sanityn: finish cleanup 19:21:58 (1750620118) === Lustre: 749114:0:(client.c:2453:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1750620139/real 0] req@ffff8fd383ec9380 x1835657860501376/t0(0) o400->lustre-OST0000-osc-ffff8fd382c6b800@10.240.43.221@tcp:28/4 lens 224/224 e 0 to 1 dl 1750620155 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 projid:4294967295 Lustre: 749114:0:(client.c:2453:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: lustre-OST0000-osc-ffff8fd382c6b800: Connection to lustre-OST0000 (at 10.240.43.221@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: MGC10.240.40.114@tcp: Connection to MGS (at 10.240.40.114@tcp) was lost; in progress operations using this service will fail | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc dm_mod ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover virtio_blk failover CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffa06dc0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000012dba1865 RBX: ffffdf6d04b6e840 RCX: 0000000000000200 RDX: 7ffffffed245e79a RSI: ffff918cadba1000 RDI: ffff918c1b879000 RBP: 000055e720679000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff7 R12: ffff918c8479c3c8 R13: ffff918cbbd8a800 R14: ffffdf6d026e1e40 R15: ffff918c8299e910 FS: 0000000000000000(0000) GS:ffff918cbbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f68d156c024 CR3: 000000009ae10006 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 20 minutes (lustre-master_full-part-1_4624.118) Autotest: Test running for 25 minutes (lustre-master_full-part-1_4624.118) Autotest: Test running for 30 minutes (lustre-master_full-part-1_4624.118) Autotest: Test running for 35 minutes (lustre-master_full-part-1_4624.118) Autotest: Test running for 40 minutes (lustre-master_full-part-1_4624.118) Autotest: Test running for 45 minutes (lustre-master_full-part-1_4624.118) Autotest: Test running for 55 minutes (lustre-master_full-part-1_4624.118) Lustre: 7633:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1745406313/real 1745406313] req@00000000ec651fcf x1830187634946496/t0(0) o400->MGC10.240.27.90@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1745406320 ref 1 fl Rpc:RXNQ/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' LustreError: 166-1: MGC10.240.27.90@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.240.27.84@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: MGS: Client e6a52ee6-f037-47b7-a199-72f5f3e858db (at 0@lo) reconnecting Lustre: lustre-MDT0000: Received new LWP connection from 0@lo, keep former export from same NID Lustre: lustre-MDT0000-lwp-MDT0000: Connection restored to 10.240.27.90@tcp (at 0@lo) Lustre: lustre-OST0000-osc-MDT0000: Connection restored to (at 10.240.27.84@tcp) Lustre: Skipped 1 previous similar message | Link to test |
sanity test 900: umount should not race with any mgc requeue thread | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:35] Modules linked in: dm_flakey osp(OE) ofd(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 intel_rapl_msr dns_resolver nfs lockd grace fscache intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 joydev sunrpc pcspkr dm_mod ext4 mbcache jbd2 ata_generic crc32c_intel serio_raw ata_piix virtio_net libata virtio_blk net_failover failover [last unloaded: obdecho] CPU: 0 PID: 35 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffb6948075bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000059c00867 RBX: fffffa8481670000 RCX: 0000000000000200 RDX: 7fffffffa63ff798 RSI: ffff92e2d9c00000 RDI: ffff92e293d50000 RBP: 000055f7e3d50000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffffe R12: ffff92e2c8a1fa80 R13: ffff92e2cc662800 R14: fffffa84804f5400 R15: ffff92e2b52622b8 FS: 0000000000000000(0000) GS:ffff92e33fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f7e3e101f0 CR3: 000000004ac10005 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 35 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 rcu: INFO: rcu_sched detected stalls on CPUs/tasks: rcu: 0-...0: (20250 ticks this GP) idle=542/1/0x4000000000000002 softirq=7342043/7342043 fqs=1902 (detected by 1, t=60002 jiffies, g=10469413, q=1002) hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 Sending NMI from CPU 1 to CPUs 0: apic_timer_interrupt+0xf/0x20 NMI backtrace for cpu 0 CPU: 0 PID: 35 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 </IRQ> RIP: 0010:__orc_find+0x44/0x80 | Autotest: Test running for 385 minutes (lustre-master_rolling-upgrade-mds_4622.145) Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.240.28.195@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 6 previous similar messages LustreError: MGC10.240.28.195@tcp: Connection to MGS (at 10.240.28.195@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.28.195@tcp (at 10.240.28.195@tcp) Lustre: Skipped 7 previous similar messages Lustre: Evicted from MGS (at 10.240.28.195@tcp) after server handle changed from 0x6e639d631a8049c to 0x6e639d631a9092b Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-106vm12.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-106vm12.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0003: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x300000400:75022 to 0x300000400:75105) Lustre: lustre-OST0000: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x240000400:240362 to 0x240000400:249377) Lustre: lustre-OST0001: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x280000400:75064 to 0x280000400:75169) Lustre: lustre-OST0002: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x2c0000400:74989 to 0x2c0000400:75073) Lustre: lustre-OST0005: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x380000400:74979 to 0x380000400:75009) Lustre: lustre-OST0004: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x340000400:74957 to 0x340000400:75041) Lustre: lustre-OST0006: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x3c0000400:74979 to 0x3c0000400:75009) Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm4.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm5.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-46vm5.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-46vm4.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982356/real 1744982356] req@00000000bbcb7d2c x1829702233358848/t0(0) o400->lustre-MDT0000-lwp-OST0006@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 1744982401 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982361/real 1744982361] req@0000000006b7951d x1829702233359232/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 1744982406 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-MDT0000-lwp-OST0004: Connection to lustre-MDT0000 (at 10.240.28.195@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 5 previous similar messages Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1 LustreError: 1251351:0:(client.c:1282:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@000000008b84d2ed x1829702233388928/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.240.28.195@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:QU/200/ffffffff rc 0/-1 job:'qsd_reint_0.lus.0' uid:0 gid:0 LustreError: 1251351:0:(qsd_reint.c:38:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5 Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982443/real 1744982443] req@000000002dcadbe5 x1829702233386752/t0(0) o400->MGC10.240.28.195@tcp@10.240.28.195@tcp:26/25 lens 224/224 e 0 to 1 dl 1744982459 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 13 previous similar messages LustreError: MGC10.240.28.195@tcp: Connection to MGS (at 10.240.28.195@tcp) was lost; in progress operations using this service will fail LustreError: 1251348:0:(obd_class.h:479:obd_check_dev()) Device 4 not setup LustreError: 1251348:0:(obd_class.h:479:obd_check_dev()) Skipped 1 previous similar message Lustre: server umount lustre-OST0000 complete Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982443/real 1744982443] req@00000000872a75be x1829702233387392/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 1744982488 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: lustre-MDT0000-lwp-OST0005: Connection to lustre-MDT0000 (at 10.240.28.195@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 2 previous similar messages Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982478/real 1744982478] req@00000000e9775cea x1829702233391488/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 
1744982523 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 8 previous similar messages Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup table /dev/mapper/ost1_flakey Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/ost1_flakey Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1 | Link to test |
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev i2c_piix4 pcspkr virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover failover serio_raw virtio_blk CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-553.46.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb0c100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000011addb867 RBX: fffff094c46b76c0 RCX: 0000000000000200 RDX: 7ffffffee5224798 RSI: ffff9b6e5addb000 RDI: ffff9b6e43c8d000 RBP: 000056361488d000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffffc R12: ffff9b6e501a2468 R13: ffff9b6e7bd88000 R14: fffff094c40f2340 R15: ffff9b6e5f174910 FS: 0000000000000000(0000) GS:ffff9b6e7bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055728771cbfc CR3: 0000000010c10006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | ||
sanity-pcc test 3a: Repeat attach/detach operations | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:36] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon joydev pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover virtio_blk failover CPU: 1 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.46.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa4664075bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000043d48867 RBX: ffffd9dc010f5200 RCX: 0000000000000200 RDX: 7fffffffbc2b7798 RSI: ffff92bd83d48000 RDI: ffff92bd85d7f000 RBP: 000055b09f57f000 R08: ffff92bdffd00000 R09: ffffffffad5c6880 R10: 0000000000000007 R11: 00000000ffffffeb R12: ffff92bd7ae98bf8 R13: ffff92bdffc50000 R14: ffffd9dc01175fc0 R15: ffff92bd7aeb5cb0 FS: 0000000000000000(0000) GS:ffff92bdffd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff507f8b78 CR3: 000000004d010006 CR4: 00000000001706e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.46.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl set_param debug=-1 debug_mb=150 LustreError: 64168:0:(mdt_open.c:1883:mdt_orphan_open()) lustre-MDT0001: cannot create volatile file [0x2400032e3:0x3:0x0]: rc = -11 LustreError: 64168:0:(mdt_open.c:2133:mdt_hsm_release()) lustre-MDT0001: cannot open orphan file [0x2400032e3:0x3:0x0]: rc = -11 Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-pcc test_3a: @@@@@@ FAIL: failed to attach file \/mnt\/lustre\/d3a.sanity-pcc\/f3a.sanity-pcc Lustre: DEBUG MARKER: sanity-pcc test_3a: @@@@@@ FAIL: failed to attach file /mnt/lustre/d3a.sanity-pcc/f3a.sanity-pcc Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2025-04-09/lustre-b_es-reviews_review-dne-exa6-part-1_22924_72_0c4f65af-8d13-4971-abc6-a6eadfa505a4//sanity-pcc.test_3a.debug_log.$(hostname -s).1744205437.log; Autotest: Test running for 40 minutes (lustre-b_es-reviews_review-dne-exa6-part-1_22924.72) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34] Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul virtio_balloon ghash_clmulni_intel joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw virtio_blk failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffbed840753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000001f6bb865 RBX: fffff169007daec0 RCX: 0000000000000200 RDX: 7fffffffe094479a RSI: ffff9d779f6bb000 RDI: ffff9d77abd9e000 RBP: 0000562c5539e000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffffa R12: ffff9d77850e0cf0 R13: ffff9d77a2c6d000 R14: fffff16900af6780 R15: ffff9d7784568000 FS: 0000000000000000(0000) GS:ffff9d783fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe3b358f024 CR3: 0000000021210002 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_112217.11) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_112217.11) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_112217.11) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_112217.11) Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_112217.11) Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_112217.11) Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_112217.11) | Link to test |
sanity test 314: OSP shouldn't fail after last_rcvd update failure | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [khugepaged:34] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) ib_core tcp_diag inet_diag loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover [last unloaded: libcfs] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 1.16.0-4.module+el8.8.0+1454+0b2cbfb8 04/01/2014 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffad1800753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000021646865 RBX: ffffcdc900859180 RCX: 0000000000000200 RDX: 7fffffffde9b979a RSI: ffff9fcba1646000 RDI: ffff9fcb92bac000 RBP: 0000564185dac000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff0 R12: ffff9fcb8447bd60 R13: ffff9fcbe765a800 R14: ffffcdc9004aeb00 R15: ffff9fcbb5951ae0 FS: 0000000000000000(0000) GS:ffff9fcc3fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005641864c68e0 CR3: 0000000065c10005 CR4: 0000000000170ef0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 1.16.0-4.module+el8.8.0+1454+0b2cbfb8 04/01/2014 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 160 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 165 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 170 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 175 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 180 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 185 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 190 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 195 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 200 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 205 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 210 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 215 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 220 minutes (lustre-reviews_custom_112026.1004) Autotest: Test running for 225 minutes (lustre-reviews_custom_112026.1004) Lustre: 468035:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1743103738/real 0] req@ffff9fcbef680000 x1827770860516864/t0(0) o400->lustre-MDT0000-mdc-ffff9fcb8601b800@10.240.26.6@tcp:12/10 lens 224/224 e 0 to 1 dl 1743103754 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: lustre-MDT0000-mdc-ffff9fcb8601b800: Connection to lustre-MDT0000 (at 10.240.26.6@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: MGC10.240.26.6@tcp: Connection to MGS (at 10.240.26.6@tcp) was lost; in progress operations using this service will fail | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr virtio_balloon intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev pcspkr sunrpc ext4 mbcache jbd2 ata_generic crc32c_intel ata_piix virtio_net serio_raw libata virtio_blk net_failover failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.40.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb4070074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000019441867 RBX: ffffed2940651040 RCX: 0000000000000200 RDX: 7fffffffe6bbe798 RSI: ffff9b49d9441000 RDI: ffff9b4a183b3000 RBP: 000056237cbb3000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffe9 R12: ffff9b49fb423d98 R13: ffff9b4a5da78000 R14: ffffed294160ecc0 R15: ffff9b49c62de910 FS: 0000000000000000(0000) GS:ffff9b4a7fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056238066ce88 CR3: 000000009c010004 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.40.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_111584.11) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_111584.11) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_111584.11) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_111584.11) Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_111584.11) Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_111584.11) Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_111584.11) | Link to test |
lustre-rsync-test test 3b: Replicate files created by writemany | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 sunrpc joydev pcspkr ext4 mbcache jbd2 ata_generic virtio_net crc32c_intel ata_piix libata serio_raw virtio_blk net_failover failover CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff9b7780753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000001ef27867 RBX: ffffcc75007bc9c0 RCX: 0000000000000200 RDX: 7fffffffe10d8798 RSI: ffff899d1ef27000 RDI: ffff899d31bdf000 RBP: 000055e8301df000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffef R12: ffff899d05b40ef8 R13: ffff899d95448000 R14: ffffcc7500c6f7c0 R15: ffff899d30762e80 FS: 0000000000000000(0000) GS:ffff899dbfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f119fcc8024 CR3: 0000000093a10003 CR4: 00000000000606f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 25 minutes (lustre-reviews_review-dne-zfs-part-5_109672.16) Lustre: lustre-OST0000-osc-ffff899d04463800: disconnect after 27s idle Lustre: Skipped 3 previous similar messages Autotest: Test running for 30 minutes (lustre-reviews_review-dne-zfs-part-5_109672.16) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-zfs-part-5_109672.16) Lustre: 8596:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734193367/real 0] req@ffff899d2efc8d00 x1818431180456064/t0(0) o400->lustre-MDT0000-mdc-ffff899d04463800@10.240.26.26@tcp:12/10 lens 224/224 e 0 to 1 dl 1734193383 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: lustre-MDT0000-mdc-ffff899d04463800: Connection to lustre-MDT0000 (at 10.240.26.26@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: MGC10.240.26.26@tcp: Connection to MGS (at 10.240.26.26@tcp) was lost; in progress operations using this service will fail Lustre: 8595:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734193372/real 0] req@ffff899d2efc8340 x1818431180456704/t0(0) o400->lustre-MDT0000-mdc-ffff899d04463800@10.240.26.26@tcp:12/10 lens 224/224 e 0 to 1 dl 1734193388 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 8595:0:(client.c:2364:ptlrpc_expire_one_request()) Skipped 2 previous similar messages | Link to test |
conf-sanity test 56a: check big OST indexes and out-of-index-order start | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr i2c_piix4 virtio_balloon intel_rapl_common crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover failover virtio_blk serio_raw [last unloaded: libcfs] CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffbc4c00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000003c583867 RBX: ffffebccc0f160c0 RCX: 0000000000000200 RDX: 7fffffffc3a7c798 RSI: ffff9c86bc583000 RDI: ffff9c86daf83000 RBP: 0000556609d83000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 0000000000000000 R12: ffff9c86c29b3c18 R13: ffff9c86a9a60000 R14: ffffebccc16be0c0 R15: ffff9c8683a292b8 FS: 0000000000000000(0000) GS:ffff9c873fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f32fb6b9d05 CR3: 0000000028010004 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: [ -e /dev/mapper/ost1_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. 
Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost2_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=1 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost2_flakey LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost3_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=2 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost3_flakey LDISKFS-fs (dm-13): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost4_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=3 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost4_flakey LDISKFS-fs (dm-14): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost5_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=4 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost5_flakey LDISKFS-fs (dm-15): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost6_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=5 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost6_flakey LDISKFS-fs (dm-16): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost7_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=6 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost7_flakey LDISKFS-fs (dm-17): mounted filesystem with ordered data mode. 
Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost8_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=7 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost8_flakey LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --index=10000 --reformat /dev/mapper/ost1_flakey LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=1 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --index=1000 --reformat /dev/mapper/ost2_flakey LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1 Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey 2>&1 Lustre: DEBUG MARKER: test -b /dev/mapper/ost1_flakey Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre -o localrecov /dev/mapper/ost1_flakey /mnt/lustre-ost1 LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: user_xattr,acl,no_mbcache,nodelalloc Lustre: 631343:0:(mgc_request_server.c:553:mgc_llog_local_copy()) MGC10.240.43.6@tcp: no remote llog for lustre-sptlrpc, check MGS config Lustre: lustre-OST2710: new disk, initializing Lustre: srv-lustre-OST2710: No data found on store. Initialize space. 
Lustre: lustre-OST2710: Imperative Recovery not enabled, recovery window 60-180 Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST2710-super.width=16384 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: sync; sleep 1; sync Lustre: cli-lustre-OST2710-super: Allocated super-sequence [0x0000000300000400-0x0000000340000400]:2710:ost] Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST2710-osc-[-0-9a-f]\*.ost_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST2710-osc-[-0-9a-f]\*.ost_server_uuid Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST2710-osc-[-0-9a-f]*.ost_server_uuid Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST2710-osc-[-0-9a-f]*.ost_server_uuid Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2 Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost2_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost2_flakey 2>&1 Lustre: DEBUG MARKER: test -b /dev/mapper/ost2_flakey Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre -o localrecov /dev/mapper/ost2_flakey /mnt/lustre-ost2 LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: user_xattr,acl,no_mbcache,nodelalloc Lustre: lustre-OST03e8: new disk, initializing Lustre: srv-lustre-OST03e8: No data found on store. Initialize space. 
Lustre: cli-lustre-OST03e8-super: Allocated super-sequence [0x0000000340000400-0x0000000380000400]:3e8:ost] Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST03e8-super.width=16384 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST03e8-osc-[-0-9a-f]\*.ost_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST03e8-osc-[-0-9a-f]\*.ost_server_uuid Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST03e8-osc-[-0-9a-f]*.ost_server_uuid Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST03e8-osc-[-0-9a-f]*.ost_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL 
os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid in FULL state after 0 sec Lustre: lustre-MDT0000-lwp-OST2710: Connection to lustre-MDT0000 (at 10.240.43.6@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: lustre-MDT0002-lwp-OST2710: Connection to lustre-MDT0002 (at 10.240.43.6@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 3 previous similar messages Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1 Lustre: 583597:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1734163919/real 1734163919] req@00000000dc73e880 x1818401634843648/t0(0) o400->MGC10.240.43.6@tcp@10.240.43.6@tcp:26/25 lens 224/224 e 0 to 1 dl 1734163935 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0 LustreError: MGC10.240.43.6@tcp: Connection to MGS (at 10.240.43.6@tcp) was lost; in progress operations using this service will fail Autotest: Test running for 265 minutes 
(lustre-reviews_review-dne-part-3_109662.5) Lustre: server umount lustre-OST2710 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost2 Lustre: server umount lustre-OST03e8 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: 
trevis-84vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_hostid Lustre: DEBUG MARKER: [ -e /dev/mapper/ost1_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost2_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=1 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost2_flakey LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost3_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=2 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost3_flakey LDISKFS-fs (dm-13): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost4_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=3 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost4_flakey LDISKFS-fs (dm-14): mounted filesystem with ordered data mode. Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost5_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=4 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost5_flakey LDISKFS-fs (dm-15): mounted filesystem with ordered data mode. Opts: errors=remount-ro Autotest: Test running for 270 minutes (lustre-reviews_review-dne-part-3_109662.5) Lustre: DEBUG MARKER: [ -e /dev/mapper/ost6_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=5 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost6_flakey LDISKFS-fs (dm-16): mounted filesystem with ordered data mode. 
Opts: errors=remount-ro Lustre: DEBUG MARKER: [ -e /dev/mapper/ost7_flakey ] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=6 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost7_flakey LDISKFS-fs (dm-17): mounted filesystem with ordered data mode. Opts: errors=remount-ro | Link to test |
sanity test 256: Check llog delete for empty and not full state | watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34] Modules linked in: dm_flakey obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_blk [last unloaded: dm_flakey] CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OE -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffbd34c074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000006c10b867 RBX: ffffdd7701b042c0 RCX: 0000000000000200 RDX: 7fffffff93ef4798 RSI: ffff93b6ac10b000 RDI: ffff93b6b1cc7000 RBP: 000055fa91ec7000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff4 R12: ffff93b686de9638 R13: ffff93b6dac70000 R14: ffffdd7701c731c0 R15: ffff93b679568bc8 FS: 0000000000000000(0000) GS:ffff93b6ffd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055fa9417a478 CR3: 0000000099210004 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OEL -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users Lustre: DEBUG MARKER: /usr/sbin/lctl get_param mdd.lustre-MDT0000.changelog_mask -n Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.changelog_mask=+hsm Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 changelog register -n Lustre: lustre-MDD0000: changelog on Autotest: Test running for 245 minutes (lustre-reviews_review-ldiskfs-ubuntu_109580.43) | Link to test |
sanity test 65k: validate manual striping works properly with deactivated OSCs | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel virtio_net libata net_failover failover virtio_blk serio_raw CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffabb340753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002336c867 RBX: ffffd2c3008cdb00 RCX: 0000000000000200 RDX: 7fffffffdcc93798 RSI: ffff9b03e336c000 RDI: ffff9b03eabcc000 RBP: 000055e3561cc000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffe9 R12: ffff9b03d026ae60 R13: ffff9b040024d000 R14: ffffd2c300aaf300 R15: ffff9b03c36473a0 FS: 0000000000000000(0000) GS:ffff9b047fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa1097f3420 CR3: 000000003e810005 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 60 minutes (lustre-reviews_review-dne-zfs-part-1_109517.12) Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0 Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=trace+inode+super+iotrace+malloc+cache+info+ioctl+neterror+net+warning+buffs+other+dentry+nettrace+page+dlmtrace+error+emerg+ha+rpctrace+vfstrace+reada+mmap+config+console+quota+sec+lfsck+hsm+snapshot+layout Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0 Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=trace+inode+super+iotrace+malloc+cache+info+ioctl+neterror+net+warning+buffs+other+dentry+nettrace+page+dlmtrace+error+emerg+ha+rpctrace+vfstrace+reada+mmap+config+console+quota+sec+lfsck+hsm+snapshot+layout Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 50 Lustre: DEBUG MARKER: trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 50 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0 Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=trace+inode+super+iotrace+malloc+cache+info+ioctl+neterror+net+warning+buffs+other+dentry+nettrace+page+dlmtrace+error+emerg+ha+rpctrace+vfstrace+reada+mmap+config+console+quota+sec+lfsck+hsm+snapshot+layout Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0 | Link to test |
parallel-scale test write_disjoint_tiny: write_disjoint_tiny | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: ib_core mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr joydev virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_blk serio_raw virtio_net net_failover failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb3b540753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 00000000085fc867 RBX: ffffdfb200217f00 RCX: 0000000000000200 RDX: 7ffffffff7a03798 RSI: ffff8ec8085fc000 RDI: ffff8ec813ac8000 RBP: 00007f9cc0ac8000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff3 R12: ffff8ec92898c640 R13: ffff8ec93bc40000 R14: ffffdfb2004eb200 R15: ffff8ec939f0b570 FS: 0000000000000000(0000) GS:ffff8ec93bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9cc0379024 CR3: 0000000107810006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 395 minutes (lustre-master_full-part-1_4595.1) Autotest: Test running for 400 minutes (lustre-master_full-part-1_4595.1) Autotest: Test running for 405 minutes (lustre-master_full-part-1_4595.1) | Link to test |
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev pcspkr virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic virtio_net ata_piix net_failover crc32c_intel libata failover virtio_blk serio_raw CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-513.24.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffbbf580753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 00000000113ad825 RBX: fffff975c044eb40 RCX: 0000000000000200 RDX: 7fffffffeec527da RSI: ffffa094113ad000 RDI: ffffa0940e5ad000 RBP: 00005602a25ad000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 0000000000000000 R12: ffffa094047cdd68 R13: ffffa0947fc4a800 R14: fffff975c0396b40 R15: ffffa0940431a2b8 FS: 0000000000000000(0000) GS:ffffa094bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb66f573000 CR3: 000000007d610004 CR4: 00000000001706f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | ||
sanity test 77l: preferred checksum type is remembered after reconnected | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: obdecho(OE) ptlrpc_gss(OE) osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover [last unloaded: lnet_selftest] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffb3bd80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000055ae7865 RBX: ffffeb784156b9c0 RCX: 0000000000000200 RDX: 7fffffffaa51879a RSI: ffff8cda15ae7000 RDI: ffff8cda5c3b6000 RBP: 000055cc40db6000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffff R12: ffff8cda4e37adb0 R13: ffff8cd9e0450000 R14: ffffeb784270ed80 R15: ffff8cd9c5bbd488 FS: 0000000000000000(0000) GS:ffff8cda7fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f4d0a5fced0 CR3: 000000001de10006 CR4: 00000000001706f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to invalid, rc = 22 Lustre: DEBUG MARKER: set checksum type to invalid, rc = 22 Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to crc32, rc = 0 Lustre: DEBUG MARKER: set checksum type to crc32, rc = 0 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 0 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to adler, rc = 0 Lustre: DEBUG MARKER: set checksum type to adler, rc = 0 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Autotest: Test running for 120 minutes (lustre-reviews_review-ldiskfs-dne-arm_108974.57) Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 17 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 17 sec Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to crc32c, rc = 0 Lustre: DEBUG MARKER: set checksum type to crc32c, rc = 0 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: 
trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to t10ip512, rc = 0 Lustre: DEBUG MARKER: set checksum type to t10ip512, rc = 0 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to t10ip4K, rc = 0 Lustre: DEBUG MARKER: set checksum type to t10ip4K, rc = 0 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec | Link to test |
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul i2c_piix4 ghash_clmulni_intel virtio_balloon joydev pcspkr sunrpc dm_mod ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net virtio_blk net_failover failover CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-513.5.1.el8_9.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 80 44 40 00 9d 30 c0 e9 78 44 40 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 61 44 40 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff9ba940753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000025948867 RBX: fffff0c280965200 RCX: 0000000000000200 RDX: 7fffffffda6b7798 RSI: ffff88be65948000 RDI: ffff88beff1fb000 RBP: 000055dbf39fb000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffec R12: ffff88be45c1bfd8 R13: ffff88be5c47d000 R14: fffff0c282fc7ec0 R15: ffff88be44ee1e80 FS: 0000000000000000(0000) GS:ffff88beffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f45e8010de8 CR3: 0000000019e10005 CR4: 00000000001706f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L --------- - - 4.18.0-513.5.1.el8_9.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | ||
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_mod sd_mod t10_pi sg iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sunrpc pcspkr joydev virtio_balloon i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata virtio_net net_failover virtio_blk failover serio_raw CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa64700753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000013cb7867 RBX: ffffedbb004f2dc0 RCX: 0000000000000200 RDX: 7fffffffec348798 RSI: ffff97a513cb7000 RDI: ffff97a5424c4000 RBP: 00005645238c4000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff7 R12: ffff97a5061dd620 R13: ffff97a556260000 R14: ffffedbb01093100 R15: ffff97a5037d4000 FS: 0000000000000000(0000) GS:ffff97a5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055ec00718a0c CR3: 0000000054810004 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | ||
sanity test 39k: write, utime, close, stat | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr i2c_piix4 intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon joydev pcspkr ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb74b80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000005b31a867 RBX: ffffe906016cc680 RCX: 0000000000000200 RDX: 7fffffffa4ce5798 RSI: ffff8af05b31a000 RDI: ffff8af04f37e000 RBP: 00005573f077e000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffec R12: ffff8af022c33bf0 R13: ffff8af029052800 R14: ffffe906013cdf80 R15: ffff8af02da88570 FS: 0000000000000000(0000) GS:ffff8af0bfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005573f1ce6038 CR3: 0000000026a10003 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 130 minutes (lustre-reviews_review-dne-zfs-part-4_108666.33) Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_39k: @@@@@@ FAIL: mtime is lost on close: 1730302440, should be 1698679567 Lustre: DEBUG MARKER: sanity test_39k: @@@@@@ FAIL: mtime is lost on close: 1730302440, should be 1698679567 Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-2/2024-10-30/lustre-reviews_review-dne-zfs-part-4_108666_33_4a560301-b4bd-4ca3-9f01-ca4d8691e89b//sanity.test_39k.debug_log.$(hostname -s).1730302589.log; Autotest: Test running for 135 minutes (lustre-reviews_review-dne-zfs-part-4_108666.33) | Link to test |
sanity-compr test iozone: iozone | watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34] Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw(OE) intel_rapl_msr intel_rapl_common tls crct10dif_pclmul pci_hyperv_intf crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon pcspkr joydev sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 6e 41 00 9d 30 c0 e9 98 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffbe2d80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002aebd865 RBX: ffffe34fc0abaf40 RCX: 0000000000000200 RDX: 7fffffffd514279a RSI: ffff99bd6aebd000 RDI: ffff99bd6cee4000 RBP: 0000562030ae4000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff0 R12: ffff99bd6291e720 R13: ffff99bdffc52800 R14: ffffe34fc0b3b900 R15: ffff99bd44475e80 FS: 0000000000000000(0000) GS:ffff99bdffd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055555711c928 CR3: 0000000028210005 CR4: 00000000001706e0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 7189292kB available, using 5444512kB file size Lustre: DEBUG MARKER: min OST has 7189292kB available, using 5444512kB file size Autotest: Test running for 445 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 450 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 455 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 460 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 465 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 470 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 475 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 480 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 485 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 490 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 495 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 500 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 505 minutes (lustre-b_es6_0_full-part-exa6_712.169) Autotest: Test running for 510 minutes (lustre-b_es6_0_full-part-exa6_712.169) | Link to test |
sanity-pcc test 1f: Test auto RW-PCC cache with non-root user | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:35] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon pcspkr joydev ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw failover virtio_blk CPU: 1 PID: 35 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffb96040753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000001cf4c867 RBX: ffffdeef8073d300 RCX: 0000000000000200 RDX: 7fffffffe30b3798 RSI: ffff8c1a9cf4c000 RDI: ffff8c1acdd65000 RBP: 00007f9527d65000 R08: ffffdeef80c6a008 R09: ffffdeef805531c8 R10: 0000000000000000 R11: 0000000000000009 R12: ffff8c1ac83e3b28 R13: ffff8c1ab6670000 R14: ffffdeef81375940 R15: ffff8c1ad17f9570 FS: 0000000000000000(0000) GS:ffff8c1b3fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f952c6d4000 CR3: 0000000021810001 CR4: 00000000001706e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 35 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: zpool get all Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1 Lustre: lustre-MDT0000: Not available for connect from 10.240.24.50@tcp (stopping) LustreError: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107 LustreError: Skipped 1 previous similar message Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 2 previous similar messages Lustre: server umount lustre-MDT0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && LustreError: 60757:0:(ldlm_lockd.c:2575:ldlm_cancel_handler()) ldlm_cancel from 10.240.28.49@tcp arrived at 1727086208 with bad export cookie 4081145193744231493 LustreError: 60757:0:(ldlm_lockd.c:2575:ldlm_cancel_handler()) Skipped 4 previous similar messages Lustre: 12409:0:(client.c:2363:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1727086195/real 1727086195] req@ffff8c1ad38ce3c0 x1810978681027840/t0(0) o400->MGC10.240.28.48@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1727086211 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 LustreError: MGC10.240.28.48@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail LustreError: lustre-MDT0000: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 113 previous similar messages Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3 Lustre: server umount lustre-MDT0002 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && | Link to test |
sanityn test 16k: Parallel FSX and drop caches should not panic | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev pcspkr virtio_balloon ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata serio_raw virtio_net net_failover failover virtio_blk CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff9e7540753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000004cff7867 RBX: ffffee29c133fdc0 RCX: 0000000000000200 RDX: 7fffffffb3008798 RSI: ffff88f04cff7000 RDI: ffff88f059a33000 RBP: 0000558c59633000 R08: ffff88f0bfd00000 R09: ffffffff8c3c6880 R10: 0000000000000007 R11: 00000000fffffff2 R12: ffff88f047e1b198 R13: ffff88f02484a800 R14: ffffee29c1668cc0 R15: ffff88f005c28000 FS: 0000000000000000(0000) GS:ffff88f0bfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055d17cb05fb0 CR3: 0000000022e10002 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 230 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41) Autotest: Test running for 235 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41) Autotest: Test running for 240 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41) Autotest: Test running for 245 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#1 stuck for 24s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss intel_rapl_msr nfsv4 intel_rapl_common crct10dif_pclmul crc32_pclmul dns_resolver nfs lockd grace ghash_clmulni_intel fscache joydev pcspkr virtio_balloon i2c_piix4 sunrpc ata_generic ext4 mbcache jbd2 ata_piix libata virtio_net crc32c_intel net_failover serio_raw failover virtio_blk CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa43780753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000139303867 RBX: ffffd6a944e4c0c0 RCX: 0000000000000200 RDX: 7ffffffec6cfc798 RSI: ffff8bb0b9303000 RDI: ffff8bb03fcf9000 RBP: 00007f5da06f9000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff4 R12: ffff8bb0861c97c8 R13: ffff8bb0bbd88000 R14: ffffd6a942ff3e40 R15: ffff8bb083842740 FS: 0000000000000000(0000) GS:ffff8bb0bbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055e6f592f000 CR3: 0000000026010001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? syscall_return_via_sysret+0x6e/0x94 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_107315.29) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_107315.29) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_107315.29) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_107315.29) Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_107315.29) Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_107315.29) Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_107315.29) Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_107315.29) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net net_failover crc32c_intel failover serio_raw virtio_blk CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb25640753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000014194867 RBX: ffffeec340506500 RCX: 0000000000000200 RDX: 7fffffffebe6b798 RSI: ffff99ad14194000 RDI: ffff99ad4986f000 RBP: 0000561bd1c6f000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff99ad03ead378 R13: ffff99ad5464d000 R14: ffffeec341261bc0 R15: ffff99ad05ea63a0 FS: 0000000000000000(0000) GS:ffff99adbfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005563520c03dc CR3: 0000000052c10003 CR4: 00000000000606f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 15 minutes (lustre-reviews_review-dne-part-9_106849.29) Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_106849.29) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_106849.29) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_106849.29) | Link to test |
sanity-sec test 27a: test fileset in various nodemaps | watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [khugepaged:36] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev virtio_balloon pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel virtio_blk serio_raw net_failover failover [last unloaded: dm_flakey] CPU: 0 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff9c2c8075bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000030fc2867 RBX: ffffec7140c3f080 RCX: 0000000000000200 RDX: 7fffffffcf03d798 RSI: ffff8c6530fc2000 RDI: ffff8c65617d0000 RBP: 00007f99e25d0000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffffe R12: ffff8c653fc63e80 R13: ffff8c6536678000 R14: ffffec714185f400 R15: ffff8c653f8c2e80 FS: 0000000000000000(0000) GS:ffff8c65bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000557e0c6ff730 CR3: 0000000024810002 CR4: 00000000001706f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.fileset Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.fileset Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.*.identity_upcall=NONE Lustre: 337649:0:(mdt_lproc.c:315:identity_upcall_store()) lustre-MDT0001: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS" Lustre: 337649:0:(mdt_lproc.c:315:identity_upcall_store()) Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap Lustre: DEBUG 
MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.fileset Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @ | Link to test |
sanity test 209: read-only open/close requests should be freed promptly | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:33] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 210: lfs getstripe does not break leases ============================================== 01:13:03 \(1723165983\) Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw virtio_blk net_failover failover CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 d0 6e 22 00 9d 30 c0 e9 c8 6e 22 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 b1 6e 22 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb5290074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000003ad41867 RBX: ffffdfab40eb5040 RCX: 0000000000000200 RDX: 0000000000000000 RSI: ffff97433ad41000 RDI: ffff97432f5ef000 RBP: 00007f34935ef000 R08: ffffdfab41364888 R09: ffff9743bffcf000 R10: 0000000000000002 R11: 0000000000000009 R12: ffff974304e6cf78 R13: ffff974380e50000 R14: ffffdfab40bd7bc0 R15: ffff9743482e2740 FS: 0000000000000000(0000) GS:ffff9743bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055ab7c5add1c CR3: 000000007f410006 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8e4/0x1010 khugepaged+0xed0/0x11e0 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x10b/0x130 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac Lustre: 47087:0:(client.c:2325:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1723166009/real 1723166009] req@00000000dded327b x1806857619793280/t0(0) o400->MGC10.240.24.153@tcp@10.240.24.153@tcp:26/25 lens 224/224 e 0 to 1 dl 1723166018 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker.0' Lustre: 47087:0:(client.c:2325:ptlrpc_expire_one_request()) Skipped 2 previous similar messages LustreError: 166-1: MGC10.240.24.153@tcp: Connection to MGS (at 10.240.24.153@tcp) was lost; in progress operations using this service will fail ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: 47086:0:(client.c:2325:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1723165587/real 1723165587] req@0000000071f8c5a8 x1806857619776256/t0(0) o400->lustre-MDT0000-mdc-ffff9743050f2000@10.240.24.153@tcp:12/10 lens 224/224 e 0 to 1 dl 1723165630 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker.0' Autotest: Test running for 205 minutes (lustre-b_es6_0_rolling-upgrade-client2_689.266) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache virtio_balloon intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover virtio_blk serio_raw failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb312c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000001c697867 RBX: ffffedbdc071a5c0 RCX: 0000000000000200 RDX: 7fffffffe3968798 RSI: ffff9dcd5c697000 RDI: ffff9dcd7470a000 RBP: 0000563fd630a000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffffc R12: ffff9dcd45102850 R13: ffff9dcdd8448000 R14: ffffedbdc0d1c280 R15: ffff9dcd45c3fcb0 FS: 0000000000000000(0000) GS:ffff9dcdffd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555556d50928 CR3: 0000000096a10003 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_106337.29) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_106337.29) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_106337.29) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_106337.29) Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_106337.29) Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_106337.29) Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_106337.29) Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_106337.29) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 ata_generic mbcache jbd2 ata_piix libata virtio_net crc32c_intel serio_raw net_failover virtio_blk failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa174c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002b087867 RBX: ffffce26c0ac21c0 RCX: 0000000000000200 RDX: 7fffffffd4f78798 RSI: ffff94fd2b087000 RDI: ffff94fd2708d000 RBP: 00005568e088d000 R08: ffff94fdbfd00000 R09: ffffffff8a5c5860 R10: 0000000000000007 R11: 00000000fffffffc R12: ffff94fd2236d468 R13: ffff94fd5464d000 R14: ffffce26c09c2340 R15: ffff94fd0605d2b8 FS: 0000000000000000(0000) GS:ffff94fdbfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9cb629a000 CR3: 0000000052c10004 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_106152.29) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_106152.29) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_106152.29) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_106152.29) Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_106152.29) Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_106152.29) Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_106152.29) Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_106152.29) Lustre: 8603:0:(client.c:2362:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1721091491/real 0] req@ffff94fd0fafb0c0 x1804691444261888/t0(0) o400->lustre-MDT0001-mdc-ffff94fd03f5d800@10.240.23.50@tcp:12/10 lens 224/224 e 0 to 1 dl 1721091507 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: lustre-MDT0001-mdc-ffff94fd03f5d800: Connection to lustre-MDT0001 (at 10.240.23.50@tcp) was lost; in progress operations using this service will wait for recovery to complete | Link to test |
replay-single test 135: Server failure in lock replay phase | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover virtio_blk failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffa94480753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002b0b7867 RBX: ffffe103c0ac2dc0 RCX: 0000000000000200 RDX: 7fffffffd4f48798 RSI: ffff94cbeb0b7000 RDI: ffff94cbd3b04000 RBP: 00007fea35b04000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffe9 R12: ffff94cc1fcde820 R13: ffff94cc28660000 R14: ffffe103c04ec100 R15: ffff94cc1c551570 FS: 0000000000000000(0000) GS:ffff94cc7fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fea341b2024 CR3: 0000000066c10003 CR4: 00000000000606e0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: sync; sync; sync Autotest: Test running for 205 minutes (lustre-master_full-dne-zfs-part-1_4552.10) Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-OST0000 notransno Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-OST0000 readonly LustreError: 342105:0:(osd_handler.c:698:osd_ro()) lustre-OST0000: *** setting device osd-zfs read-only *** Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000 | Link to test |
sanity-pcc test 20: Auto attach works after the inode was once evicted from cache | watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic crc32c_intel virtio_net ata_piix libata net_failover failover serio_raw virtio_blk CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffbae280753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000053d2d867 RBX: ffffe266c14f4b40 RCX: 0000000000000200 RDX: 7fffffffac2d2798 RSI: ffff9b0c93d2d000 RDI: ffff9b0c9c7f3000 RBP: 0000561ccdff3000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff4 R12: ffff9b0c79e7ef98 R13: ffff9b0c5a65d000 R14: ffffe266c171fcc0 R15: ffff9b0c6f52fae0 FS: 0000000000000000(0000) GS:ffff9b0cffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffda90d7224 CR3: 0000000018c10003 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl set_param debug=-1 debug_mb=150 Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-pcc test_20: @@@@@@ FAIL: \/mnt\/lustre\/f20.sanity-pcc expected pcc state: readwrite, but got: none Lustre: DEBUG MARKER: sanity-pcc test_20: @@@@@@ FAIL: /mnt/lustre/f20.sanity-pcc expected pcc state: readwrite, but got: none Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2024-07-07/lustre-b_es-reviews_review-dne-part-7_19423_16_0fb3f3a1-5141-4af6-8a5e-eeb2d13c9cc1//sanity-pcc.test_20.debug_log.$(hostname -s).1720388886.log; Autotest: Test running for 195 minutes (lustre-b_es-reviews_review-dne-part-7_19423.16) Lustre: lustre-MDT0001: Client c497ef3c-da29-4a94-81f5-e8dd962f064b (at 10.240.39.85@tcp) reconnecting Lustre: lustre-MDT0003: Received new MDS connection from 10.240.39.89@tcp, keep former export from same NID Lustre: HSM agent c497ef3c-da29-4a94-81f5-e8dd962f064b already registered Lustre: 386229:0:(service.c:2157:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 8s req@00000000b8b4df30 x1803946525602432/t0(0) o400->lustre-MDT0002-mdtlov_UUID@10.240.39.89@tcp:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'kworker.0' | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34] Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr virtio_balloon joydev i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw virtio_blk failover CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa88680753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000022788867 RBX: ffffde970089e200 RCX: 0000000000000200 RDX: 7fffffffdd877798 RSI: ffff9af0a2788000 RDI: ffff9af0a2f1e000 RBP: 000055cd3111e000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff9af0842468f0 R13: ffff9af13fd48000 R14: ffffde97008bc780 R15: ffff9af09b1dd3a0 FS: 0000000000000000(0000) GS:ffff9af13fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555d9021ad2c CR3: 0000000009810002 CR4: 00000000000606e0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.el8_10.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 15 minutes (lustre-reviews_review-dne-part-9_105840.29) Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_105840.29) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_105840.29) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_105840.29) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_105840.29) Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_105840.29) Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_105840.29) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:31] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw virtio_blk net_failover failover CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffa9b600733d48 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000077703867 RBX: ffffdbae41b43640 RCX: 0000000000000200 RDX: 7fffffff888fc798 RSI: ffff9596f7703000 RDI: ffff9596ed0d9000 RBP: 0000555eab8d9000 R08: ffffdbae42480518 R09: ffff95973ffce000 R10: 00000000000305c0 R11: ffffffffffffffe8 R12: ffffdbae41ddc0c0 R13: ffff959727d536c8 R14: ffff9596b64adf00 R15: ffff95971357f2b8 FS: 0000000000000000(0000) GS:ffff95973fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000313000 CR3: 000000003880a004 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x6b6/0xf10 khugepaged+0xb5b/0x1150 ? finish_wait+0x80/0x80 ? collapse_huge_page+0xf10/0xf10 kthread+0x112/0x130 ? kthread_flush_work_fn+0x10/0x10 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x80 panic+0xe7/0x2a9 ? __switch_to_asm+0x51/0x70 watchdog_timer_fn.cold.8+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 115 minutes (lustre-b2_15_full-part-1_94.100) Autotest: Test running for 120 minutes (lustre-b2_15_full-part-1_94.100) Autotest: Test running for 125 minutes (lustre-b2_15_full-part-1_94.100) | Link to test |
sanity-quota test 7a: Quota reintegration (global index) | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 virtio_balloon ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw net_failover virtio_blk failover CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff98cb00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000000fa06867 RBX: ffffeb28403e8180 RCX: 0000000000000200 RDX: 7ffffffff05f9798 RSI: ffff8d310fa06000 RDI: ffff8d3144da3000 RBP: 000055e6577a3000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff9 R12: ffff8d3142587d18 R13: ffff8d312d44d000 R14: ffffeb28411368c0 R15: ffff8d313b550e80 FS: 0000000000000000(0000) GS:ffff8d31bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055ba98fa01ec CR3: 000000002ba10003 CR4: 00000000001706f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8f2/0x1020 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0 Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1 Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.enabled Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.enabled Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.enabled Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1 Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost1 Lustre: Failing over lustre-OST0000 Lustre: server umount lustre-OST0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: ! zpool list -H lustre-ost1 >/dev/null 2>&1 || LustreError: lustre-OST0000: not available for connect from 10.240.26.176@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: lustre-OST0000: not available for connect from 10.240.26.176@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled LustreError: 57797:0:(qsd_reint.c:633:qqi_reint_delayed()) lustre-OST0001: Delaying reintegration for qtype:0 until pending updates are flushed. LustreError: 57797:0:(qsd_reint.c:633:qqi_reint_delayed()) Skipped 5 previous similar messages LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1 Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs; LustreError: lustre-OST0000: not available for connect from 10.240.26.176@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 2 previous similar messages Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1 Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre -o localrecov lustre-ost1/ost1 /mnt/lustre-ost1 LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. 
LustreError: Skipped 6 previous similar messages Lustre: lustre-OST0000: Imperative Recovery enabled, recovery window shrunk from 60-180 down to 60-180 Lustre: lustre-OST0000: in recovery but waiting for the first client to connect Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST0000-super.width=65536 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect Lustre: lustre-OST0000: Recovery over after 0:09, of 2 clients 2 recovered and 0 were evicted. Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all LustreError: 36287:0:(qsd_reint.c:633:qqi_reint_delayed()) lustre-OST0000: Delaying reintegration for qtype:0 until pending updates are flushed. LustreError: 36287:0:(qsd_reint.c:633:qqi_reint_delayed()) Skipped 2 previous similar messages Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=+quota+trace Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.info | Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.info | Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475 Autotest: Test running for 40 minutes (lustre-reviews_review-zfs_105753.4) Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.info | Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.info | Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.info | Lustre: DEBUG MARKER: /usr/sbin/lctl dl Lustre: DEBUG MARKER: 
PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.info | | Link to test |
parallel-scale test write_append_truncate: write_append_truncate | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs intel_rapl_msr lockd intel_rapl_common grace fscache crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net virtio_blk net_failover failover [last unloaded: dm_flakey] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb31c80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000036fb9867 RBX: ffffede680dbee40 RCX: 0000000000000200 RDX: 7fffffffc9046798 RSI: ffff983af6fb9000 RDI: ffff983ac087d000 RBP: 0000562be067d000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff3 R12: ffff983b184e53e8 R13: ffff983b6946d000 R14: ffffede680021f40 R15: ffff983ac3c29740 FS: 0000000000000000(0000) GS:ffff983b7fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b47cd31da0 CR3: 00000000a7a10002 CR4: 00000000001706f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 455 minutes (lustre-master_full-dne-part-1_4543.7) Autotest: Test running for 460 minutes (lustre-master_full-dne-part-1_4543.7) Autotest: Test running for 465 minutes (lustre-master_full-dne-part-1_4543.7) | Link to test |
sanity-quota test 18: MDS failover while writing, no watchdog triggered (b14840) | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: ib_core nfsv3 nfsd nfs_acl mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover failover serio_raw virtio_blk CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.10.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 7e 41 00 9d 30 c0 e9 98 7e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 7e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffac6a00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000001359e845 RBX: fffffa2c404d6780 RCX: 0000000000000200 RDX: 7fffffffeca617ba RSI: ffff9cc71359e000 RDI: ffff9cc73405a000 RBP: 0000561167a5a000 R08: fffffa2c40b55d88 R09: 0000000000000000 R10: 0000000000000007 R11: 00000000fffffffd R12: ffff9cc7295312d0 R13: ffff9cc766865000 R14: fffffa2c40d01680 R15: ffff9cc7357bebc8 FS: 0000000000000000(0000) GS:ffff9cc7bfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056548c7267d0 CR3: 0000000064e10002 CR4: 00000000000606e0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.10.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark User quota \(limit: 200\) Lustre: DEBUG MARKER: User quota (limit: 200) Lustre: DEBUG MARKER: /usr/sbin/lctl mark Write 100M \(buffered\) ... Lustre: DEBUG MARKER: Write 100M (buffered) ... 
Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname) Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi Lustre: DEBUG MARKER: /usr/sbin/lctl mark Fail mds for 0 seconds Lustre: DEBUG MARKER: Fail mds for 0 seconds Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1717593672/real 1717593672] req@00000000c35ba9d6 x1801007592684736/t0(0) o400->lustre-MDT0000-mdc-ffff9cc7296c7800@10.240.30.158@tcp:12/10 lens 224/224 e 0 to 1 dl 1717593681 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection to lustre-MDT0000 (at 10.240.30.158@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.30.158@tcp: Connection to MGS (at 10.240.30.158@tcp) was lost; in progress operations using this service will fail Lustre: Evicted from MGS (at 10.240.30.158@tcp) after server handle changed from 0x357653082f39c9cc to 0x357653082f494d50 Lustre: MGC10.240.30.158@tcp: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-130vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 LustreError: 167-0: lustre-MDT0000-mdc-ffff9cc7296c7800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail. LustreError: Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp) Lustre: DEBUG MARKER: onyx-130vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1717593784/real 1717593786] req@000000002e85f9e8 x1801007592687424/t0(0) o400->lustre-MDT0000-mdc-ffff9cc7296c7800@10.240.30.158@tcp:12/10 lens 224/224 e 0 to 1 dl 1717593793 ref 1 fl Rpc:RXNQ/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection to lustre-MDT0000 (at 10.240.30.158@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp) Autotest: Test running for 320 minutes (lustre-b_es6_0_full-part-2_650.120) Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-60vm4.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection to lustre-MDT0000 (at 10.240.30.158@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp) Lustre: DEBUG MARKER: onyx-60vm4.onyx.whamcloud.com: executing 
wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec | Link to test |
sanity test 133f: Check reads/writes of client lustre proc files with bad area io | watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34] Modules linked in: dm_flakey obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr joydev i2c_piix4 virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: dm_flakey] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa43100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000076051867 RBX: ffffca4f41d81440 RCX: 0000000000000200 RDX: 7fffffff89fae798 RSI: ffff94f2b6051000 RDI: ffff94f27e2f5000 RBP: 000055796c4f5000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff8 R12: ffff94f26bc477a8 R13: ffff94f2b5252800 R14: ffffca4f40f8bd40 R15: ffff94f245728d98 FS: 0000000000000000(0000) GS:ffff94f2ffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055d4db1322b8 CR3: 0000000072c10002 CR4: 00000000000606f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1 Lustre: Failing over lustre-MDT0000 Lustre: lustre-MDT0000: Not available for connect from 10.240.23.65@tcp (stopping) Lustre: lustre-MDT0000: Not available for connect from 10.240.23.64@tcp (stopping) Lustre: lustre-MDT0000: Not available for connect from 10.240.23.66@tcp (stopping) Lustre: lustre-MDT0000: Not available for connect from 10.240.23.65@tcp (stopping) Lustre: Skipped 6 previous similar messages Lustre: lustre-MDT0000: Not available for connect from 10.240.23.65@tcp (stopping) Lustre: Skipped 9 previous similar messages LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.23.66@tcp (no target). 
If you are running an HA pair check that the target is mounted on the other server. Autotest: Test running for 165 minutes (lustre-reviews_review-ldiskfs_103807.37) Lustre: server umount lustre-MDT0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1 Lustre: DEBUG MARKER: modprobe -r dm-flakey Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: test -b /dev/vg_Role_MDS/mdt1 Lustre: DEBUG MARKER: blockdev --getsz /dev/vg_Role_MDS/mdt1 2>/dev/null Lustre: DEBUG MARKER: dmsetup create mds1_flakey --table "0 4194304 linear /dev/vg_Role_MDS/mdt1 0" Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1 Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1 LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180 Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted. 
Lustre: lustre-OST0002-osc-MDT0000: update sequence from 0x2c0000bd2 to 0x2c0000bd3 Lustre: lustre-OST0005-osc-MDT0000: update sequence from 0x380000bd2 to 0x380000bd3 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-36vm2: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-36vm1: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: lustre-OST0003-osc-MDT0000: update sequence from 0x300000bd2 to 0x300000bd3 Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x280000bd2 to 0x280000bd3 Lustre: DEBUG MARKER: onyx-36vm2: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-36vm1: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1 LustreError: 682462:0:(osp_precreate.c:670:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -5 LustreError: 682462:0:(osp_precreate.c:1374:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -5 Lustre: lustre-MDT0000: Not available for connect from 10.240.23.66@tcp (stopping) Lustre: Skipped 2 previous similar messages Lustre: server umount lustre-MDT0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1 Lustre: DEBUG MARKER: modprobe -r dm-flakey Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds1' ' /proc/mounts); Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds1' ' /proc/mounts); Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1 Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: test -b /dev/vg_Role_MDS/mdt1 Lustre: DEBUG MARKER: blockdev --getsz /dev/vg_Role_MDS/mdt1 2>/dev/null Lustre: DEBUG MARKER: dmsetup create mds1_flakey --table "0 4194304 linear /dev/vg_Role_MDS/mdt1 0" Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1 Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1 LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. 
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null Autotest: Test running for 170 minutes (lustre-reviews_review-ldiskfs_103807.37) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon pcspkr i2c_piix4 joydev sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw virtio_blk net_failover failover CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa712c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000010029867 RBX: ffffe8cd40400a40 RCX: 0000000000000200 RDX: 7fffffffeffd6798 RSI: ffff940950029000 RDI: ffff9409787b4000 RBP: 000056482d5b4000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff940974be9da0 R13: ffff94098ee55000 R14: ffffe8cd40e1ed00 R15: ffff9409451e5828 FS: 0000000000000000(0000) GS:ffff9409ffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9af5b47024 CR3: 000000004c810005 CR4: 00000000000606f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_103593.18) Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_103593.18) Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_103593.18) Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_103593.18) Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_103593.18) Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_103593.18) Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_103593.18) Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_103593.18) | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw virtio_blk failover CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff9fd100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000001dde0867 RBX: fffff20ec0777800 RCX: 0000000000000200 RDX: 7fffffffe221f798 RSI: ffff89579dde0000 RDI: ffff895798767000 RBP: 000056070bf67000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffea R12: ffff895783e9db38 R13: ffff895812255000 R14: fffff20ec061d9c0 R15: ffff8957859ed910 FS: 0000000000000000(0000) GS:ffff89583fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056070bf193e8 CR3: 000000008fc10003 CR4: 00000000000606f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon pcspkr joydev sunrpc ext4 mbcache jbd2 ata_generic ata_piix virtio_net libata crc32c_intel net_failover failover virtio_blk serio_raw CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_9.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa67080753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000027570867 RBX: ffffebbf009d5c00 RCX: 0000000000000200 RDX: 7fffffffd8a8f798 RSI: ffff9477a7570000 RDI: ffff9477b93a4000 RBP: 000055bb0dba4000 R08: ffffebbf009d5848 R09: ffff94783ffcf000 R10: 0000000000000001 R11: ffffebbf00b50688 R12: ffff947783995d20 R13: ffff94780924d000 R14: ffffebbf00e4e900 R15: ffff9477855b2910 FS: 0000000000000000(0000) GS:ffff94783fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000558bdbccfa8c CR3: 0000000086c10003 CR4: 00000000001706f0 Call Trace: <IRQ> ? watchdog_timer_fn.cold.10+0x46/0x9e ? watchdog+0x30/0x30 ? __hrtimer_run_queues+0x101/0x280 ? hrtimer_interrupt+0x100/0x220 ? smp_apic_timer_interrupt+0x6a/0x130 ? apic_timer_interrupt+0xf/0x20 </IRQ> ? copy_page+0x7/0x10 collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_9.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x11/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover failover serio_raw virtio_blk CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 6e 41 00 9d 30 c0 e9 98 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffabef8074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000014a40867 RBX: ffffe20bc0529000 RCX: 0000000000000200 RDX: 7fffffffeb5bf798 RSI: ffff9dc0d4a40000 RDI: ffff9dc0ec2a5000 RBP: 0000557f40ea5000 R08: ffffe20bc0532e08 R09: ffff9dc17ffcf000 R10: 0000000000000000 R11: ffffe20bc053d188 R12: ffff9dc0e6080528 R13: ffff9dc17fc45000 R14: ffffe20bc0b0a940 R15: ffff9dc0e61fdd98 FS: 0000000000000000(0000) GS:ffff9dc17fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007eff93769100 CR3: 0000000029410005 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw net_failover virtio_blk failover [last unloaded: dm_flakey] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffb05a4074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000043ee4867 RBX: ffffeb0dc10fb900 RCX: 0000000000000200 RDX: 7fffffffbc11b798 RSI: ffff9bba03ee4000 RDI: ffff9bba13fc4000 RBP: 000055fd8a7c4000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff1 R12: ffff9bb9e7e6fe20 R13: ffff9bba40a72800 R14: ffffeb0dc14ff100 R15: ffff9bba13295e80 FS: 0000000000000000(0000) GS:ffff9bba7fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbbaf370d38 CR3: 000000007f010004 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8d7/0x1000 ? mutex_lock+0xe/0x30 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa47180753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000028646865 RBX: fffffbcb80a19180 RCX: 0000000000000200 RDX: 7fffffffd79b979a RSI: ffff913528646000 RDI: ffff9135241f2000 RBP: 000055699e3f2000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff5 R12: ffff9135360daf90 R13: ffff91359b87d000 R14: fffffbcb80907c80 R15: ffff913505cd61d0 FS: 0000000000000000(0000) GS:ffff9135bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000564dfb065018 CR3: 0000000099e10004 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Link to test | |
sanity test 17o: stat file with incompat LMA feature | watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc dm_mod ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw virtio_blk net_failover failover [last unloaded: obdecho] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff9e35c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000033766865 RBX: ffffd0dd80cdd980 RCX: 0000000000000200 RDX: 7fffffffcc89979a RSI: ffff91c1f3766000 RDI: ffff91c24fdad000 RBP: 0000561f7c9ad000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffef R12: ffff91c1dfdefd68 R13: ffff91c224052800 R14: ffffd0dd823f6b40 R15: ffff91c1c297bae0 FS: 0000000000000000(0000) GS:ffff91c27fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055ab08a86368 CR3: 0000000062610004 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-95vm12.trevis.whamcloud.com: executing set_default_debug all all 4 Lustre: DEBUG MARKER: trevis-95vm12.trevis.whamcloud.com: executing set_default_debug all all 4 Lustre: lustre-OST0006: deleting orphan objects from 0x3c0000bd0:6 to 0x3c0000bd0:33 Lustre: lustre-OST0000: deleting orphan objects from 0x280000bd0:6 to 0x280000bd0:33 Lustre: lustre-OST0001: deleting orphan objects from 0x240000400:87846 to 0x240000400:87873 Lustre: lustre-OST0002: deleting orphan objects from 0x2c0000bd0:7 to 0x2c0000bd0:33 Lustre: lustre-OST0005: deleting orphan objects from 0x380000bd0:7 to 0x380000bd0:33 Lustre: lustre-OST0004: deleting orphan objects from 0x340000400:87943 to 0x340000400:87969 Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071801/real 1708071801] req@ffff91c22bca96c0 x1791006021629312/t0(0) o400->MGC10.240.43.223@tcp@10.240.43.223@tcp:26/25 lens 224/224 e 0 to 1 dl 1708071817 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 22 previous similar messages LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0003: deleting orphan objects from 0x300000bd0:7 to 0x300000bd0:33 Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.43.223@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 2 previous similar messages Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to (at 10.240.43.223@tcp) Lustre: Skipped 6 previous similar messages Lustre: lustre-MDT0000-lwp-OST0002: Connection restored to (at 10.240.43.223@tcp) Lustre: Skipped 1 previous similar message Lustre: Evicted from MGS (at 10.240.43.223@tcp) after server handle changed from 0xf9122bd54ccdc1e9 to 0xf9122bd54ccde7eb Lustre: 66604:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071810/real 1708071810] req@ffff91c20ba15a00 x1791006021629888/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.43.223@tcp:12/10 lens 224/224 e 0 to 1 dl 1708071826 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 66604:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: 'ost_seq' is processing requests too slowly, client may timeout. Late by 6s, missed 1 early replies (reqs waiting=0 active=1, at_estimate=5, delay=11224ms) Lustre: 1280748:0:(service.c:1397:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (5)? 
req@ffff91c1fd68b740 x1791024350124544/t0(0) o700->lustre-MDT0000-mdtlov_UUID@10.240.43.223@tcp:601/0 lens 264/248 e 0 to 0 dl 1708071831 ref 2 fl Interpret:/200/0 rc 0/0 job:'osp-pre-4-0.0' uid:0 gid:0 Lustre: lustre-OST0004: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting Lustre: Skipped 2 previous similar messages Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071843/real 1708071843] req@ffff91c1d20623c0 x1791006021633088/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.43.223@tcp:12/10 lens 224/224 e 0 to 1 dl 1708071859 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.43.223@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 6 previous similar messages LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5 Lustre: ll_ost_seq00_00: service thread pid 1280749 was inactive for 45.610 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Pid: 1280749, comm: ll_ost_seq00_00 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Jan 3 18:54:51 UTC 2024 Call Trace TBD: [<0>] cv_wait_common+0xaf/0x130 [spl] [<0>] txg_wait_synced_impl+0xc6/0x110 [zfs] [<0>] txg_wait_synced+0xc/0x40 [zfs] [<0>] osd_trans_stop+0x510/0x550 [osd_zfs] [<0>] seq_store_update+0x2ff/0x9c0 [fid] [<0>] seq_server_check_and_alloc_super+0x96/0x2f0 [fid] [<0>] seq_server_alloc_meta+0x66/0x650 [fid] [<0>] seq_handler+0x590/0x5a0 [fid] [<0>] tgt_request_handle+0x3f4/0x19a0 [ptlrpc] [<0>] ptlrpc_server_handle_request+0x3ca/0xbd0 [ptlrpc] [<0>] ptlrpc_main+0xc90/0x15b0 [ptlrpc] [<0>] kthread+0x134/0x150 [<0>] ret_from_fork+0x35/0x40 Lustre: lustre-OST0006: haven't heard from client 2f61269b-4f35-4d53-b08c-097cfab6ac41 (at 10.240.43.204@tcp) in 32 seconds. I think it's dead, and I am evicting it. exp ffff91c217125000, cur 1708071882 expire 1708071852 last 1708071850 Lustre: lustre-OST0005: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5 Lustre: lustre-OST0005: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting Lustre: Skipped 2 previous similar messages LNet: There was an unexpected network error while writing to 10.240.43.204: rc = -32 Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to (at 10.240.43.223@tcp) Lustre: Skipped 5 previous similar messages Lustre: lustre-OST0000: Client 2f61269b-4f35-4d53-b08c-097cfab6ac41 (at 10.240.43.204@tcp) reconnecting Lustre: Skipped 2 previous similar messages Lustre: ll_ost00_001: service thread pid 1280741 was inactive for 79.388 seconds. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Pid: 1280741, comm: ll_ost00_001 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Jan 3 18:54:51 UTC 2024 Call Trace TBD: [<0>] dbuf_read_done+0x19/0x100 [zfs] [<0>] arc_read+0xb1e/0x1500 [zfs] [<0>] dbuf_read_impl.constprop.32+0x26c/0x6b0 [zfs] [<0>] dbuf_read+0x1b5/0x580 [zfs] [<0>] dbuf_hold_impl+0x484/0x630 [zfs] [<0>] dbuf_hold_level+0x2b/0x60 [zfs] [<0>] dmu_tx_check_ioerr+0x32/0xd0 [zfs] [<0>] dmu_tx_hold_zap_impl+0x70/0x80 [zfs] [<0>] osd_declare_destroy+0x232/0x480 [osd_zfs] [<0>] ofd_destroy+0x1ae/0xb50 [ofd] [<0>] ofd_destroy_by_fid+0x25e/0x4a0 [ofd] [<0>] ofd_orphans_destroy+0x248/0x910 [ofd] [<0>] ofd_create_hdl+0x189a/0x19b0 [ofd] [<0>] tgt_request_handle+0x3f4/0x19a0 [ptlrpc] [<0>] ptlrpc_server_handle_request+0x3ca/0xbd0 [ptlrpc] [<0>] ptlrpc_main+0xc90/0x15b0 [ptlrpc] [<0>] kthread+0x134/0x150 [<0>] ret_from_fork+0x35/0x40 Lustre: ll_ost00_003: service thread pid 1281652 was inactive for 82.326 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Pid: 1281652, comm: ll_ost00_003 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Jan 3 18:54:51 UTC 2024 Call Trace TBD: [<0>] dbuf_read_done+0x19/0x100 [zfs] [<0>] arc_read+0xb1e/0x1500 [zfs] [<0>] dbuf_read_impl.constprop.32+0x26c/0x6b0 [zfs] [<0>] dbuf_read+0x1b5/0x580 [zfs] [<0>] dbuf_hold_impl+0x484/0x630 [zfs] [<0>] dbuf_hold_level+0x2b/0x60 [zfs] [<0>] dmu_tx_check_ioerr+0x32/0xd0 [zfs] [<0>] dmu_tx_hold_zap_impl+0x70/0x80 [zfs] [<0>] osd_declare_destroy+0x232/0x480 [osd_zfs] [<0>] ofd_destroy+0x1ae/0xb50 [ofd] [<0>] ofd_destroy_by_fid+0x25e/0x4a0 [ofd] [<0>] ofd_orphans_destroy+0x248/0x910 [ofd] [<0>] ofd_create_hdl+0x189a/0x19b0 [ofd] [<0>] tgt_request_handle+0x3f4/0x19a0 [ptlrpc] [<0>] ptlrpc_server_handle_request+0x3ca/0xbd0 [ptlrpc] [<0>] ptlrpc_main+0xc90/0x15b0 [ptlrpc] [<0>] kthread+0x134/0x150 [<0>] ret_from_fork+0x35/0x40 Lustre: ll_ost00_006: service thread pid 1281656 was inactive for 89.030 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 
Lustre: lustre-OST0004: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting Lustre: Skipped 5 previous similar messages Lustre: 1280739:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071910/real 1708071910] req@ffff91c24a16c9c0 x1791006021637824/t0(0) o503->MGC10.240.43.223@tcp@10.240.43.223@tcp:26/25 lens 272/8416 e 0 to 1 dl 1708071928 ref 2 fl Rpc:XQr/200/ffffffff rc 0/-1 job:'ll_cfg_requeue.0' uid:0 gid:0 Lustre: 1280739:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 21 previous similar messages LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5 Lustre: MGC10.240.43.223@tcp: Connection restored to (at 10.240.43.223@tcp) Lustre: Skipped 7 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-95vm10.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-\*.mds_server_uuid LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5 Lustre: MGC10.240.43.223@tcp: Connection restored to (at 10.240.43.223@tcp) Lustre: DEBUG MARKER: trevis-95vm10.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid | Link to test |
sanity test 259: crash at delayed truncate | watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34] Modules linked in: dm_flakey osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover failover virtio_blk [last unloaded: obdecho] CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb67100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000085b46865 RBX: fffff7cf4216d180 RCX: 0000000000000200 RDX: 7fffffff7a4b979a RSI: ffff898b45b46000 RDI: ffff898adef00000 RBP: 000055a63b500000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffed R12: ffff898afda68800 R13: ffff898b37462800 R14: fffff7cf407bc000 R15: ffff898afd91e3a0 FS: 0000000000000000(0000) GS:ffff898b7fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3cf7d37008 CR3: 0000000075a10002 CR4: 00000000001706e0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-*.*OST0000.kbytesfree Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-*.*OST0000.kbytesfree Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x2301 Lustre: *** cfs_fail_loc=2301, val=0*** Lustre: Skipped 3 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-*.*OST0000.kbytesfree Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost1 Lustre: Failing over lustre-OST0000 LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 2 previous similar messages Lustre: server umount lustre-OST0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. 
LustreError: Skipped 4 previous similar messages Lustre: DEBUG MARKER: modprobe dm-flakey; LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 4 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0 Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1 Lustre: DEBUG MARKER: modprobe dm-flakey; Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1 Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey 2>&1 Lustre: DEBUG MARKER: test -b /dev/mapper/ost1_flakey Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 9 previous similar messages Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre -o localrecov /dev/mapper/ost1_flakey /mnt/lustre-ost1 LDISKFS-fs (dm-9): 1 truncate cleaned up LDISKFS-fs (dm-9): recovery complete LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 19 previous similar messages Lustre: lustre-OST0000: Imperative Recovery not enabled, recovery window 60-180 Lustre: Skipped 1 previous similar message Lustre: lustre-OST0000: in recovery but waiting for the first client to connect Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 5 clients reconnect Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST0000-super.width=16384 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: lustre-OST0000: Recovery over after 0:03, of 5 clients 5 recovered and 0 were evicted. Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u | Link to test |
hot-pools test 8: lamigo: start with debug (-b) command line option | watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34] Modules linked in: ib_core mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sunrpc pcspkr joydev virtio_balloon i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw net_failover failover virtio_blk CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 6e 41 00 9d 30 c0 e9 98 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffff9df68074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002b419867 RBX: fffff887c0ad0640 RCX: 0000000000000200 RDX: 7fffffffd4be6798 RSI: ffff89f66b419000 RDI: ffff89f6799c6000 RBP: 000055ada3bc6000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffef R12: ffff89f666218e30 R13: ffff89f6ffc4d000 R14: fffff887c0e67180 R15: ffff89f6660c39f8 FS: 0000000000000000(0000) GS:ffff89f6ffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555a7b084e58 CR3: 0000000050a10001 CR4: 00000000001706f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-72vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-72vm10.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-72vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-72vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-72vm10.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-72vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 | Link to test |
sanityn test 43k: unlink vs create | watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover failover virtio_blk [last unloaded: dm_flakey] CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa5ef00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 0000000034b0d867 RBX: ffffce81c0d2c340 RCX: 0000000000000200 RDX: 7fffffffcb4f2798 RSI: ffff912574b0d000 RDI: ffff9125816ce000 RBP: 000055697e4ce000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff91256ca0a670 R13: ffff9125ba24d000 R14: ffffce81c105b380 R15: ffff91254597a2b8 FS: 0000000000000000(0000) GS:ffff9125ffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000558b87975f44 CR3: 0000000078810001 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497 Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 385063:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497 Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 385061:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497 Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 385061:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 385061:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4498 Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: 
/usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 385063:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 385061:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 385063:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497 Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) Skipped 2 previous similar messages LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) Skipped 2 previous similar messages LustreError: 
385062:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497 LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) Skipped 2 previous similar messages Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) Skipped 5 previous similar messages LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) Skipped 5 previous similar messages LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4498 LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) Skipped 5 previous similar messages Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl 
get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n 
fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) Skipped 9 previous similar messages LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) Skipped 9 previous similar messages LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497 LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) Skipped 9 previous similar messages Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || 
true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count 
ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: 
/usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) Skipped 21 previous similar messages LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) Skipped 21 previous similar messages LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4498 LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) Skipped 21 previous similar messages Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: 
/usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: 
DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n 
ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param 
-n fail_loc=0x8000016a || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 16a sleeping LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) Skipped 40 previous similar messages LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 16a waking LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) Skipped 40 previous similar messages LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 16a awake: rc=4497 LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) Skipped 40 previous similar messages Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true Lustre: 11314:0:(client.c:2321:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1700728104/real 1700728104] req@00000000d5582155 x1783327576143296/t0(0) o13->lustre-OST0003-osc-MDT0003@10.240.43.143@tcp:7/4 lens 224/368 e 0 to 1 dl 1700728111 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-3-3.0' Lustre: 11314:0:(client.c:2321:ptlrpc_expire_one_request()) Skipped 10 previous similar messages Lustre: lustre-OST0003-osc-MDT0003: Connection to lustre-OST0003 (at 10.240.43.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message LustreError: 166-1: MGC10.240.43.26@tcp: Connection to MGS (at
10.240.43.26@tcp) was lost; in progress operations using this service will fail | Link to test |
obdfilter-survey test 1c: Object Storage Targets survey, big batch | watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:33] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: dm_flakey] CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.19.2.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 c0 6e 22 00 9d 30 c0 e9 b8 6e 22 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 a1 6e 22 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa30b4074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002a6d9867 RBX: ffffc47c40a9b640 RCX: 0000000000000200 RDX: 0000000000000000 RSI: ffff8c6faa6d9000 RDI: ffff8c6fd4261000 RBP: 000055c38aa61000 R08: ffffffffffffffff R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff6 R12: ffff8c6f82cb7308 R13: ffff8c6f81cfa800 R14: ffffc47c41509840 R15: ffff8c6f833a9910 FS: 0000000000000000(0000) GS:ffff8c703fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f7d6d555ac CR3: 0000000075010005 CR4: 00000000000606e0 Call Trace: collapse_huge_page+0x8e4/0x1010 ? mutex_lock+0xe/0x30 khugepaged+0xed0/0x11e0 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x10b/0x130 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-425.19.2.el8_lustre.ddn17.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: 10904:0:(client.c:2321:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1699645028/real 1699645028] req@00000000a9eabe2f x1782190199880704/t0(0) o13->lustre-OST0001-osc-MDT0003@10.240.42.238@tcp:7/4 lens 224/368 e 0 to 1 dl 1699645035 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-1-3.0' Lustre: 10904:0:(client.c:2321:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-OST0001-osc-MDT0003: Connection to lustre-OST0001 (at 10.240.42.238@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 3 previous similar messages LustreError: 166-1: MGC10.240.42.241@tcp: Connection to MGS (at 10.240.42.241@tcp) was lost; in progress operations using this service will fail LustreError: Skipped 4 previous similar messages | Link to test |
sanity test 127c: test llite extent stats with regular | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34] Modules linked in: obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: llog_test] CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.21.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffaf7fc0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000004f864867 RBX: ffffd8f3c13e1900 RCX: 0000000000000200 RDX: 7fffffffb079b798 RSI: ffff98d7cf864000 RDI: ffff98d7e4aa6000 RBP: 000055a9896a6000 R08: 00000000000396d0 R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffffc R12: ffff98d7ea293530 R13: ffff98d795848000 R14: ffffd8f3c192a980 R15: ffff98d79ed59000 FS: 0000000000000000(0000) GS:ffff98d83fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a98a276000 CR3: 0000000013e10005 CR4: 00000000001706e0 Call Trace: collapse_huge_page+0x8d7/0x1000 khugepaged+0xed9/0x11e0 ? __schedule+0x2d9/0x870 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x134/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.21.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: 9981:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1695922608/real 1695922608] req@00000000068ea7bb x1778290835190784/t0(0) o13->lustre-OST0004-osc-MDT0000@10.240.29.153@tcp:7/4 lens 224/368 e 0 to 1 dl 1695922624 ref 1 fl Rpc:XQr/200/ffffffff rc 0/-1 job:'osp-pre-4-0.0' uid:0 gid:0 Lustre: lustre-OST0004-osc-MDT0000: Connection to lustre-OST0004 (at 10.240.29.153@tcp) was lost; in progress operations using this service will wait for recovery to complete | Link to test |
replay-single test 70b: dbench 4mdts recovery; 2 clients | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:33] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw failover virtio_blk CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-348.7.1.el8_5.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa42f4074bd20 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002d2d5825 RBX: ffffea8740b4b540 RCX: 0000000000000200 RDX: 0000000000000000 RSI: ffff9089ad2d5000 RDI: ffff9089b7ad5000 RBP: 0000557c626d5000 R08: ffffffffffffffe9 R09: 0000000000000088 R10: ffffffffffffffff R11: 00000000fffffffe R12: ffff9089829156a8 R13: ffff908a0f4617c0 R14: ffffea8740deb540 R15: ffff908985c67910 FS: 0000000000000000(0000) GS:ffff908a3fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8408f587bc CR3: 000000008de10006 CR4: 00000000000606e0 Call Trace: collapse_huge_page+0x914/0xff0 khugepaged+0xecc/0x11d0 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x116/0x130 ? kthread_flush_work_fn+0x10/0x10 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-348.7.1.el8_5.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x80 panic+0xe7/0x2a9 ? __switch_to_asm+0x51/0x70 watchdog_timer_fn.cold.9+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: mkdir -p /mnt/lustre Lustre: DEBUG MARKER: Lustre: DEBUG MARKER: mount | grep /mnt/lustre' ' Lustre: DEBUG MARKER: set -x; MISSING_DBENCH_OK= PATH=/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests/:/usr/share/doc/dbench/loadfiles DBENCH_LIB=/usr/share/doc/dbench/loadfiles Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started rundbench load pid=213562 ... Lustre: DEBUG MARKER: Started rundbench load pid=213562 ... 
Lustre: DEBUG MARKER: killall -0 dbench Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname) Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000 Lustre: 7987:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1672887608/real 1672887608] req@000000007c21fefb x1754142100258752/t0(0) o400->MGC10.240.24.91@tcp@10.240.24.91@tcp:26/25 lens 224/224 e 0 to 1 dl 1672887616 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' LustreError: 166-1: MGC10.240.24.91@tcp: Connection to MGS (at 10.240.24.91@tcp) was lost; in progress operations using this service will fail LustreError: Skipped 2 previous similar messages Lustre: lustre-MDT0002-mdc-ffff908982aab000: Connection to lustre-MDT0002 (at 10.240.24.91@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 2 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70b fail mds1 1 times | Link to test |
recovery-mds-scale test failover_mds: failover MDS | watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:33] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sd_mod t10_pi sg iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel virtio_blk serio_raw net_failover failover CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-372.32.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 c0 fb 23 00 9d 30 c0 e9 b8 fb 23 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 a1 fb 23 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffa9bf8074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000000d04e867 RBX: ffffd5ad80341380 RCX: 0000000000000200 RDX: 0000000000000000 RSI: ffff91790d04e000 RDI: ffff917900284000 RBP: 000055cc59c84000 R08: ffffffffffffffff R09: 0000000000000011 R10: 0000000000000007 R11: 00000000fffffff7 R12: ffff91794df36420 R13: ffff91797446d000 R14: ffffd5ad8000a100 R15: ffff917905c2d9f8 FS: 0000000000000000(0000) GS:ffff9179bfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000559bb46731ac CR3: 0000000072a10006 CR4: 00000000000606e0 Call Trace: collapse_huge_page+0x8e4/0x1010 khugepaged+0xecf/0x11e0 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x10a/0x120 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-372.32.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x70 watchdog_timer_fn.cold.10+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on onyx-42vm8 Lustre: DEBUG MARKER: Started client load: dd on onyx-42vm8 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-42vm9 Lustre: DEBUG MARKER: Started client load: tar on onyx-42vm9 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956314/real 1670956314] req@0000000083751d84 x1752124496070976/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670956321 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 1 previous similar message Lustre: lustre-OST0004: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 2 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config 
ioctl super lfsck all 4 Lustre: lustre-OST0000: Received new MDS connection from 10.240.23.150@tcp, remove former export from same NID Lustre: Skipped 2 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670956332/real 1670956350] req@00000000474257a5 x1752124496073536/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956376 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x2c450038ab2a1d7b to 0xc9cf0cbefe9ea54f Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 1 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 1 times, and counting... Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956314/real 1670956314] req@00000000125812a3 x1752124496071040/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956358 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 29 previous similar messages Lustre: lustre-OST0005: deleting orphan objects from 0x0:49 to 0x0:65 Lustre: lustre-OST0002: deleting orphan objects from 0x0:49 to 0x0:65 Lustre: lustre-OST0006: deleting orphan objects from 0x0:49 to 0x0:65 Lustre: lustre-OST0001: deleting orphan objects from 0x0:49 to 0x0:65 Lustre: lustre-OST0004: deleting orphan objects from 0x0:48 to 0x0:65 Lustre: lustre-OST0003: deleting orphan objects from 0x0:48 to 0x0:65 Lustre: lustre-OST0000: deleting orphan objects from 0x0:49 to 0x0:65 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956319/real 1670956319] req@00000000be1be6fa x1752124496071552/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956363 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956326/real 1670956326] req@000000002933c035 x1752124496072512/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956370 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=62 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=62 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: 
DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670957518/real 1670957518] req@00000000e46e732b x1752124496455872/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670957525 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 2 previous similar messages Lustre: lustre-OST0002: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 2 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670957537/real 1670957556] req@000000005325c217 x1752124496458496/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957583 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 6 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670957542/real 1670957560] req@00000000ddc349d1 x1752124496458944/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957588 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages 
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: 36949:0:(service.c:2345:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/1s); client may timeout req@00000000592c448f x1752125968220800/t0(0) o8-><?>@<unknown>:0/0 lens 520/416 e 0 to 0 dl 1670957560 ref 1 fl Complete:/0/0 rc -114/-114 job:'' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670957518/real 1670957518] req@0000000071ac08ab x1752124496455936/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957564 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xc9cf0cbefe9ea54f to 0x3359c7c17f295d00 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 2 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 2 times, and counting... Lustre: lustre-OST0004: deleting orphan objects from 0x0:600 to 0x0:705 Lustre: lustre-OST0003: deleting orphan objects from 0x0:720 to 0x0:833 Lustre: lustre-OST0000: deleting orphan objects from 0x0:636 to 0x0:705 Lustre: lustre-OST0002: deleting orphan objects from 0x0:632 to 0x0:705 Lustre: lustre-OST0005: deleting orphan objects from 0x0:578 to 0x0:641 Lustre: lustre-OST0001: deleting orphan objects from 0x0:680 to 0x0:769 Lustre: lustre-OST0006: deleting orphan objects from 0x0:823 to 0x0:961 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670957530/real 1670957530] req@00000000f157610b x1752124496457600/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957576 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 44 previous similar messages Lustre: lustre-OST0000: Export 00000000515a5654 already connecting from 10.240.23.192@tcp Lustre: lustre-OST0000: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: lustre-OST0001: Export 00000000756ae6ba already connecting from 10.240.23.192@tcp Lustre: lustre-OST0001: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: lustre-OST0002: Export 00000000780f6cbc already connecting from 10.240.23.192@tcp Lustre: lustre-OST0002: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: lustre-OST0004: Export 00000000934f6a39 already connecting from 10.240.23.192@tcp Lustre: lustre-OST0004: Export 00000000934f6a39 already connecting from 10.240.23.192@tcp Lustre: lustre-OST0004: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=1276 DURATION=86400 
PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=1276 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670958715/real 1670958715] req@000000000a5c192c x1752124496890688/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670958722 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 8 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 3 previous similar messages Lustre: lustre-OST0004: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 3 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670958744/real 1670958754] req@00000000046c334d x1752124496894272/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670958788 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 7 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x3359c7c17f295d00 to 0x72c4ab34c3498060 Lustre: 
MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0001: deleting orphan objects from 0x0:1429 to 0x0:1473 Lustre: lustre-OST0002: deleting orphan objects from 0x0:1667 to 0x0:1729 Lustre: lustre-OST0005: deleting orphan objects from 0x0:1360 to 0x0:1441 Lustre: lustre-OST0003: deleting orphan objects from 0x0:1509 to 0x0:1633 Lustre: lustre-OST0004: deleting orphan objects from 0x0:1341 to 0x0:1505 Lustre: lustre-OST0000: deleting orphan objects from 0x0:1269 to 0x0:1345 Lustre: lustre-OST0006: deleting orphan objects from 0x0:1604 to 0x0:1697 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670958715/real 1670958715] req@000000006a500430 x1752124496890816/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670958759 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 30 previous similar messages Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 3 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 3 times, and counting... 
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670958727/real 1670958727] req@00000000722f25f5 x1752124496892288/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670958771 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 16 previous similar messages connection1:0: detected conn error (1020) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=2470 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=2470 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670959921/real 1670959921] req@00000000d1c8ecc0 x1752124497328000/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670959928 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 2 previous similar messages Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Skipped 1 previous similar message Lustre: lustre-OST0002: Export 0000000093c4dca7 already connecting from 10.240.23.149@tcp Lustre: Skipped 1 previous similar 
message Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670959944/real 1670959963] req@0000000005137350 x1752124497331328/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670959990 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: lustre-MDT0000-lwp-OST0005: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 5 previous similar messages Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0002: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670959921/real 1670959921] req@000000005e086b00 x1752124497328064/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670959967 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 4 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 4 times, and counting... 
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x72c4ab34c3498060 to 0xd62eb1972448fc10 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: lustre-OST0002: deleting orphan objects from 0x0:2334 to 0x0:2465 Lustre: lustre-OST0004: deleting orphan objects from 0x0:2117 to 0x0:2209 Lustre: lustre-OST0001: deleting orphan objects from 0x0:2127 to 0x0:2209 Lustre: lustre-OST0000: deleting orphan objects from 0x0:2002 to 0x0:2081 Lustre: lustre-OST0003: deleting orphan objects from 0x0:2255 to 0x0:2369 Lustre: lustre-OST0006: deleting orphan objects from 0x0:2301 to 0x0:2433 Lustre: lustre-OST0005: deleting orphan objects from 0x0:2100 to 0x0:2177 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670959933/real 1670959933] req@000000002010db62 x1752124497329536/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670959979 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 51 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=3678 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=3678 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670961120/real 1670961120] req@000000000711473d x1752124497763264/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670961127 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 
16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 17 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 4 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0003: deleting orphan objects from 0x0:3132 to 0x0:3169 Lustre: lustre-OST0001: deleting orphan objects from 0x0:3000 to 0x0:3137 Lustre: lustre-OST0000: deleting orphan objects from 0x0:2846 to 0x0:2945 Lustre: lustre-OST0006: deleting orphan objects from 0x0:3299 to 0x0:3361 Lustre: lustre-OST0002: deleting orphan objects from 0x0:3241 to 0x0:3361 Lustre: lustre-OST0005: deleting orphan objects from 0x0:2943 to 0x0:3041 Lustre: lustre-OST0004: deleting orphan objects from 0x0:3006 to 0x0:3073 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670961144/real 1670961158] req@00000000a2b3621e x1752124497766784/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670961191 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0006: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 6 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xd62eb1972448fc10 to 0x20d5d1f7f970463d Lustre: lustre-MDT0000-lwp-OST0001: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 5 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 5 times, and counting... 
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670961120/real 1670961120] req@0000000079b128b7 x1752124497763328/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670961167 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670961127/real 1670961127] req@000000009d89b971 x1752124497764544/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670961174 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 17 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=4868 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=4868 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670962326/real 1670962326] req@00000000dd29ada6 x1752124498303488/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670962333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 23 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp 
Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670962344/real 1670962363] req@00000000f1d2af20 x1752124498306112/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670962390 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 5 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670962349/real 1670962367] req@00000000a7771c63 x1752124498306560/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670962395 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 6 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 6 times, and counting... Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670962349/real 1670962376] req@00000000015da59a x1752124498306496/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670962395 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 16 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x20d5d1f7f970463d to 0x90e0347dd307f316 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: lustre-OST0000: deleting orphan objects from 0x0:3844 to 0x0:3937 Lustre: lustre-OST0001: deleting orphan objects from 0x0:3942 to 0x0:4129 Lustre: lustre-OST0005: deleting orphan objects from 0x0:3904 to 0x0:4033 Lustre: lustre-OST0004: deleting orphan objects from 0x0:3961 to 0x0:4065 Lustre: lustre-OST0003: deleting orphan objects from 0x0:4037 to 0x0:4161 Lustre: lustre-OST0002: deleting orphan objects from 0x0:4119 to 0x0:4289 Lustre: lustre-OST0006: deleting orphan objects from 0x0:4177 to 0x0:4353 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=6079 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=6079 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing 
_wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670963518/real 1670963518] req@000000009ec0395c x1752124498738240/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670963525 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 45 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0001: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670963536/real 1670963554] req@000000002d0d1c97 x1752124498740800/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670963580 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 10 previous similar messages Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670963541/real 1670963558] req@000000005937a913 x1752124498741248/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670963585 ref 1 fl 
Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 7 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 7 times, and counting... Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670963523/real 1670963523] req@00000000793b7852 x1752124498739136/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670963567 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x90e0347dd307f316 to 0x5d59d660893718a7 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 8 previous similar messages Lustre: lustre-OST0002: deleting orphan objects from 0x0:5156 to 0x0:5345 Lustre: lustre-OST0001: deleting orphan objects from 0x0:4919 to 0x0:4993 Lustre: lustre-OST0003: deleting orphan objects from 0x0:4968 to 0x0:5089 Lustre: lustre-OST0000: deleting orphan objects from 0x0:4675 to 0x0:4833 Lustre: lustre-OST0004: deleting orphan objects from 0x0:4868 to 0x0:5025 Lustre: lustre-OST0005: deleting orphan objects from 0x0:4804 to 0x0:4929 Lustre: lustre-OST0006: deleting orphan objects from 0x0:5130 to 0x0:5281 Lustre: ll_ost_io00_002: service thread pid 16621 was inactive for 62.657 seconds. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Pid: 16621, comm: ll_ost_io00_002 4.18.0-372.32.1.el8_lustre.x86_64 #1 SMP Thu Oct 27 18:54:42 UTC 2022 Call Trace TBD: [<0>] cv_wait_common+0xaf/0x130 [spl] [<0>] txg_wait_synced_impl+0xc6/0x110 [zfs] [<0>] txg_wait_synced+0xc/0x40 [zfs] [<0>] dmu_tx_wait+0x377/0x390 [zfs] [<0>] dmu_tx_assign+0x157/0x470 [zfs] [<0>] osd_trans_start+0x1b7/0x430 [osd_zfs] [<0>] ofd_write_attr_set+0x11d/0x1070 [ofd] [<0>] ofd_commitrw_write+0x205/0x1a70 [ofd] [<0>] ofd_commitrw+0x5f0/0xd70 [ofd] [<0>] obd_commitrw+0x1b0/0x380 [ptlrpc] [<0>] tgt_brw_write+0x153f/0x1ad0 [ptlrpc] [<0>] tgt_request_handle+0xc90/0x19c0 [ptlrpc] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc] [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc] [<0>] kthread+0x10a/0x120 [<0>] ret_from_fork+0x35/0x40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=7270 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=7270 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670964730/real 1670964730] req@00000000a2c96647 x1752124499072128/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670964737 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 54 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 5 previous similar messages Lustre: 
16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670964749/real 1670964768] req@000000000a86c8c0 x1752124499074752/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670964793 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 3 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670964754/real 1670964772] req@000000003f5caa9f x1752124499075136/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670964798 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 8 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 8 times, and counting... Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670964764/real 1670964780] req@0000000070fe4b84 x1752124499076288/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670964808 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 14 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x5d59d660893718a7 to 0x3ab30235b82c90f5 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0001: deleting orphan objects from 0x0:5382 to 0x0:5441 Lustre: lustre-OST0000: deleting orphan objects from 0x0:5201 to 0x0:5281 Lustre: lustre-OST0002: deleting orphan objects from 0x0:5865 to 0x0:6049 Lustre: lustre-OST0006: deleting orphan objects from 0x0:5650 to 0x0:5729 Lustre: lustre-OST0005: deleting orphan objects from 0x0:5325 to 0x0:5377 Lustre: lustre-OST0004: deleting orphan objects from 0x0:5418 to 0x0:5473 Lustre: lustre-OST0003: deleting orphan objects from 0x0:5458 to 0x0:5537 Lustre: lustre-OST0006: Export 000000006ed2958b already connecting from 10.240.23.192@tcp Lustre: lustre-OST0006: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=8485 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=8485 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before 
next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670965927/real 1670965927] req@00000000bae788b6 x1752124499506496/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670965934 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 30 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670965927/real 1670965927] req@00000000a6417e9c x1752124499506560/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670965938 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 5 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670965932/real 1670965932] req@000000000d5229a5 x1752124499507072/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670965943 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing 
set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x3ab30235b82c90f5 to 0x1cd08a9d1032c7de Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 7 previous similar messages Lustre: lustre-OST0003: deleting orphan objects from 0x0:6166 to 0x0:6241 Lustre: lustre-OST0004: deleting orphan objects from 0x0:6098 to 0x0:6177 Lustre: lustre-OST0002: deleting orphan objects from 0x0:6759 to 0x0:6945 Lustre: lustre-OST0001: deleting orphan objects from 0x0:6133 to 0x0:6273 Lustre: lustre-OST0000: deleting orphan objects from 0x0:5961 to 0x0:6145 Lustre: lustre-OST0005: deleting orphan objects from 0x0:6068 to 0x0:6145 Lustre: lustre-OST0006: deleting orphan objects from 0x0:6413 to 0x0:6593 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 9 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 9 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=9680 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=9680 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670967134/real 1670967134] req@000000002c0c3d68 x1752124499890560/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670967141 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 
16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 5 previous similar messages Lustre: lustre-OST0001: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0005: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0004: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Skipped 1 previous similar message Lustre: lustre-OST0000: Export 00000000e9d71d58 already connecting from 10.240.23.149@tcp Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670967157/real 1670967168] req@00000000ba54c862 x1752124499893568/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670967203 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 6 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x1cd08a9d1032c7de to 0xdd76e30dbe9412d8 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0000: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670967134/real 1670967134] req@00000000f34c6d2d x1752124499890624/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670967180 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages Lustre: lustre-OST0000: deleting orphan objects from 0x0:6766 to 0x0:6881 Lustre: lustre-OST0001: deleting orphan objects from 0x0:6920 to 0x0:7009 Lustre: lustre-OST0005: deleting orphan objects from 0x0:6764 to 0x0:6881 Lustre: lustre-OST0004: deleting orphan objects from 0x0:6797 to 0x0:6913 Lustre: lustre-OST0002: deleting orphan objects from 0x0:7558 to 0x0:7681 Lustre: lustre-OST0006: deleting orphan objects from 0x0:7233 to 0x0:7329 Lustre: lustre-OST0003: deleting orphan objects from 0x0:6851 to 0x0:6913 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670967140/real 1670967140] req@00000000d763aa97 x1752124499891136/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 
to 1 dl 1670967186 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 10 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 10 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=10898 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=10898 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968325/real 1670968325] req@000000001ac96f64 x1752124500429312/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670968332 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 27 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968325/real 1670968325] req@0000000015518c45 x1752124500429440/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670968333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968325/real 1670968325] 
req@0000000051ff97cd x1752124500429376/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670968333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 6 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968330/real 1670968330] req@000000008da5aff1 x1752124500429952/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670968338 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xdd76e30dbe9412d8 to 0x7bf8ea38d6048ee1 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0003: deleting orphan objects from 0x0:7739 to 0x0:7777 Lustre: lustre-OST0002: deleting orphan objects from 0x0:8565 to 0x0:8705 Lustre: lustre-OST0001: deleting orphan objects from 0x0:7848 to 0x0:7937 Lustre: lustre-OST0000: deleting orphan objects from 0x0:7761 to 0x0:7841 Lustre: lustre-OST0006: deleting orphan objects from 0x0:8336 to 0x0:8417 Lustre: lustre-OST0005: deleting orphan objects from 0x0:7862 to 0x0:7969 Lustre: lustre-OST0004: deleting orphan objects from 0x0:7898 to 0x0:8001 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 11 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 11 times, and counting... 
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: lustre-OST0003: Export 0000000042a16b27 already connecting from 10.240.23.192@tcp Lustre: lustre-OST0003: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: lustre-OST0003: Export 00000000ece6f824 already connecting from 10.240.23.192@tcp Lustre: lustre-OST0003: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=12075 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=12075 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969523/real 1670969523] req@00000000f12ca16d x1752124500966912/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670969530 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969523/real 1670969523] req@00000000725694ff x1752124500967360/t0(0) o400->lustre-MDT0000-lwp-OST0006@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670969531 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery 
to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 2 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969528/real 1670969528] req@00000000da258bd4 x1752124500967872/t0(0) o400->lustre-MDT0000-lwp-OST0006@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670969536 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969530/real 1670969530] req@00000000ca109534 x1752124500968064/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670969538 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0004: Export 000000007d9dcb9e already connecting from 10.240.23.149@tcp Lustre: Skipped 1 previous similar message Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x7bf8ea38d6048ee1 to 0xcfa98585f718d913 Lustre: lustre-OST0004: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID Lustre: Skipped 1 previous similar message Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0000: deleting orphan objects from 0x0:8783 to 0x0:8865 Lustre: lustre-OST0004: deleting orphan objects from 0x0:8881 to 0x0:8961 Lustre: lustre-OST0003: deleting orphan objects from 0x0:8495 to 0x0:8577 Lustre: lustre-OST0002: deleting orphan objects from 0x0:9651 to 0x0:9729 Lustre: lustre-OST0001: deleting orphan objects from 0x0:8790 to 0x0:8833 Lustre: lustre-OST0005: deleting orphan objects from 0x0:8799 to 0x0:8865 Lustre: lustre-OST0006: deleting orphan objects from 0x0:9300 to 0x0:9377 Lustre: lustre-MDT0000-lwp-OST0005: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 12 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 12 times, and counting... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=13287 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=13287 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670970730/real 1670970730] req@000000000be1824f x1752124501454208/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670970737 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 11 previous similar messages Lustre: Skipped 10 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670970735/real 1670970735] req@000000000ac2f95f x1752124501454784/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670970742 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck 
all 4 Lustre: lustre-OST0006: deleting orphan objects from 0x0:10249 to 0x0:10337 Lustre: lustre-OST0005: deleting orphan objects from 0x0:9711 to 0x0:9761 Lustre: lustre-OST0004: deleting orphan objects from 0x0:9770 to 0x0:9825 Lustre: lustre-OST0003: deleting orphan objects from 0x0:9325 to 0x0:9377 Lustre: lustre-OST0002: deleting orphan objects from 0x0:10549 to 0x0:10625 Lustre: lustre-OST0001: deleting orphan objects from 0x0:9614 to 0x0:9793 Lustre: lustre-OST0000: deleting orphan objects from 0x0:9668 to 0x0:9761 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 6 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xcfa98585f718d913 to 0x47d893cda4104e2b Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 13 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 13 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=14480 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=14480 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670971922/real 1670971922] req@000000006d76acd4 x1752124501990784/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670971929 ref 1 fl 
Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670971940/real 1670971959] req@00000000e33304d6 x1752124501993536/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971986 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0004: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 2 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670971945/real 1670971963] req@0000000003fa5e40 x1752124501994304/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971991 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670971922/real 1670971922] req@000000001fee5a31 x1752124501990848/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971968 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670971945/real 1670971972] req@00000000085fcee2 x1752124501994240/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971991 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 19 previous similar messages Lustre: lustre-MDT0000-lwp-OST0001: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x47d893cda4104e2b to 0x613ab883dcd550ef Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: lustre-OST0001: deleting orphan objects from 0x0:10766 to 0x0:10913 Lustre: lustre-OST0000: deleting orphan objects from 0x0:10727 to 0x0:10913 Lustre: lustre-OST0006: deleting orphan objects from 0x0:11315 to 0x0:11457 Lustre: lustre-OST0004: deleting orphan objects from 0x0:10792 to 0x0:10945 Lustre: lustre-OST0003: deleting orphan objects from 0x0:10327 to 0x0:10401 Lustre: lustre-OST0002: deleting orphan objects from 0x0:11588 to 0x0:11745 Lustre: lustre-OST0005: deleting orphan objects from 0x0:10727 to 0x0:10881 Lustre: DEBUG 
MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 14 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 14 times, and counting... Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670971934/real 1670971934] req@000000006cb6e88b x1752124501992512/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971980 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 31 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=15683 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=15683 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670973124/real 1670973124] req@0000000015a5718d x1752124502632832/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670973131 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Skipped 2 previous similar messages Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Skipped 1 previous similar message Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670973142/real 1670973158] req@00000000d3667447 x1752124502635456/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670973188 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 3 previous similar messages Lustre: lustre-OST0003: Export 0000000098559874 already connecting from 10.240.23.150@tcp Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0003: Received new MDS connection from 10.240.23.150@tcp, remove former export from same NID Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670973147/real 1670973167] req@000000004ecfde8b x1752124502635840/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670973193 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 15 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 15 times, and counting... 
Lustre: lustre-MDT0000-lwp-OST0005: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 7 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x613ab883dcd550ef to 0x73949b8b1e7497e6 Lustre: lustre-OST0000: deleting orphan objects from 0x0:12039 to 0x0:12225 Lustre: lustre-OST0003: deleting orphan objects from 0x0:11439 to 0x0:11521 Lustre: lustre-OST0001: deleting orphan objects from 0x0:12065 to 0x0:12097 Lustre: lustre-OST0005: deleting orphan objects from 0x0:11988 to 0x0:12065 Lustre: lustre-OST0006: deleting orphan objects from 0x0:12547 to 0x0:12641 Lustre: lustre-OST0004: deleting orphan objects from 0x0:12017 to 0x0:12129 Lustre: lustre-OST0002: deleting orphan objects from 0x0:12958 to 0x0:13185 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670973129/real 1670973129] req@000000005911a6a9 x1752124502633472/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670973175 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 27 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=16874 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=16874 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670974326/real 1670974326] req@000000002d4ed59b x1752124503017152/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670974333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 
job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 27 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 6 previous similar messages Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670974331/real 1670974331] req@000000002648476d x1752124503017728/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670974339 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 4 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0004: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x73949b8b1e7497e6 to 0x71520c8d34a640c9 Lustre: lustre-OST0002: deleting orphan objects from 0x0:13756 to 0x0:13825 Lustre: lustre-OST0005: deleting orphan objects from 0x0:12625 to 0x0:12705 Lustre: lustre-OST0000: deleting orphan objects from 0x0:12837 to 0x0:13025 Lustre: lustre-OST0001: deleting orphan objects from 0x0:12657 to 0x0:12737 Lustre: lustre-OST0003: deleting orphan objects from 0x0:12136 to 0x0:12321 Lustre: lustre-OST0004: deleting orphan objects from 0x0:12713 to 0x0:12801 Lustre: lustre-OST0006: deleting orphan objects from 0x0:13211 to 0x0:13281 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 16 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 16 times, and counting... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=18078 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=18078 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975528/real 1670975528] req@00000000ff299ae7 x1752124503555072/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670975535 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670975551/real 1670975566] req@00000000343cd71e x1752124503558080/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975595 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery 
to complete Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 6 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x71520c8d34a640c9 to 0xc7e5116db274ea8e Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 8 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 17 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 17 times, and counting... Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975528/real 1670975528] req@0000000011efd632 x1752124503555136/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975572 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975533/real 1670975533] req@000000009aa7600f x1752124503555712/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975577 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages Lustre: lustre-OST0003: deleting orphan objects from 0x0:13284 to 0x0:13409 Lustre: lustre-OST0002: deleting orphan objects from 0x0:14736 to 0x0:14849 Lustre: lustre-OST0004: deleting orphan objects from 0x0:13716 to 0x0:13825 Lustre: lustre-OST0001: deleting orphan objects from 0x0:13646 to 0x0:13761 Lustre: lustre-OST0005: deleting orphan objects from 0x0:13663 to 0x0:13729 Lustre: lustre-OST0006: deleting orphan objects from 0x0:14190 to 0x0:14305 Lustre: lustre-OST0000: deleting orphan objects from 0x0:13937 to 0x0:14049 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975541/real 1670975541] req@000000006e4b0e09 x1752124503556672/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975585 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975541/real 1670975541] req@00000000898b2c84 x1752124503556608/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975585 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=19277 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=19277 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete 
\*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670976733/real 1670976733] req@000000003b58e812 x1752124503992384/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670976740 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670976757/real 1670976771] req@00000000c84ce29b x1752124503995968/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670976801 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 2 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xc7e5116db274ea8e to 0x2b7f3f160f5c0574 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: 
16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670976733/real 1670976733] req@0000000049adc9e8 x1752124503992448/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670976777 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 18 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 18 times, and counting... Lustre: lustre-OST0000: deleting orphan objects from 0x0:14776 to 0x0:14913 Lustre: lustre-OST0001: deleting orphan objects from 0x0:14505 to 0x0:14689 Lustre: lustre-OST0002: deleting orphan objects from 0x0:15587 to 0x0:15649 Lustre: lustre-OST0003: deleting orphan objects from 0x0:14147 to 0x0:14337 Lustre: lustre-OST0006: deleting orphan objects from 0x0:15023 to 0x0:15105 Lustre: lustre-OST0004: deleting orphan objects from 0x0:14504 to 0x0:14625 Lustre: lustre-OST0005: deleting orphan objects from 0x0:14463 to 0x0:14529 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670976745/real 1670976745] req@000000008f3c2279 x1752124503993920/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670976789 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=20486 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=20486 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting 
failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670977934/real 1670977934] req@000000008e512e0c x1752124504463424/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670977941 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670977952/real 1670977965] req@00000000598f46e2 x1752124504465984/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670977998 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: lustre-MDT0000-lwp-OST0005: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 5 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x2b7f3f160f5c0574 to 0xcb7fda250d948c28 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 19 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 19 times, and counting... 
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670977934/real 1670977934] req@000000001c9c8956 x1752124504463552/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670977980 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 30 previous similar messages Lustre: lustre-OST0006: deleting orphan objects from 0x0:15805 to 0x0:15873 Lustre: lustre-OST0005: deleting orphan objects from 0x0:15294 to 0x0:15361 Lustre: lustre-OST0004: deleting orphan objects from 0x0:15374 to 0x0:15457 Lustre: lustre-OST0003: deleting orphan objects from 0x0:15114 to 0x0:15297 Lustre: lustre-OST0002: deleting orphan objects from 0x0:16442 to 0x0:16577 Lustre: lustre-OST0001: deleting orphan objects from 0x0:15553 to 0x0:15649 Lustre: lustre-OST0000: deleting orphan objects from 0x0:15710 to 0x0:15841 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670977946/real 1670977946] req@0000000049d48248 x1752124504464960/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670977992 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 19 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=21681 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=21681 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670979128/real 1670979128] 
req@000000000608e7be x1752124505000640/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670979135 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0001: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0006: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Skipped 1 previous similar message Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670979146/real 1670979163] req@000000008658bda6 x1752124505003200/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670979190 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670979151/real 1670979167] req@00000000822c813a x1752124505003712/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670979195 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 20 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 20 times, and counting... 
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670979133/real 1670979133] req@0000000065f1bd72 x1752124505001216/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670979177 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xcb7fda250d948c28 to 0xf3e6320145a82e53 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 7 previous similar messages Lustre: lustre-OST0004: deleting orphan objects from 0x0:16346 to 0x0:16385 Lustre: lustre-OST0003: deleting orphan objects from 0x0:16149 to 0x0:16225 Lustre: lustre-OST0002: deleting orphan objects from 0x0:17420 to 0x0:17505 Lustre: lustre-OST0001: deleting orphan objects from 0x0:16505 to 0x0:16577 Lustre: lustre-OST0000: deleting orphan objects from 0x0:16757 to 0x0:16833 Lustre: lustre-OST0006: deleting orphan objects from 0x0:16958 to 0x0:17313 Lustre: lustre-OST0005: deleting orphan objects from 0x0:16397 to 0x0:16481 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=22882 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=22882 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670980335/real 1670980335] req@000000007b1f7887 x1752124505507200/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670980342 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' 
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 54 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 7 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0004: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Skipped 1 previous similar message Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670980354/real 1670980370] req@00000000debfcc42 x1752124505510208/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670980398 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 9 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xf3e6320145a82e53 to 0x51496d62371084ed Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 5 previous similar messages Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670980335/real 1670980335] req@000000009743021d x1752124505507264/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670980379 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: lustre-OST0006: deleting orphan objects from 0x0:18075 to 0x0:18113 Lustre: lustre-OST0005: deleting orphan objects from 0x0:17273 to 0x0:17441 Lustre: lustre-OST0004: deleting orphan objects from 0x0:17137 to 0x0:17281 Lustre: lustre-OST0003: deleting orphan objects from 0x0:17108 to 0x0:17281 Lustre: lustre-OST0002: deleting orphan objects from 0x0:18328 to 0x0:18465 Lustre: lustre-OST0001: deleting orphan objects from 0x0:17555 to 0x0:17633 Lustre: lustre-OST0000: deleting orphan objects from 0x0:17522 to 0x0:17569 Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670980347/real 1670980347] req@0000000006c0d807 x1752124505508736/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670980391 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 41 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG 
MARKER: /usr/sbin/lctl mark mds1 failed over 21 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 21 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=24105 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=24105 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670981528/real 1670981528] req@00000000bb3ac64d x1752124505942528/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670981535 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0006: Export 0000000031b1e33a already connecting from 10.240.23.149@tcp Lustre: lustre-OST0006: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Skipped 1 previous similar message Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670981551/real 1670981567] 
req@000000004e5f1da8 x1752124505945664/t0(0) o400->lustre-MDT0000-lwp-OST0002@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670981595 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 3 previous similar messages Lustre: lustre-OST0006: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670981552/real 1670981572] req@00000000628678a3 x1752124505946112/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670981596 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 22 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 22 times, and counting... Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670981552/real 1670981583] req@000000008d6c1aa2 x1752124505946048/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670981596 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 23 previous similar messages Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x51496d62371084ed to 0xf280a8013c3b1c17 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0000: deleting orphan objects from 0x0:18430 to 0x0:18785 Lustre: lustre-OST0001: deleting orphan objects from 0x0:18400 to 0x0:18529 Lustre: lustre-OST0002: deleting orphan objects from 0x0:19375 to 0x0:19553 Lustre: lustre-OST0006: deleting orphan objects from 0x0:18820 to 0x0:18913 Lustre: lustre-OST0005: deleting orphan objects from 0x0:18192 to 0x0:18337 Lustre: lustre-OST0003: deleting orphan objects from 0x0:18059 to 0x0:18113 Lustre: lustre-OST0004: deleting orphan objects from 0x0:18047 to 0x0:18177 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=25286 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=25286 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670982745/real 1670982745] req@00000000e07b6f6e x1752124506482624/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670982752 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 37 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 4 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670982763/real 1670982782] req@00000000a3489656 x1752124506485248/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670982809 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 6 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670982768/real 1670982786] req@00000000868e7051 x1752124506485696/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670982814 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670982768/real 1670982795] 
req@00000000fbf2880a x1752124506485632/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670982814 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xf280a8013c3b1c17 to 0xa30b6b9752b5a750 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 7 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 23 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 23 times, and counting... Lustre: lustre-OST0000: deleting orphan objects from 0x0:19723 to 0x0:19841 Lustre: lustre-OST0001: deleting orphan objects from 0x0:19392 to 0x0:19457 Lustre: lustre-OST0002: deleting orphan objects from 0x0:20521 to 0x0:20641 Lustre: lustre-OST0004: deleting orphan objects from 0x0:19019 to 0x0:19073 Lustre: lustre-OST0003: deleting orphan objects from 0x0:18949 to 0x0:19009 Lustre: lustre-OST0005: deleting orphan objects from 0x0:19178 to 0x0:19233 Lustre: lustre-OST0006: deleting orphan objects from 0x0:19765 to 0x0:19809 Lustre: lustre-OST0006: Export 00000000a3f31d99 already connecting from 10.240.23.192@tcp Lustre: lustre-OST0006: Export 00000000a3f31d99 already connecting from 10.240.23.192@tcp Lustre: lustre-OST0006: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=26502 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=26502 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 
16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670983931/real 1670983931] req@00000000369cbb00 x1752124506967424/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670983938 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 41 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 3 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670983936/real 1670983936] req@00000000063d8a0d x1752124506968000/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670983944 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp Lustre: Skipped 8 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0002: Export 00000000c5a99897 already connecting from 10.240.23.149@tcp Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xa30b6b9752b5a750 to 0xccf1f9ba37039d5f Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp) Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0002: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 24 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 24 times, and counting... 
Lustre: lustre-OST0006: deleting orphan objects from 0x0:20707 to 0x0:20769 Lustre: lustre-OST0005: deleting orphan objects from 0x0:20110 to 0x0:20193 Lustre: lustre-OST0004: deleting orphan objects from 0x0:19929 to 0x0:20097 Lustre: lustre-OST0003: deleting orphan objects from 0x0:19751 to 0x0:19809 Lustre: lustre-OST0002: deleting orphan objects from 0x0:21531 to 0x0:21601 Lustre: lustre-OST0001: deleting orphan objects from 0x0:20303 to 0x0:20353 Lustre: lustre-OST0000: deleting orphan objects from 0x0:20711 to 0x0:20801 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=27686 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=27686 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985215/real 1670985215] req@00000000f41150b5 x1752124507410624/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670985222 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp Lustre: Skipped 4 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-OST0003: Export 00000000a6a55e7f already connecting from 10.240.23.150@tcp Lustre: lustre-OST0003: denying 
duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114 Lustre: lustre-OST0003: Received new MDS connection from 10.240.23.150@tcp, remove former export from same NID Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670985238/real 1670985253] req@00000000e757c834 x1752124507413184/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985284 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Skipped 6 previous similar messages Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xccf1f9ba37039d5f to 0xe4b6b0058cd8841 Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp) Lustre: Skipped 7 previous similar messages Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985215/real 1670985215] req@00000000226bcee0 x1752124507410816/t0(0) o400->lustre-MDT0000-lwp-OST0002@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985261 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985215/real 1670985215] req@00000000166701ca x1752124507410752/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985261 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 12 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 12 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 25 times, and counting... Lustre: DEBUG MARKER: mds1 failed over 25 times, and counting... 
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985220/real 1670985220] req@0000000076d79b33 x1752124507411456/t0(0) o400->lustre-MDT0000-lwp-OST0004@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985266 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985220/real 1670985220] req@0000000085e5ac9f x1752124507411264/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985266 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-OST0005: deleting orphan objects from 0x0:20860 to 0x0:20897 Lustre: lustre-OST0004: deleting orphan objects from 0x0:20763 to 0x0:20833 Lustre: lustre-OST0003: deleting orphan objects from 0x0:20475 to 0x0:20545 Lustre: lustre-OST0000: deleting orphan objects from 0x0:21462 to 0x0:21505 Lustre: lustre-OST0001: deleting orphan objects from 0x0:21019 to 0x0:21089 Lustre: lustre-OST0006: deleting orphan objects from 0x0:21437 to 0x0:21505 Lustre: lustre-OST0002: deleting orphan objects from 0x0:22261 to 0x0:22305 Lustre: lustre-OST0003: Client lustre-MDT0000-mdtlov_UUID (at 10.240.23.150@tcp) reconnecting | Link to test |
sanity-benchmark test iozone: iozone | watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:33] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon joydev pcspkr ext4 ata_generic mbcache jbd2 ata_piix libata virtio_net crc32c_intel serio_raw virtio_blk net_failover failover CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.3.1.el8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 f0 6e 22 00 9d 30 c0 e9 e8 6e 22 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 d1 6e 22 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffac218074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000002ddd4867 RBX: ffffcca240b77500 RCX: 0000000000000200 RDX: 0000000000000000 RSI: ffff96526ddd4000 RDI: ffff965269da0000 RBP: 000055a1e43a0000 R08: ffffffffffffffff R09: 0000000000000011 R10: 0000000000000007 R11: 00000000ffffffee R12: ffff965268f56d00 R13: ffff9652a0e55000 R14: ffffcca240a76800 R15: ffff96524363bd98 FS: 0000000000000000(0000) GS:ffff9652ffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056216f8d4008 CR3: 000000005f410006 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x8e4/0x1010 khugepaged+0xed0/0x11e0 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x10b/0x130 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-425.3.1.el8.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x41/0x60 panic+0xe7/0x2ac ? __switch_to_asm+0x51/0x80 watchdog_timer_fn.cold.10+0x85/0x9e ? watchdog+0x30/0x30 __hrtimer_run_queues+0x101/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1705984kB available, using 5448816kB file size Lustre: DEBUG MARKER: min OST has 1705984kB available, using 5448816kB file size | Link to test |
conf-sanity test 21b: start ost before mds, stop mds first | watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:33] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc dm_mod intel_rapl_msr i2c_piix4 virtio_balloon joydev intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw virtio_blk net_failover failover CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffff9bc24074bd20 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000003d173865 RBX: ffffc5bc80f45cc0 RCX: 0000000000000200 RDX: 0000000000000000 RSI: ffff8a6cfd173000 RDI: ffff8a6cfe076000 RBP: 0000562787276000 R08: ffffffffffffffe7 R09: 0000000000000088 R10: ffffffffffffffff R11: 00000000fffffffa R12: ffff8a6cedd953b0 R13: ffff8a6d22470000 R14: ffffc5bc80f81d80 R15: ffff8a6cef502488 FS: 0000000000000000(0000) GS:ffff8a6d7fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5b96935f10 CR3: 0000000060e10002 CR4: 00000000000606e0 Call Trace: collapse_huge_page+0x914/0xff0 khugepaged+0xecc/0x11d0 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x116/0x130 ? kthread_flush_work_fn+0x10/0x10 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x80 panic+0xe7/0x2a9 ? __switch_to_asm+0x51/0x70 watchdog_timer_fn.cold.9+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-39vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: onyx-39vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]\*.ost_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]\*.ost_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds2 Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs; Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt2/mdt2 Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds2; mount -t lustre -o localrecov lustre-mdt2/mdt2 /mnt/lustre-mds2 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt2/mdt2 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt2/mdt2 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds4 Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs; Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt4/mdt4 Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds4; mount -t lustre -o localrecov lustre-mdt4/mdt4 /mnt/lustre-mds4 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: 
onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4 Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt4/mdt4 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}' Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt4/mdt4 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 40 Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_min Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40 Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40 Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 40 Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_min Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40 Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40 Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40 | Link to test |
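The markers in this row are mostly wait_import_state_mount checks against osc/mdc *_server_uuid parameters. Outside the test framework, roughly the same check can be done by reading the import state with lctl; the target name below is taken from the markers, and treating the import parameter's "state:" line as the thing to poll is an assumption of this sketch rather than what the helper literally does.

```sh
# report the MDC import state (expected to reach FULL once recovery completes)
lctl get_param mdc.lustre-MDT0000-mdc-*.import | grep 'state:'

# crude stand-in for wait_import_state_mount: poll until FULL shows up
until lctl get_param mdc.lustre-MDT0000-mdc-*.import | grep -qw FULL; do
    sleep 1
done
```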
performance-sanity test 8: getattr large files | watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [khugepaged:33] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel virtio_balloon pcspkr joydev i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_blk [last unloaded: dm_flakey] CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0000:ffffb5964074bd20 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 000000003d187827 RBX: fffffaef40f461c0 RCX: 0000000000000200 RDX: 0000000000000000 RSI: ffff8a3a3d187000 RDI: ffff8a3a7d9a2000 RBP: 00007fe2b23a2000 R08: ffffffffffffffe9 R09: 0000000000000088 R10: ffffffffffffffff R11: 0000000000000000 R12: ffff8a3a38f88d10 R13: ffff8a3abfd65f00 R14: fffffaef41f66880 R15: ffff8a3a5771d570 FS: 0000000000000000(0000) GS:ffff8a3abfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe2b325f000 CR3: 000000000ca10006 CR4: 00000000001706e0 Call Trace: collapse_huge_page+0x914/0xff0 khugepaged+0xecc/0x11d0 ? finish_wait+0x80/0x80 ? collapse_pte_mapped_thp+0x430/0x430 kthread+0x116/0x130 ? kthread_flush_work_fn+0x10/0x10 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x80 panic+0xe7/0x2a9 ? __switch_to_asm+0x51/0x70 watchdog_timer_fn.cold.9+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh ====== Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh ====== Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm15.onyx.whamcloud.com: executing check_config_client \/mnt\/lustre Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm16.onyx.whamcloud.com: executing check_config_client \/mnt\/lustre Lustre: DEBUG MARKER: onyx-73vm15.onyx.whamcloud.com: executing check_config_client /mnt/lustre Lustre: DEBUG MARKER: onyx-73vm16.onyx.whamcloud.com: executing check_config_client /mnt/lustre Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds2' ' /proc/mounts); Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds4' ' /proc/mounts); Lustre: DEBUG MARKER: e2label /dev/mapper/mds2_flakey 2>/dev/null Lustre: DEBUG MARKER: cat /proc/mounts Lustre: DEBUG MARKER: e2label /dev/mapper/mds4_flakey 2>/dev/null Lustre: DEBUG MARKER: cat /proc/mounts Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20 Lustre: DEBUG MARKER: Using TIMEOUT=20 Lustre: DEBUG MARKER: [ -f /sys/module/mgc/parameters/mgc_requeue_timeout_min ] && echo 1 > /sys/module/mgc/parameters/mgc_requeue_timeout_min; exit 0 Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/ Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm16.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm17.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-73vm16.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm18.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-73vm17.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-73vm18.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl set_param osd-ldiskfs.track_declares_assert=1 || true Lustre: DEBUG MARKER: params=$(/usr/sbin/lctl get_param mdt.*.enable_remote_dir_gid); && param= || && param="$params"; Lustre: DEBUG MARKER: params=$(/usr/sbin/lctl get_param mdt.*.enable_remote_dir_gid); && 
param= || && param="$params"; Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.*.enable_remote_dir_gid=-1 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh Test preparation: creating 598434 files. Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh Test preparation: creating 598434 files. Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh ### 1 NODE STAT ### Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh ### 2 NODES STAT ### Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774709/real 1661774709] req@000000007a3774d9 x1742478279705984/t0(0) o13->lustre-OST0007-osc-MDT0003@10.240.26.52@tcp:7/4 lens 224/368 e 0 to 1 dl 1661774719 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-7-3.0' Lustre: lustre-OST0007-osc-MDT0003: Connection to lustre-OST0007 (at 10.240.26.52@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774715/real 1661774715] req@0000000084e9e95a x1742478279706176/t0(0) o13->lustre-OST0001-osc-MDT0003@10.240.26.52@tcp:7/4 lens 224/368 e 0 to 1 dl 1661774725 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-1-3.0' Lustre: lustre-OST0001-osc-MDT0003: Connection to lustre-OST0001 (at 10.240.26.52@tcp) was lost; in progress operations using this service will wait for recovery to complete LNetError: 10482:0:(socklnd.c:1531:ksocknal_destroy_conn()) Incomplete receive of lnet header from 12345-10.240.26.52@tcp, ip 10.240.26.52:1023, with error, protocol: 3.x. Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774723/real 1661774723] req@00000000a7c123c5 x1742478279706304/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.240.26.53@tcp:24/4 lens 224/224 e 0 to 1 dl 1661774733 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.240.26.53@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 2 previous similar messages LustreError: 166-1: MGC10.240.26.53@tcp: Connection to MGS (at 10.240.26.53@tcp) was lost; in progress operations using this service will fail Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774726/real 1661774726] req@000000000f212a47 x1742478279707968/t0(0) o400->lustre-MDT0002-osp-MDT0001@10.240.26.53@tcp:24/4 lens 224/224 e 0 to 1 dl 1661774736 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 19 previous similar messages Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774731/real 1661774731] req@00000000fbf46214 x1742478279709376/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.240.26.53@tcp:24/4 lens 224/224 e 0 to 1 dl 1661774741 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 15 previous similar messages LNetError: 10479:0:(socklnd_cb.c:1779:ksocknal_recv_hello()) Error -104 reading HELLO from 10.240.26.52 LNetError: 11b-b: 
Connection to 10.240.26.52@tcp at host 10.240.26.52:7988 was reset: is it running a compatible version of Lustre and is 10.240.26.52@tcp one of its NIDs? Lustre: lustre-MDT0001: Received new MDS connection from 10.240.26.53@tcp, keep former export from same NID Lustre: Skipped 1 previous similar message Lustre: lustre-MDT0001: Client lustre-MDT0001-lwp-OST0000_UUID (at 10.240.26.52@tcp) reconnecting Lustre: lustre-MDT0003: Client lustre-MDT0003-lwp-OST0000_UUID (at 10.240.26.52@tcp) reconnecting Lustre: lustre-MDT0001: Client lustre-MDT0001-lwp-OST0001_UUID (at 10.240.26.52@tcp) reconnecting Lustre: Skipped 1 previous similar message Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774746/real 1661774746] req@0000000091fa10c7 x1742478279715200/t0(0) o400->lustre-MDT0003-osp-MDT0001@0@lo:24/4 lens 224/224 e 0 to 1 dl 1661774757 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 28 previous similar messages Lustre: lustre-MDT0003-osp-MDT0001: Connection to lustre-MDT0003 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 15 previous similar messages Lustre: 25229:0:(service.c:2156:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 8s req@00000000da6ac881 x1742478279715904/t0(0) o41->lustre-MDT0001-mdtlov_UUID@0@lo:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'osp-pre-3-1.0' Lustre: mdt_out: This server is not able to keep up with request traffic (cpu-bound). Lustre: 25229:0:(service.c:1612:ptlrpc_at_check_timed()) earlyQ=4 reqQ=0 recA=2, svcEst=2, delay=8216ms Lustre: 25229:0:(service.c:1378:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)? 
req@000000009109ac18 x1742478279715840/t0(0) o41->lustre-MDT0003-mdtlov_UUID@0@lo:127/0 lens 224/0 e 0 to 0 dl 1661774757 ref 2 fl New:/0/ffffffff rc 0/-1 job:'osp-pre-1-3.0' Lustre: lustre-MDT0000-osp-MDT0001: Connection restored to 10.240.26.53@tcp (at 10.240.26.53@tcp) Lustre: lustre-MDT0003: Client lustre-MDT0003-lwp-OST0002_UUID (at 10.240.26.52@tcp) reconnecting Lustre: Skipped 1 previous similar message Lustre: MGC10.240.26.53@tcp: Connection restored to 10.240.26.53@tcp (at 10.240.26.53@tcp) Lustre: lustre-MDT0001: Received new MDS connection from 10.240.26.53@tcp, keep former export from same NID Lustre: lustre-MDT0002-osp-MDT0003: Connection restored to 10.240.26.53@tcp (at 10.240.26.53@tcp) Lustre: Skipped 14 previous similar messages LustreError: 51051:0:(service.c:2289:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.240.26.53@tcp: deadline 20/1s ago req@000000008c60fa42 x1742478259713600/t0(0) o38->lustre-MDT0002-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 51051:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/2s); client may timeout req@000000008c60fa42 x1742478259713600/t0(0) o38->lustre-MDT0002-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-lwp-MDT0001: Connection to lustre-MDT0000 (at 10.240.26.53@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message LustreError: 51051:0:(service.c:2289:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.240.26.53@tcp: deadline 20/4s ago req@00000000886cba67 x1742478259721728/t0(0) o38->lustre-MDT0000-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 51051:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/4s); client may timeout req@00000000886cba67 x1742478259721728/t0(0) o38->lustre-MDT0000-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0003: Received new MDS connection from 0@lo, keep former export from same NID Lustre: 25225:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/4s); client may timeout req@0000000027a924f0 x1742478057310272/t0(0) o38->lustre-MDT0003-lwp-OST0002_UUID@10.240.26.52@tcp:0/0 lens 520/416 e 0 to 0 dl 1661774771 ref 1 fl Complete:H/0/0 rc 0/0 job:'kworker/u4:4.0' LustreError: 28602:0:(service.c:2289:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.240.26.52@tcp: deadline 20/4s ago req@0000000073732b89 x1742478057310336/t0(0) o38->lustre-MDT0001-lwp-OST0003_UUID@10.240.26.52@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774771 ref 1 fl Interpret:H/0/ffffffff rc 0/-1 job:'kworker/u4:4.0' Lustre: 25225:0:(service.c:2327:ptlrpc_server_handle_request()) Skipped 8 previous similar messages LustreError: 28602:0:(service.c:2289:ptlrpc_server_handle_request()) Skipped 7 previous similar messages Lustre: lustre-MDT0001: Client c712ec83-3787-4382-8825-3be32b653b63 (at 10.240.26.50@tcp) reconnecting Lustre: lustre-MDT0001: Received new MDS connection from 0@lo, keep former export from same NID Lustre: lustre-MDT0000-lwp-MDT0001: Connection restored to 10.240.26.53@tcp (at 
10.240.26.53@tcp) Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774726/real 1661774726] req@000000001ef74094 x1742478279708608/t0(0) o400->lustre-MDT0000-lwp-MDT0001@10.240.26.53@tcp:12/10 lens 224/224 e 0 to 1 dl 1661774773 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: Skipped 4 previous similar messages Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Lustre: lustre-MDT0001: Export 00000000ed4f2d78 already connecting from 10.240.26.53@tcp Lustre: lustre-MDT0001: Export 000000001c79cc3f already connecting from 10.240.26.52@tcp Lustre: Skipped 4 previous similar messages Lustre: lustre-MDT0003: Export 000000000e911a8c already connecting from 10.240.26.51@tcp Lustre: Skipped 2 previous similar messages Lustre: 25226:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/8s); client may timeout req@000000003272a9da x1742478057309952/t0(0) o38->lustre-MDT0003-lwp-OST0000_UUID@10.240.26.52@tcp:0/0 lens 520/416 e 0 to 0 dl 1661774771 ref 1 fl Complete:H/0/0 rc 0/0 job:'kworker/u4:4.0' Lustre: 25226:0:(service.c:2327:ptlrpc_server_handle_request()) Skipped 7 previous similar messages Lustre: lustre-MDT0001: Received new MDS connection from 10.240.26.53@tcp, keep former export from same NID Lustre: Skipped 3 previous similar messages Lustre: lustre-MDT0003-osp-MDT0001: Connection restored to (at 0@lo) Lustre: Skipped 2 previous similar messages Lustre: lustre-MDT0001: Client lustre-MDT0001-lwp-OST0001_UUID (at 10.240.26.52@tcp) reconnecting Lustre: Skipped 22 previous similar messages | Link to test |
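This row ends with slow-request warnings ("Dropping timed-out request", "Consider increasing at_early_margin (5)?") while the framework reads at_min and at_max. For reference, those adaptive-timeout knobs can be inspected, and at_early_margin adjusted, with lctl as sketched below; the value is purely illustrative and not a recommendation from the report.

```sh
# current adaptive timeout settings on the node
lctl get_param at_min at_max at_early_margin

# illustrative bump of the early-reply margin the console message refers to
lctl set_param at_early_margin=10
```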
recovery-mds-scale test failover_mds: failover MDS | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [khugepaged:31] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover virtio_blk failover CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-240.22.1.el8_3.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:copy_page+0x7/0x10 Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 RSP: 0018:ffffb5c780733d48 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 RAX: 00000000346ee867 RBX: ffffe3ff808cdd40 RCX: 0000000000000200 RDX: 7fffffffcb911798 RSI: ffff9cf5346ee000 RDI: ffff9cf523375000 RBP: 0000556d77f75000 R08: ffffe3ff80e1d418 R09: ffff9cf5bffd0000 R10: 00000000000305c0 R11: ffffffffffffffe8 R12: ffffe3ff80d1bb80 R13: ffff9cf53f5e9ba8 R14: ffff9cf5bfd54740 R15: ffff9cf5933b9ae0 FS: 0000000000000000(0000) GS:ffff9cf5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000587000 CR3: 000000005400a006 CR4: 00000000000606f0 Call Trace: collapse_huge_page+0x6b6/0xf10 khugepaged+0xb5b/0x1150 ? finish_wait+0x80/0x80 ? collapse_huge_page+0xf10/0xf10 kthread+0x112/0x130 ? kthread_flush_work_fn+0x10/0x10 ret_from_fork+0x35/0x40 Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-240.22.1.el8_3.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x80 panic+0xe7/0x2a9 ? __switch_to_asm+0x51/0x70 watchdog_timer_fn.cold.8+0x85/0x9e ? 
watchdog+0x30/0x30 __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x220 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:copy_page+0x7/0x10 | Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bi Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on trevis-78vm3 Lustre: DEBUG MARKER: Started client load: dd on trevis-78vm3 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on trevis-78vm4 Lustre: DEBUG MARKER: Started client load: tar on trevis-78vm4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dbench on trevis-78vm5 Lustre: DEBUG MARKER: Started client load: dbench on trevis-78vm5 Lustre: DEBUG MARKER: cat /tmp/client-load.pid Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE 
mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641907486/real 1641907489] req@00000000c6d2f1fa x1721664603565184/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641907493 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xd9cf9bdad7eabcc7 to 0x2c0667baa92fef98 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000009fd8dce5 x1721664595309952/t4294967308(4294967308) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641907567 ref 2 fl Interpret:RPQU/4/0 rc 301/301 
job:'df.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 1 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting... LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000008a5c099e x1721664595310144/t4294967302(4294967302) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641907639 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641907466/real 1641907466] req@000000004175aae9 x1721664599615616/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641907473 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 11 previous similar messages Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641907461/real 1641907461] req@00000000207787dc x1721664599397696/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641907468 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=177 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=177 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908681/real 1641908699] req@00000000d19744ea x1721664843220928/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908725 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908687/real 1641908703] req@00000000f17e3dff x1721664843639808/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908731 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908692/real 1641908707] req@0000000047f02f32 x1721664843647168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908736 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908697/real 1641908714] req@000000002f64aea5 x1721664843979008/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908741 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000eb58fa83 x1721664813324544/t8590036363(8590036363) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641908834 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641908676/real 1641908676] req@0000000072114042 x1721664843109632/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908720 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 1 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 1 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1384 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1384 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641909877/real 1641909895] req@000000004ce8ca2e x1721665108794048/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641909921 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641909856/real 1641909856] req@0000000094fa6add x1721665108785472/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641909900 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: 26040:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641909860/real 1641909860] req@00000000f7b9087f x1721665108793024/t0(0) o41->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/368 e 0 to 1 dl 1641909904 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'lfs.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641909867/real 1641909867] req@00000000409db90a x1721665108793536/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641909911 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 2 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 2 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=2547 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=2547 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641911083/real 1641911099] req@000000003002e1f6 x1721665315873216/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641911127 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress 
operations using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641911098/real 1641911101] req@000000007e3b24ba x1721665318205952/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641911142 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641911072/real 1641911072] req@0000000037bacdd0 x1721665314697024/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641911116 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x2c0667baa92fef98 to 0xd9ce118587b743e9 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 2 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 2 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=3771 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=3771 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: 
mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641912293/real 1641912310] req@0000000085aa58ab x1721665542839360/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641912337 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641912308/real 1641912311] req@000000007652bc10 x1721665544825472/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641912352 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641912287/real 1641912287] req@00000000d95fc45d x1721665541615872/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641912331 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xd9ce118587b743e9 to 0xdd61fd8df9f947bf Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== 
Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 3 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 3 times, and counting... LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000ec1adff4 x1721665510059136/t17180064190(17180064190) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641912502 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=5033 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=5033 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913478/real 1641913494] req@0000000031b9d48f x1721665750135168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913522 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913483/real 1641913502] req@000000003688dcb8 x1721665750181056/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913527 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913489/real 1641913506] req@0000000034c73fab x1721665750842560/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913533 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913494/real 1641913510] req@000000000b555843 x1721665753307136/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913538 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000076b44c7 x1721665732147776/t21474933494(21474933494) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641913666 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641913458/real 1641913458] req@0000000019fe68fe x1721665748463488/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913502 ref 1 fl 
Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 4 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 4 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=6215 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=6215 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914683/real 1641914702] req@00000000d837a634 x1721666010161344/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914727 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914688/real 1641914706] req@000000004d2420f6 x1721666011103168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914732 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914693/real 1641914711] req@000000006b628bfe x1721666011110656/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914737 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914698/real 1641914715] req@000000000cbb3c5c x1721666011128448/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914742 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641914678/real 1641914678] req@00000000110bf663 x1721666010153344/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914722 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 5 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 5 times, and counting... 
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=7407 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=7407 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting 
failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641915888/real 1641915907] req@00000000f3f2ddae x1721666178773568/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915932 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641915893/real 1641915911] req@00000000b4dafc3f x1721666179247936/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915937 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641915899/real 1641915915] req@000000004c3d765e x1721666179761664/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915943 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641915873/real 1641915873] req@00000000bc8623bc x1721666178289024/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915917 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 6 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 6 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=8622 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=8622 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641917078/real 1641917094] req@00000000eb61a38f x1721666447615744/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917122 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641917083/real 1641917103] req@0000000092968078 x1721666447623168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917127 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641917088/real 1641917107] req@0000000088928f67 x1721666448321280/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917132 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641917068/real 1641917068] req@00000000ebbd8e4e x1721666447298496/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917112 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000033f8e376 x1721666414096960/t34359811852(34359811852) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641917266 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 7 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 7 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=9814 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=9814 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641918282/real 1641918299] req@0000000098236704 x1721666734036288/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918326 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641918288/real 1641918307] req@000000007efd3cef x1721666734301824/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918332 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641918298/real 1641918314] req@0000000087b6b5f1 x1721666736951424/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918342 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000080c5f413 x1721666707513600/t38654767736(38654767736) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641918457 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641918272/real 1641918272] req@00000000a9b732ee x1721666731912192/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918316 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 8 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 8 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=11009 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=11009 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641919494/real 1641919511] req@00000000d364b87a x1721667001712640/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919502 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress 
operations using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641919499/real 1641919515] req@000000000c58febb x1721667002386688/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919507 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641919504/real 1641919523] req@00000000de51ad0f x1721667003042112/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919512 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0xdd61fd8df9f947bf to 0xd5cb733245e0bbfb Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641919489/real 1641919489] req@0000000087b6b5f1 x1721667001704384/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919497 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 3 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 3 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=12212 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=12212 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920704/real 1641920722] req@00000000ab7a5f22 x1721667276393216/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641920711 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920709/real 1641920727] req@00000000d3db9bd3 x1721667277290048/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920755 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920714/real 1641920731] req@0000000082c3aa73 x1721667280550208/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920760 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920720/real 1641920739] req@000000002ef8c9f3 x1721667282582720/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920766 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641920699/real 1641920699] req@000000000d4d3db0 x1721667273177280/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920745 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing 
set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xd5cb733245e0bbfb to 0x61cff7c6f92a222f Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 4 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 4 times, and counting... Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=13470 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=13470 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641921868/real 1641921868] req@00000000c09f8d42 x1721667529674944/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641921875 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will 
fail Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message LustreError: 52548:0:(lmv_obd.c:1262:lmv_statfs()) lustre-MDT0000-mdc-ffff9cf5557a4800: can't stat MDS #0: rc = -11 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641921873/real 1641921873] req@000000008089a10b x1721667529686720/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641921881 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x61cff7c6f92a222f to 0x6a8dbfe4dffe5d86 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 5 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 5 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=14578 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=14578 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923110/real 1641923127] req@000000008089a10b x1721667826652736/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641923117 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923115/real 1641923134] req@00000000ce3e7492 x1721667828640192/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923161 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923120/real 1641923138] req@00000000674a618f x1721667830246208/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923166 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923125/real 1641923142] req@00000000faa952d1 x1721667831969024/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923171 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641923100/real 1641923100] req@000000007c32f4e6 x1721667824529664/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923146 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641923105/real 1641923105] req@000000005ce0674f x1721667825720000/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923151 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' 
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0x6a8dbfe4dffe5d86 to 0x28cd6ee3b8635303 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 6 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 6 times, and counting... LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000029ee2b09 x1721667783457216/t47244859380(47244859380) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641923337 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=15869 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=15869 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641924290/real 1641924306] req@00000000c0004f41 x1721668084521408/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924334 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641924295/real 1641924315] req@0000000096416ba7 x1721668086979712/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924339 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641924305/real 1641924323] req@0000000039ca71e7 x1721668089829184/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924349 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000001931ba9f x1721668060114368/t51539679322(51539679322) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641924478 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641924285/real 1641924285] req@000000009e7e0a87 x1721668081339968/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924329 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 9 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 9 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=17027 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=17027 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 60759:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641925464/real 1641925464] req@0000000096416ba7 x1721668427720064/t0(0) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/1152 e 0 to 1 dl 1641925472 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'rm.0' Lustre: 60759:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations 
using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641925471/real 1641925471] req@00000000688a388e x1721668427721024/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641925479 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x28cd6ee3b8635303 to 0x3cce1aafc4a35f99 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 7 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 7 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=18171 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=18171 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926709/real 1641926726] req@00000000a9f791cc x1721668705866688/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641926716 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926715/real 1641926730] req@000000002f460642 x1721668707760960/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926723 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926720/real 1641926738] req@0000000081437b8a x1721668709333184/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926728 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926725/real 1641926743] req@00000000e2da85cf x1721668710645568/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926733 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641926689/real 1641926689] req@00000000f41ba24b x1721668700215744/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926697 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 8 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted 
from MGS (at 10.240.42.144@tcp) after server handle changed from 0x3cce1aafc4a35f99 to 0x15ba538cc0056173 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000002e608d0f x1721668652652736/t38654782649(38654782649) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641926841 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 10 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 10 times, and counting... Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=19450 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=19450 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641927888/real 1641927892] req@000000006dfad8a4 x1721668949227712/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641927934 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641927877/real 1641927877] req@00000000fd61f3cb x1721668948092928/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641927923 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000934e0a33 x1721668925025728/t60129654331(60129654331) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641928099 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 11 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 11 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=20637 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=20637 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641929092/real 1641929110] req@000000009766d5b0 x1721669309189824/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641929138 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641929108/real 1641929111] req@000000002bb95a55 x1721669311119040/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641929154 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000934e0a33 x1721669283540416/t64424554980(64424554980) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641929204 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641929087/real 1641929087] req@0000000092cc6708 x1721669307531904/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641929133 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 12 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 12 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=21754 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=21754 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641930464/real 1641930464] req@0000000016c6bc01 x1721669670936640/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641930472 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1641930474/real 0] req@0000000028c3a584 x1721669671710464/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641930482 ref 2 fl Rpc:XNr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 13 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 13 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=23102 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=23102 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641931478/real 1641931478] req@00000000fc3e9eeb x1721669906480448/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641931485 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will 
fail Lustre: 72069:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641931478/real 1641931478] req@00000000f0aede40 x1721669906481280/t0(0) o35->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:23/10 lens 392/624 e 0 to 1 dl 1641931485 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'dd.0' Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641931483/real 1641931483] req@00000000b303f4ba x1721669906482048/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641931491 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x15ba538cc0056173 to 0x30c19e657b56c98a Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 8 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 8 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=24189 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=24189 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641932722/real 1641932738] req@0000000074bf8636 x1721670157398656/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641932766 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641932738/real 1641932741] req@000000008324bf3d x1721670159077248/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641932782 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641932702/real 1641932702] req@000000002351ccaf x1721670153540800/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641932746 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0x30c19e657b56c98a to 0xe8ea3230671f3e6e Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000b20c946a x1721670113932928/t47244700810(47244700810) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641932870 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: 
executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 9 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 9 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=25433 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=25433 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 
sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641933902/real 1641933919] req@000000002351ccaf x1721670408059584/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641933910 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641933917/real 1641933920] req@00000000332edb71 x1721670409194112/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641933925 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000fc3e9eeb x1721670384152000/t77309482894(77309482894) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641933984 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641933897/real 1641933897] req@00000000f2173751 x1721670406085376/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641933905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 14 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 14 times, and counting... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=26571 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=26571 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 
8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641935098/real 1641935115] req@00000000ee4b96e7 x1721670691929984/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935144 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641935113/real 1641935116] req@0000000048971edb x1721670692936384/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935159 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000007aa69edf x1721670680872640/t81604452937(81604452937) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641935209 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 15 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 15 times, and counting... Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641935083/real 1641935083] req@00000000819367e5 x1721670691523520/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935129 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641935093/real 1641935093] req@00000000d09005e0 x1721670691874560/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935139 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=27761 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=27761 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936305/real 1641936323] req@00000000bc77343d x1721671041002304/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936313 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress 
operations using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936310/real 1641936327] req@00000000a101f6de x1721671041968000/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936318 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936316/real 1641936331] req@00000000ba53cb5c x1721671042863424/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936324 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936321/real 1641936337] req@0000000037bb6f6b x1721671044140032/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936329 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0xe8ea3230671f3e6e to 0x89e752e48866fbd8 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000902a45de x1721671013056640/t51539805408(51539805408) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641936438 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641936290/real 1641936290] req@00000000f6b90054 x1721671039508416/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936298 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 10 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 10 times, and counting... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=29029 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=29029 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl 
mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937520/real 1641937539] req@00000000a7007eec x1721671346967808/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641937527 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937525/real 1641937542] req@00000000e989cf6d x1721671347854336/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937533 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937530/real 1641937546] req@0000000008c3f8cd x1721671348861376/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937538 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937535/real 1641937552] req@00000000768df730 x1721671349650944/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937543 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0x89e752e48866fbd8 to 0x7c968768543a351c Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 
10.240.42.144@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000099c2ad2a x1721671311053120/t55834665810(55834665810) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 648/600 e 0 to 0 dl 1641937645 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641937515/real 1641937515] req@00000000c29f8e3f x1721671345343104/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937523 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 16 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 16 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=30262 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=30262 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641938704/real 1641938719] req@0000000053740272 x1721671704752832/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641938712 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641938719/real 1641938722] req@00000000902a45de x1721671709414656/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641938727 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x7c968768543a351c to 0xb9fcf8380bb0a672 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000099b6b4ad x1721671696766592/t60129597602(60129597602) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641938868 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641938683/real 1641938683] req@000000008202ccd3 x1721671699893888/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641938691 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 11 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 11 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=31427 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=31427 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641939893/real 1641939893] req@000000000848fa73 x1721672072210176/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641939900 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641939893/real 1641939893] req@00000000b1bf3629 x1721672072210240/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641939901 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641939899/real 1641939899] req@000000009eaf20ad x1721672072210432/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641939907 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xb9fcf8380bb0a672 to 0x2869ec10b340e9b3 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000678ab890 x1721672027449408/t64424575609(64424575609) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641940026 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0' LustreError: 
8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 17 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 17 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=32615 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=32615 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing 
wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641941105/real 1641941123] req@00000000c44efa57 x1721672394969536/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641941151 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000009f4098ad x1721672393424256/t94489318772(94489318772) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641941225 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641941095/real 1641941095] req@00000000d02e1b64 x1721672394954304/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641941141 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 18 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 18 times, and counting... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=33777 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=33777 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 
96733:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641942288/real 1641942288] req@00000000fee45b25 x1721672803164480/t0(0) o35->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:23/10 lens 392/624 e 0 to 1 dl 1641942295 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'dd.0' Lustre: 96733:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641942290/real 1641942290] req@00000000a10aeceb x1721672803165312/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641942336 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641942295/real 1641942295] req@0000000082f46ed6 x1721672803165568/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641942341 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 19 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 19 times, and counting... LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000781ee55e x1721672703195712/t98784297415(98784297415) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641942413 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=34943 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=34943 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943529/real 1641943547] req@00000000e28bc600 x1721673181356992/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943573 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 
8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943534/real 1641943551] req@0000000025825f90 x1721673181608704/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943578 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943539/real 1641943555] req@00000000d3c205e0 x1721673182036416/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943583 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943545/real 1641943561] req@0000000044cf5f33 x1721673182173504/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943589 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641943514/real 1641943514] req@0000000072791521 x1721673179968320/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943558 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000cf93a0d1 x1721673106339392/t103079271407(103079271407) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641943635 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 20 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 20 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=36184 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=36184 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1641944738/real 0] req@00000000aaf6af78 x1721673430074304/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641944745 ref 2 fl Rpc:XNr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail 
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641944738/real 1641944755] req@000000004a07fb0d x1721673430074368/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944784 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641944744/real 1641944759] req@00000000412b9686 x1721673430623616/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944790 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641944718/real 1641944718] req@00000000dc575401 x1721673424520640/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944764 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641944728/real 1641944728] req@00000000d011aadf x1721673428338816/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944774 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x2869ec10b340e9b3 to 0xb30e277e7821f89b Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 12 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 12 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=37422 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=37422 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641945955/real 1641945958] req@000000004be3d4e0 x1721673659613184/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641946001 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641945935/real 1641945935] req@000000002071aac8 x1721673654736384/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641945981 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641945929/real 1641945929] req@00000000aa49c1ef x1721673653445696/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641945975 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 11 previous similar messages Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 11 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xb30e277e7821f89b to 0xbf5b38fddc4dd777 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking 
the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 13 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 13 times, and counting... Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=38652 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=38652 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641947117/real 1641947135] req@00000000a97a2781 x1721673829299584/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947161 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641947122/real 1641947139] req@00000000fa7aecb0 x1721673829722432/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947166 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641947127/real 1641947146] req@00000000a37f3a15 x1721673829768192/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947171 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000243e21ee x1721673813906688/t111669187631(111669187631) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641947243 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 21 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 21 times, and counting... 
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641947112/real 1641947112] req@0000000061c4c4c9 x1721673829291584/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947156 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=39787 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=39787 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL 
state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948327/real 1641948342] req@00000000914b72f4 x1721674006028800/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948371 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948332/real 1641948350] req@00000000f492794c x1721674006084224/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948376 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948337/real 1641948354] req@000000002d750631 x1721674006340736/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948381 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948342/real 1641948358] req@00000000d62fe959 x1721674006349248/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948386 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000375937ab x1721673981731200/t115964239485(115964239485) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641948442 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641948321/real 1641948321] req@000000002864288b x1721674006020480/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948365 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 22 times, and counting... 
Lustre: DEBUG MARKER: mds2 has failed over 22 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=40988 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=40988 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG 
MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641949546/real 1641949550] req@00000000f492794c x1721674247782976/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641949590 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000a37f3a15 x1721674221628864/t120259150764(120259150764) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641949675 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 23 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 23 times, and counting... Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641949536/real 1641949536] req@00000000fad0c7f3 x1721674247165440/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641949580 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=42221 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=42221 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950728/real 1641950748] req@00000000b51ebf25 x1721674522467776/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641950735 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service 
will fail Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950733/real 1641950752] req@00000000aacc57fa x1721674523822144/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950741 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950738/real 1641950755] req@000000006a4f78e8 x1721674524402560/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950746 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950743/real 1641950759] req@00000000957542c6 x1721674525328128/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950751 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000603e1b8b x1721674500216512/t77309801101(77309801101) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641950832 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641950713/real 1641950713] req@00000000ccfd96b0 x1721674517642432/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950721 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641950723/real 1641950723] req@000000000b663dd6 x1721674520721920/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950731 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 14 times, and counting... Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0xbf5b38fddc4dd777 to 0xe35e4df03c609009 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: mds1 has failed over 14 times, and counting... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=43420 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=43420 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG 
MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641951930/real 1641951947] req@00000000595fa783 x1721674779246144/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641951937 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641951945/real 1641951948] req@0000000036f58690 x1721674782792896/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641951991 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xe35e4df03c609009 to 0x813117028d1b09ea Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000ec947795 x1721674758516096/t81604486357(81604486357) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 576/600 e 0 to 0 dl 1641952080 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641951919/real 1641951919] req@00000000a9e49b40 x1721674776412800/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641951965 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 
8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 15 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 15 times, and counting... LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000087835faf x1721674758515584/t124554160968(124554160968) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641952131 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0' Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=44654 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=44654 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953140/real 1641953158] req@0000000047c592da x1721675041788352/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953148 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: Skipped 1 previous similar message Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953145/real 1641953162] req@00000000aee522e5 x1721675042637376/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953153 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953150/real 1641953167] req@00000000fca5e081 x1721675044588480/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953158 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953155/real 1641953172] req@000000000ea49979 x1721675045509056/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953163 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 24 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 24 times, and counting... Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641953130/real 1641953130] req@00000000380bb76f x1721675039164416/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953138 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=45800 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=45800 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641954323/real 1641954339] req@0000000008fb39ac x1721675289991232/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641954330 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service 
will fail Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641954328/real 1641954343] req@00000000a37f3a15 x1721675292029632/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641954336 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641954333/real 1641954351] req@000000000ddc4cbd x1721675292529344/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641954341 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x813117028d1b09ea to 0xed24f2b7899fd13c Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000071d7a2d2 x1721675285146624/t85899532328(85899532328) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641954452 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641954318/real 1641954318] req@00000000a734bdbe x1721675287693120/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641954326 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 16 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 16 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=47044 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=47044 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: 
mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641955520/real 1641955520] req@00000000af84ce89 x1721675539534912/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641955527 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641955544/real 1641955562] req@0000000024960aeb x1721675539537472/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955590 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641955520/real 1641955520] req@000000001e8a4178 x1721675539535040/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955566 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641955554/real 1641955570] req@000000003af6518b x1721675539537856/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955600 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641955533/real 1641955533] req@00000000e962f5bc x1721675539537088/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955579 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: 
executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xed24f2b7899fd13c to 0x1753d2d5422b7f62 Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 17 times, and counting... Lustre: DEBUG MARKER: mds1 has failed over 17 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=48242 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=48242 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2 Lustre: DEBUG MARKER: Starting failover on mds2 Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641956742/real 1641956758] req@00000000784f52c3 x1721675862288704/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956788 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress 
operations using this service will wait for recovery to complete Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641956747/real 1641956767] req@00000000703807ed x1721675863299968/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956793 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641956752/real 1641956771] req@0000000013bc9dcb x1721675864695232/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956798 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000043e35ee8 x1721675808011008/t137439011322(137439011322) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641956862 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641956732/real 1641956732] req@00000000d53778aa x1721675859098432/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956778 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp) Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4 Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 25 times, and counting... Lustre: DEBUG MARKER: mds2 has failed over 25 times, and counting... Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=49410 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=49410 DURATION=86400 PERIOD=1200 Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid Lustre: DEBUG MARKER: lctl get_param -n at_max Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 Lustre: DEBUG MARKER: Starting failover on mds1 Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641957943/real 1641957959] req@000000002e072981 x1721676134565824/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957989 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete LustreError: 166-1: 
MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641957948/real 1641957968] req@0000000018025cea x1721676135709632/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957994 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641957922/real 1641957922] req@000000000e75c9bc x1721676131893184/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957968 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641957938/real 1641957938] req@0000000013bc9dcb x1721676134153664/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957984 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages | Link to test |