Editing crashreport #8152

Reason: watchdog: BUG: soft lockup -
Crashing Function: copy_page
Where to cut Backtrace:
collapse_huge_page
khugepaged
kthread
ret_from_fork
panic
watchdog_timer_fn
__hrtimer_run_queues
hrtimer_interrupt
smp_apic_timer_interrupt
apic_timer_interrupt
Reports Count: 55

Added fields:

Match messages in logs
(every line must be present in the log output; see the matching sketch after this form.
Copy from the "Messages before crash" column below):
Match messages in full crash
(every line must be present in the crash log output.
Copy from the "Full Crash" column below):
Limit to a test:
(Copy from the "Failing Test" column below):
Delete these reports as invalid (e.g. a real bug already under review)
Bug or comment:
Extra info:
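
The two "match" fields act as an AND filter over lines: a report is kept only when every non-empty line entered in the field also occurs in that report's log (or full-crash) output. A minimal illustrative sketch of that rule in Python, assuming plain per-line substring matching; the names report_matches, match_lines and log_output are hypothetical and not part of the tool:

    # Sketch of the matching rule: every non-empty pattern line must appear
    # somewhere in the report's log output for the report to match.
    def report_matches(match_lines: str, log_output: str) -> bool:
        required = [ln.strip() for ln in match_lines.splitlines() if ln.strip()]
        return all(ln in log_output for ln in required)

    # Example: keep only reports showing both the soft-lockup banner and the
    # khugepaged frame.
    patterns = "watchdog: BUG: soft lockup\nkhugepaged+0xed9/0x11e0"
    # report_matches(patterns, full_crash_text) is True only if both lines appear.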

Failures list (last 100):

Failing Test | Full Crash | Messages before crash | Comment
sanity-sec test 19: test nodemap trusted_admin fileops
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) dm_mod zfs(POE) spl(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev i2c_piix4 virtio_balloon pcspkr ext4 mbcache jbd2 ata_generic crc32c_intel ata_piix virtio_net libata serio_raw virtio_blk net_failover failover [last unloaded: obdecho]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.58.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb3360074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 00000000905cf867 RBX: ffffeac9024173c0 RCX: 0000000000000200
RDX: 7fffffff6fa30798 RSI: ffff90a6505cf000 RDI: ffff90a6643cf000
RBP: 000055f43a9cf000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 0000000000000000 R12: ffff90a6012d1e78
R13: ffff90a613268000 R14: ffffeac90290f3c0 R15: ffff90a5c5dca740
FS: 0000000000000000(0000) GS:ffff90a67fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff9aba24000 CR3: 0000000051810006 CR4: 00000000000606f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.58.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.*.identity_upcall=NONE
Lustre: 258578:0:(mdt_lproc.c:311:identity_upcall_store()) lustre-MDT0001: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS"
Lustre: 258578:0:(mdt_lproc.c:311:identity_upcall_store()) Skipped 1 previous similar message
Lustre: 210773:0:(nodemap_handler.c:3055:nodemap_create()) adding nodemap 'c0' to config without default nodemap
Lustre: 210773:0:(nodemap_handler.c:3055:nodemap_create()) Skipped 26 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c1.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c1.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c1.admin_nodemap
Link to test
sanity-quota test 16b: lfs quota should skip the nonexistent MDT/OST
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 joydev pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata serio_raw virtio_net net_failover failover virtio_blk [last unloaded: dm_flakey]
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffc07300753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000003ca86867 RBX: ffffec46c0f2a180 RCX: 0000000000000200
RDX: 7fffffffc3579798 RSI: ffff9e41fca86000 RDI: ffff9e420c5f7000
RBP: 000055cbc1bf7000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffee R12: ffff9e41f557afb8
R13: ffff9e41f6655000 R14: ffffec46c1317dc0 R15: ffff9e41c5d7ae80
FS: 0000000000000000(0000) GS:ffff9e427fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f843eecc48 CR3: 000000000e810004 CR4: 00000000001706e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
Lustre: lustre-MDT0000: Not available for connect from 10.240.30.80@tcp (stopping)
Lustre: Skipped 11 previous similar messages
Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.30.80@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 18 previous similar messages
LustreError: 107302:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) ldlm_cancel from 10.240.28.51@tcp arrived at 1752068514 with bad export cookie 8652668417692940817
LustreError: 107302:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) Skipped 4 previous similar messages
LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.30.80@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 11 previous similar messages
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3
LustreError: 107302:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1752068521 with bad export cookie 8652668417692940075
LustreError: 166-1: MGC10.240.28.44@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0002: Not available for connect from 10.240.28.51@tcp (stopping)
Lustre: Skipped 5 previous similar messages
Lustre: server umount lustre-MDT0002 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --index=0 --reformat /dev/mapper/mds1_flakey
LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Autotest: Test running for 210 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=onyx-99vm1@tcp --fsname=lustre --mdt --index=2 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --index=100 --reformat /dev/ma
LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=remount-ro
LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Lustre: Setting parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000
Lustre: Skipped 3 previous similar messages
Lustre: ctl-lustre-MDT0000: No data found on store. Initialize space: rc = -61
Lustre: lustre-MDT0000: new disk, initializing
Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: sync; sleep 1; sync
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
Lustre: Modifying parameter lustre-MDT0001.mdt.identity_upcall in log lustre-MDT0001
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm8.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-99vm8.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds3
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds3_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds3_flakey 2>&1
Lustre: DEBUG MARKER: test -b /dev/mapper/mds3_flakey
Lustre: DEBUG MARKER: e2label /dev/mapper/mds3_flakey
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds3; mount -t lustre -o localrecov /dev/mapper/mds3_flakey /mnt/lustre-mds3
LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=remount-ro
LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Lustre: Modifying parameter lustre-MDT0064.mdt.identity_upcall in log lustre-MDT0064
Lustre: 706434:0:(mgc_request.c:1926:mgc_llog_local_copy()) MGC10.240.28.44@tcp: no remote llog for lustre-sptlrpc, check MGS config
Lustre: srv-lustre-MDT0064: No data found on store. Initialize space: rc = -61
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0064: new disk, initializing
Lustre: lustre-MDT0064: Imperative Recovery not enabled, recovery window 60-180
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000280000400-0x00000002c0000400]:64:mdt
Lustre: Skipped 1 previous similar message
Lustre: cli-ctl-lustre-MDT0064: Allocated super-sequence [0x0000000280000400-0x00000002c0000400]:64:mdt]
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: e2label /dev/mapper/mds3_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: sync; sleep 1; sync
Lustre: DEBUG MARKER: e2label /dev/mapper/mds3_flakey 2>/dev/null
Lustre: lustre-OST0000-osc-MDT0000: update sequence from 0x100000000 to 0x2c0000401
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0064-osc-MDT0000: update sequence from 0x100640000 to 0x300000401
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-145vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osp.*.destroys_in_flight
Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osp.*.destroys_in_flight
Lustre: DEBUG MARKER: lctl set_param -n os[cd]*.*MDT*.force_sync=1
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: lfs --list-commands
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: lfs --list-commands
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
Lustre: lustre-MDT0000: Not available for connect from 10.240.28.51@tcp (stopping)
Lustre: Skipped 16 previous similar messages
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.30.80@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 33 previous similar messages
Lustre: DEBUG MARKER: modprobe dm-flakey;
LustreError: 704436:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) ldlm_cancel from 10.240.28.51@tcp arrived at 1752069008 with bad export cookie 8652668417695125167
LustreError: 704436:0:(ldlm_lockd.c:2572:ldlm_cancel_handler()) Skipped 8 previous similar messages
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3
Lustre: 13559:0:(client.c:2355:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1752069008/real 1752069008] req@000000000bc8652e x1837164635120448/t0(0) o400->MGC10.240.28.44@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1752069015 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker.0'
LustreError: 166-1: MGC10.240.28.44@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Autotest: Test running for 215 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-145vm9.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: onyx-145vm9.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm8.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: onyx-99vm8.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: [ -e /dev/mapper/mds1_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --reformat /dev/mapper/mds1_flakey
LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Autotest: Test running for 220 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
Lustre: DEBUG MARKER: [ -e /dev/mapper/mds3_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=onyx-99vm1@tcp --fsname=lustre --mdt --index=2 --param=sys.timeout=20 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=1981808 --mkfsoptions="-O ea_inode,large_dir" --reformat /dev/mapper/mds3_fl
Autotest: Test running for 225 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
INFO: task mke2fs:713312 blocked for more than 120 seconds.
Tainted: G OE -------- - - 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:mke2fs state:D stack:0 pid:713312 ppid:713311 flags:0x00004080
Call Trace:
__schedule+0x2d1/0x870
? blk_flush_plug_list+0xd7/0x100
? wbt_exit+0x30/0x30
? __wbt_done+0x40/0x40
schedule+0x55/0xf0
io_schedule+0x12/0x40
rq_qos_wait+0xb3/0x130
? karma_partition+0x1f0/0x1f0
? wbt_exit+0x30/0x30
wbt_wait+0x96/0xc0
__rq_qos_throttle+0x23/0x40
blk_mq_make_request+0x131/0x5c0
generic_make_request_no_check+0xe1/0x330
submit_bio+0x3c/0x160
blk_next_bio+0x33/0x40
__blkdev_issue_zero_pages+0x90/0x190
blkdev_issue_zeroout+0xef/0x222
blkdev_fallocate+0x13f/0x1a0
vfs_fallocate+0x140/0x280
ksys_fallocate+0x3c/0x80
__x64_sys_fallocate+0x1a/0x30
do_syscall_64+0x5b/0x1a0
entry_SYSCALL_64_after_hwframe+0x66/0xcb
RIP: 0033:0x7f18c8bfa62b
Code: Unable to access opcode bytes at RIP 0x7f18c8bfa601.
RSP: 002b:00007ffda4b24558 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
RAX: ffffffffffffffda RBX: 00005614afb868e0 RCX: 00007f18c8bfa62b
RDX: 000000004f512000 RSI: 0000000000000010 RDI: 0000000000000003
RBP: 000000004f512000 R08: 00007ffda4b246bc R09: 0000000000000000
R10: 00000000116b8000 R11: 0000000000000246 R12: 00000000116b8000
R13: 0000000000000003 R14: 00000000000116b8 R15: 00005614afb86760
Autotest: Test running for 230 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Autotest: Test running for 240 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
Autotest: Test running for 245 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
Autotest: Test running for 250 minutes (lustre-b_es-reviews_review-dne-part-4_24391.28)
Link to test
sanityn test complete, duration 6169 sec
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34]
Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache i2c_piix4 virtio_balloon intel_rapl_msr intel_rapl_common joydev pcspkr crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_blk serio_raw virtio_net net_failover failover [last unloaded: libcfs]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.53.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffae2280753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 00000000132b0845 RBX: ffffdc49804cac00 RCX: 0000000000000200
RDX: 7fffffffecd4f7ba RSI: ffff8fd3932b0000 RDI: ffff8fd3f6cdb000
RBP: 0000565121adb000 R08: 00000000000396d8 R09: 00000000000396d0
R10: 0000000000000007 R11: 00000000fffffff9 R12: ffff8fd383cd56d8
R13: ffff8fd39c04a800 R14: ffffdc4981db36c0 R15: ffff8fd3b533a828
FS: 0000000000000000(0000) GS:ffff8fd43fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f260b90a3c8 CR3: 000000001a610006 CR4: 00000000000606f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
? mutex_lock+0xe/0x30
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.53.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark === sanityn: start cleanup 19:21:11 \(1750620071\) ===
Lustre: DEBUG MARKER: === sanityn: start cleanup 19:21:11 (1750620071) ===
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
LustreError: 757794:0:(lov_obd.c:784:lov_cleanup()) lustre-clilov-ffff8fd382c68800: lov tgt 0 not cleaned! deathrow=0, lovrc=1
LustreError: 757794:0:(obd_class.h:481:obd_check_dev()) Device 28 not setup
Lustre: Unmounted lustre-client
Lustre: DEBUG MARKER: /usr/sbin/lctl mark === sanityn: finish cleanup 19:21:58 \(1750620118\) ===
Lustre: DEBUG MARKER: === sanityn: finish cleanup 19:21:58 (1750620118) ===
Lustre: 749114:0:(client.c:2453:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1750620139/real 0] req@ffff8fd383ec9380 x1835657860501376/t0(0) o400->lustre-OST0000-osc-ffff8fd382c6b800@10.240.43.221@tcp:28/4 lens 224/224 e 0 to 1 dl 1750620155 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 projid:4294967295
Lustre: 749114:0:(client.c:2453:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: lustre-OST0000-osc-ffff8fd382c6b800: Connection to lustre-OST0000 (at 10.240.43.221@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: MGC10.240.40.114@tcp: Connection to MGS (at 10.240.40.114@tcp) was lost; in progress operations using this service will fail
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc dm_mod ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover virtio_blk failover
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffa06dc0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000012dba1865 RBX: ffffdf6d04b6e840 RCX: 0000000000000200
RDX: 7ffffffed245e79a RSI: ffff918cadba1000 RDI: ffff918c1b879000
RBP: 000055e720679000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff7 R12: ffff918c8479c3c8
R13: ffff918cbbd8a800 R14: ffffdf6d026e1e40 R15: ffff918c8299e910
FS: 0000000000000000(0000) GS:ffff918cbbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f68d156c024 CR3: 000000009ae10006 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 20 minutes (lustre-master_full-part-1_4624.118)
Autotest: Test running for 25 minutes (lustre-master_full-part-1_4624.118)
Autotest: Test running for 30 minutes (lustre-master_full-part-1_4624.118)
Autotest: Test running for 35 minutes (lustre-master_full-part-1_4624.118)
Autotest: Test running for 40 minutes (lustre-master_full-part-1_4624.118)
Autotest: Test running for 45 minutes (lustre-master_full-part-1_4624.118)
Autotest: Test running for 55 minutes (lustre-master_full-part-1_4624.118)
Lustre: 7633:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1745406313/real 1745406313] req@00000000ec651fcf x1830187634946496/t0(0) o400->MGC10.240.27.90@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1745406320 ref 1 fl Rpc:RXNQ/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
LustreError: 166-1: MGC10.240.27.90@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.240.27.84@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: MGS: Client e6a52ee6-f037-47b7-a199-72f5f3e858db (at 0@lo) reconnecting
Lustre: lustre-MDT0000: Received new LWP connection from 0@lo, keep former export from same NID
Lustre: lustre-MDT0000-lwp-MDT0000: Connection restored to 10.240.27.90@tcp (at 0@lo)
Lustre: lustre-OST0000-osc-MDT0000: Connection restored to (at 10.240.27.84@tcp)
Lustre: Skipped 1 previous similar message
Link to test
sanity test 900: umount should not race with any mgc requeue thread
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:35]
Modules linked in: dm_flakey osp(OE) ofd(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 intel_rapl_msr dns_resolver nfs lockd grace fscache intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 joydev sunrpc pcspkr dm_mod ext4 mbcache jbd2 ata_generic crc32c_intel serio_raw ata_piix virtio_net libata virtio_blk net_failover failover [last unloaded: obdecho]
CPU: 0 PID: 35 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffb6948075bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000059c00867 RBX: fffffa8481670000 RCX: 0000000000000200
RDX: 7fffffffa63ff798 RSI: ffff92e2d9c00000 RDI: ffff92e293d50000
RBP: 000055f7e3d50000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffffe R12: ffff92e2c8a1fa80
R13: ffff92e2cc662800 R14: fffffa84804f5400 R15: ffff92e2b52622b8
FS: 0000000000000000(0000) GS:ffff92e33fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f7e3e101f0 CR3: 000000004ac10005 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 35 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 0-...0: (20250 ticks this GP) idle=542/1/0x4000000000000002 softirq=7342043/7342043 fqs=1902
(detected by 1, t=60002 jiffies, g=10469413, q=1002)
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
Sending NMI from CPU 1 to CPUs 0:
apic_timer_interrupt+0xf/0x20
NMI backtrace for cpu 0
CPU: 0 PID: 35 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
</IRQ>
RIP: 0010:__orc_find+0x44/0x80
Autotest: Test running for 385 minutes (lustre-master_rolling-upgrade-mds_4622.145)
Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.240.28.195@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
LustreError: MGC10.240.28.195@tcp: Connection to MGS (at 10.240.28.195@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.28.195@tcp (at 10.240.28.195@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.240.28.195@tcp) after server handle changed from 0x6e639d631a8049c to 0x6e639d631a9092b
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-106vm12.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-106vm12.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0003: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x300000400:75022 to 0x300000400:75105)
Lustre: lustre-OST0000: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x240000400:240362 to 0x240000400:249377)
Lustre: lustre-OST0001: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x280000400:75064 to 0x280000400:75169)
Lustre: lustre-OST0002: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x2c0000400:74989 to 0x2c0000400:75073)
Lustre: lustre-OST0005: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x380000400:74979 to 0x380000400:75009)
Lustre: lustre-OST0004: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x340000400:74957 to 0x340000400:75041)
Lustre: lustre-OST0006: new connection from lustre-MDT0000-mdtlov (cleaning up unused objects from 0x3c0000400:74979 to 0x3c0000400:75009)
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm4.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm5.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-46vm5.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-46vm4.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982356/real 1744982356] req@00000000bbcb7d2c x1829702233358848/t0(0) o400->lustre-MDT0000-lwp-OST0006@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 1744982401 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982361/real 1744982361] req@0000000006b7951d x1829702233359232/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 1744982406 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 57726:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000-lwp-OST0004: Connection to lustre-MDT0000 (at 10.240.28.195@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 5 previous similar messages
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
LustreError: 1251351:0:(client.c:1282:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@000000008b84d2ed x1829702233388928/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.240.28.195@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:QU/200/ffffffff rc 0/-1 job:'qsd_reint_0.lus.0' uid:0 gid:0
LustreError: 1251351:0:(qsd_reint.c:38:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982443/real 1744982443] req@000000002dcadbe5 x1829702233386752/t0(0) o400->MGC10.240.28.195@tcp@10.240.28.195@tcp:26/25 lens 224/224 e 0 to 1 dl 1744982459 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
LustreError: MGC10.240.28.195@tcp: Connection to MGS (at 10.240.28.195@tcp) was lost; in progress operations using this service will fail
LustreError: 1251348:0:(obd_class.h:479:obd_check_dev()) Device 4 not setup
LustreError: 1251348:0:(obd_class.h:479:obd_check_dev()) Skipped 1 previous similar message
Lustre: server umount lustre-OST0000 complete
Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982443/real 1744982443] req@00000000872a75be x1829702233387392/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 1744982488 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: lustre-MDT0000-lwp-OST0005: Connection to lustre-MDT0000 (at 10.240.28.195@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1744982478/real 1744982478] req@00000000e9775cea x1829702233391488/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.28.195@tcp:12/10 lens 224/224 e 0 to 1 dl 1744982523 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 57725:0:(client.c:2346:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup table /dev/mapper/ost1_flakey
Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/ost1_flakey
Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1
Link to test
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev i2c_piix4 pcspkr virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover failover serio_raw virtio_blk
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-553.46.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb0c100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000011addb867 RBX: fffff094c46b76c0 RCX: 0000000000000200
RDX: 7ffffffee5224798 RSI: ffff9b6e5addb000 RDI: ffff9b6e43c8d000
RBP: 000056361488d000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffffc R12: ffff9b6e501a2468
R13: ffff9b6e7bd88000 R14: fffff094c40f2340 R15: ffff9b6e5f174910
FS: 0000000000000000(0000) GS:ffff9b6e7bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055728771cbfc CR3: 0000000010c10006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L -------- - - 4.18.0-553.46.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Link to test
sanity-pcc test 3a: Repeat attach/detach operations
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:36]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon joydev pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover virtio_blk failover
CPU: 1 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.46.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa4664075bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000043d48867 RBX: ffffd9dc010f5200 RCX: 0000000000000200
RDX: 7fffffffbc2b7798 RSI: ffff92bd83d48000 RDI: ffff92bd85d7f000
RBP: 000055b09f57f000 R08: ffff92bdffd00000 R09: ffffffffad5c6880
R10: 0000000000000007 R11: 00000000ffffffeb R12: ffff92bd7ae98bf8
R13: ffff92bdffc50000 R14: ffffd9dc01175fc0 R15: ffff92bd7aeb5cb0
FS: 0000000000000000(0000) GS:ffff92bdffd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fff507f8b78 CR3: 000000004d010006 CR4: 00000000001706e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.46.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param debug=-1 debug_mb=150
LustreError: 64168:0:(mdt_open.c:1883:mdt_orphan_open()) lustre-MDT0001: cannot create volatile file [0x2400032e3:0x3:0x0]: rc = -11
LustreError: 64168:0:(mdt_open.c:2133:mdt_hsm_release()) lustre-MDT0001: cannot open orphan file [0x2400032e3:0x3:0x0]: rc = -11
Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-pcc test_3a: @@@@@@ FAIL: failed to attach file \/mnt\/lustre\/d3a.sanity-pcc\/f3a.sanity-pcc
Lustre: DEBUG MARKER: sanity-pcc test_3a: @@@@@@ FAIL: failed to attach file /mnt/lustre/d3a.sanity-pcc/f3a.sanity-pcc
Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2025-04-09/lustre-b_es-reviews_review-dne-exa6-part-1_22924_72_0c4f65af-8d13-4971-abc6-a6eadfa505a4//sanity-pcc.test_3a.debug_log.$(hostname -s).1744205437.log;
Autotest: Test running for 40 minutes (lustre-b_es-reviews_review-dne-exa6-part-1_22924.72)
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34]
Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul virtio_balloon ghash_clmulni_intel joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw virtio_blk failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffbed840753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000001f6bb865 RBX: fffff169007daec0 RCX: 0000000000000200
RDX: 7fffffffe094479a RSI: ffff9d779f6bb000 RDI: ffff9d77abd9e000
RBP: 0000562c5539e000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffffa R12: ffff9d77850e0cf0
R13: ffff9d77a2c6d000 R14: fffff16900af6780 R15: ffff9d7784568000
FS: 0000000000000000(0000) GS:ffff9d783fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe3b358f024 CR3: 0000000021210002 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_112217.11)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_112217.11)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_112217.11)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_112217.11)
Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_112217.11)
Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_112217.11)
Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_112217.11)
Link to test
sanity test 314: OSP shouldn't fail after last_rcvd update failure
watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [khugepaged:34]
Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) ib_core tcp_diag inet_diag loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover [last unloaded: libcfs]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 1.16.0-4.module+el8.8.0+1454+0b2cbfb8 04/01/2014
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffad1800753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000021646865 RBX: ffffcdc900859180 RCX: 0000000000000200
RDX: 7fffffffde9b979a RSI: ffff9fcba1646000 RDI: ffff9fcb92bac000
RBP: 0000564185dac000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff0 R12: ffff9fcb8447bd60
R13: ffff9fcbe765a800 R14: ffffcdc9004aeb00 R15: ffff9fcbb5951ae0
FS: 0000000000000000(0000) GS:ffff9fcc3fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005641864c68e0 CR3: 0000000065c10005 CR4: 0000000000170ef0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.44.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 1.16.0-4.module+el8.8.0+1454+0b2cbfb8 04/01/2014
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 160 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 165 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 170 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 175 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 180 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 185 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 190 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 195 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 200 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 205 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 210 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 215 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 220 minutes (lustre-reviews_custom_112026.1004)
Autotest: Test running for 225 minutes (lustre-reviews_custom_112026.1004)
Lustre: 468035:0:(client.c:2346:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1743103738/real 0] req@ffff9fcbef680000 x1827770860516864/t0(0) o400->lustre-MDT0000-mdc-ffff9fcb8601b800@10.240.26.6@tcp:12/10 lens 224/224 e 0 to 1 dl 1743103754 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: lustre-MDT0000-mdc-ffff9fcb8601b800: Connection to lustre-MDT0000 (at 10.240.26.6@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: MGC10.240.26.6@tcp: Connection to MGS (at 10.240.26.6@tcp) was lost; in progress operations using this service will fail
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr virtio_balloon intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev pcspkr sunrpc ext4 mbcache jbd2 ata_generic crc32c_intel ata_piix virtio_net serio_raw libata virtio_blk net_failover failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.40.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb4070074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000019441867 RBX: ffffed2940651040 RCX: 0000000000000200
RDX: 7fffffffe6bbe798 RSI: ffff9b49d9441000 RDI: ffff9b4a183b3000
RBP: 000056237cbb3000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffe9 R12: ffff9b49fb423d98
R13: ffff9b4a5da78000 R14: ffffed294160ecc0 R15: ffff9b49c62de910
FS: 0000000000000000(0000) GS:ffff9b4a7fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056238066ce88 CR3: 000000009c010004 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.40.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_111584.11)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_111584.11)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_111584.11)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_111584.11)
Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_111584.11)
Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_111584.11)
Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_111584.11)
Link to test
lustre-rsync-test test 3b: Replicate files created by writemany
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 sunrpc joydev pcspkr ext4 mbcache jbd2 ata_generic virtio_net crc32c_intel ata_piix libata serio_raw virtio_blk net_failover failover
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff9b7780753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000001ef27867 RBX: ffffcc75007bc9c0 RCX: 0000000000000200
RDX: 7fffffffe10d8798 RSI: ffff899d1ef27000 RDI: ffff899d31bdf000
RBP: 000055e8301df000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffef R12: ffff899d05b40ef8
R13: ffff899d95448000 R14: ffffcc7500c6f7c0 R15: ffff899d30762e80
FS: 0000000000000000(0000) GS:ffff899dbfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f119fcc8024 CR3: 0000000093a10003 CR4: 00000000000606f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-zfs-part-5_109672.16)
Lustre: lustre-OST0000-osc-ffff899d04463800: disconnect after 27s idle
Lustre: Skipped 3 previous similar messages
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-zfs-part-5_109672.16)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-zfs-part-5_109672.16)
Lustre: 8596:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734193367/real 0] req@ffff899d2efc8d00 x1818431180456064/t0(0) o400->lustre-MDT0000-mdc-ffff899d04463800@10.240.26.26@tcp:12/10 lens 224/224 e 0 to 1 dl 1734193383 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: lustre-MDT0000-mdc-ffff899d04463800: Connection to lustre-MDT0000 (at 10.240.26.26@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: MGC10.240.26.26@tcp: Connection to MGS (at 10.240.26.26@tcp) was lost; in progress operations using this service will fail
Lustre: 8595:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734193372/real 0] req@ffff899d2efc8340 x1818431180456704/t0(0) o400->lustre-MDT0000-mdc-ffff899d04463800@10.240.26.26@tcp:12/10 lens 224/224 e 0 to 1 dl 1734193388 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 8595:0:(client.c:2364:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Link to test
conf-sanity test 56a: check big OST indexes and out-of-index-order start
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr i2c_piix4 virtio_balloon intel_rapl_common crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover failover virtio_blk serio_raw [last unloaded: libcfs]
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffbc4c00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000003c583867 RBX: ffffebccc0f160c0 RCX: 0000000000000200
RDX: 7fffffffc3a7c798 RSI: ffff9c86bc583000 RDI: ffff9c86daf83000
RBP: 0000556609d83000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 0000000000000000 R12: ffff9c86c29b3c18
R13: ffff9c86a9a60000 R14: ffffebccc16be0c0 R15: ffff9c8683a292b8
FS: 0000000000000000(0000) GS:ffff9c873fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32fb6b9d05 CR3: 0000000028010004 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost1_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey
LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost2_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=1 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost2_flakey
LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost3_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=2 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost3_flakey
LDISKFS-fs (dm-13): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost4_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=3 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost4_flakey
LDISKFS-fs (dm-14): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost5_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=4 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost5_flakey
LDISKFS-fs (dm-15): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost6_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=5 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost6_flakey
LDISKFS-fs (dm-16): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost7_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=6 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost7_flakey
LDISKFS-fs (dm-17): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost8_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=7 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost8_flakey
LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --index=10000 --reformat /dev/mapper/ost1_flakey
LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=1 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --index=1000 --reformat /dev/mapper/ost2_flakey
LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey 2>&1
Lustre: DEBUG MARKER: test -b /dev/mapper/ost1_flakey
Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre -o localrecov /dev/mapper/ost1_flakey /mnt/lustre-ost1
LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: user_xattr,acl,no_mbcache,nodelalloc
Lustre: 631343:0:(mgc_request_server.c:553:mgc_llog_local_copy()) MGC10.240.43.6@tcp: no remote llog for lustre-sptlrpc, check MGS config
Lustre: lustre-OST2710: new disk, initializing
Lustre: srv-lustre-OST2710: No data found on store. Initialize space.
Lustre: lustre-OST2710: Imperative Recovery not enabled, recovery window 60-180
Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST2710-super.width=16384
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: sync; sleep 1; sync
Lustre: cli-lustre-OST2710-super: Allocated super-sequence [0x0000000300000400-0x0000000340000400]:2710:ost]
Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST2710-osc-[-0-9a-f]\*.ost_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST2710-osc-[-0-9a-f]\*.ost_server_uuid
Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST2710-osc-[-0-9a-f]*.ost_server_uuid
Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST2710-osc-[-0-9a-f]*.ost_server_uuid
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost2_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost2_flakey 2>&1
Lustre: DEBUG MARKER: test -b /dev/mapper/ost2_flakey
Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre -o localrecov /dev/mapper/ost2_flakey /mnt/lustre-ost2
LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro
LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: user_xattr,acl,no_mbcache,nodelalloc
Lustre: lustre-OST03e8: new disk, initializing
Lustre: srv-lustre-OST03e8: No data found on store. Initialize space.
Lustre: cli-lustre-OST03e8-super: Allocated super-sequence [0x0000000340000400-0x0000000380000400]:3e8:ost]
Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST03e8-super.width=16384
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_default_debug -1 all
Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST03e8-osc-[-0-9a-f]\*.ost_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST03e8-osc-[-0-9a-f]\*.ost_server_uuid
Lustre: DEBUG MARKER: trevis-84vm4.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST03e8-osc-[-0-9a-f]*.ost_server_uuid
Lustre: DEBUG MARKER: trevis-84vm5.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST03e8-osc-[-0-9a-f]*.ost_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST2710-osc-MDT0003.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST03e8-osc-MDT0003.ost_server_uuid in FULL state after 0 sec
Lustre: lustre-MDT0000-lwp-OST2710: Connection to lustre-MDT0000 (at 10.240.43.6@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-MDT0002-lwp-OST2710: Connection to lustre-MDT0002 (at 10.240.43.6@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 3 previous similar messages
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
Lustre: 583597:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1734163919/real 1734163919] req@00000000dc73e880 x1818401634843648/t0(0) o400->MGC10.240.43.6@tcp@10.240.43.6@tcp:26/25 lens 224/224 e 0 to 1 dl 1734163935 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
LustreError: MGC10.240.43.6@tcp: Connection to MGS (at 10.240.43.6@tcp) was lost; in progress operations using this service will fail
Autotest: Test running for 265 minutes (lustre-reviews_review-dne-part-3_109662.5)
Lustre: server umount lustre-OST2710 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost2
Lustre: server umount lustre-OST03e8 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost8' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-84vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-83vm4.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-84vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-69vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: trevis-69vm7.trevis.whamcloud.com: executing set_hostid
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost1_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey
LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost2_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=1 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost2_flakey
LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost3_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=2 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost3_flakey
LDISKFS-fs (dm-13): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost4_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=3 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost4_flakey
LDISKFS-fs (dm-14): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost5_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=4 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost5_flakey
LDISKFS-fs (dm-15): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Autotest: Test running for 270 minutes (lustre-reviews_review-dne-part-3_109662.5)
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost6_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=5 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost6_flakey
LDISKFS-fs (dm-16): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Lustre: DEBUG MARKER: [ -e /dev/mapper/ost7_flakey ]
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts || true
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=10.240.43.6@tcp --fsname=lustre --ost --index=6 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-b 4096 -E lazy_itable_init" --reformat /dev/mapper/ost7_flakey
LDISKFS-fs (dm-17): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Link to test
sanity test 256: Check llog delete for empty and not full state
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34]
Modules linked in: dm_flakey obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_blk [last unloaded: dm_flakey]
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OE -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffbd34c074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000006c10b867 RBX: ffffdd7701b042c0 RCX: 0000000000000200
RDX: 7fffffff93ef4798 RSI: ffff93b6ac10b000 RDI: ffff93b6b1cc7000
RBP: 000055fa91ec7000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff4 R12: ffff93b686de9638
R13: ffff93b6dac70000 R14: ffffdd7701c731c0 R15: ffff93b679568bc8
FS: 0000000000000000(0000) GS:ffff93b6ffd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055fa9417a478 CR3: 0000000099210004 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OEL -------- - - 4.18.0-553.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param mdd.lustre-MDT0000.changelog_mask -n
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.changelog_mask=+hsm
Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 changelog register -n
Lustre: lustre-MDD0000: changelog on
Autotest: Test running for 245 minutes (lustre-reviews_review-ldiskfs-ubuntu_109580.43)
Link to test
sanity test 65k: validate manual striping works properly with deactivated OSCs
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel virtio_net libata net_failover failover virtio_blk serio_raw
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffabb340753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002336c867 RBX: ffffd2c3008cdb00 RCX: 0000000000000200
RDX: 7fffffffdcc93798 RSI: ffff9b03e336c000 RDI: ffff9b03eabcc000
RBP: 000055e3561cc000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffe9 R12: ffff9b03d026ae60
R13: ffff9b040024d000 R14: ffffd2c300aaf300 R15: ffff9b03c36473a0
FS: 0000000000000000(0000) GS:ffff9b047fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa1097f3420 CR3: 000000003e810005 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 60 minutes (lustre-reviews_review-dne-zfs-part-1_109517.12)
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=trace+inode+super+iotrace+malloc+cache+info+ioctl+neterror+net+warning+buffs+other+dentry+nettrace+page+dlmtrace+error+emerg+ha+rpctrace+vfstrace+reada+mmap+config+console+quota+sec+lfsck+hsm+snapshot+layout
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=trace+inode+super+iotrace+malloc+cache+info+ioctl+neterror+net+warning+buffs+other+dentry+nettrace+page+dlmtrace+error+emerg+ha+rpctrace+vfstrace+reada+mmap+config+console+quota+sec+lfsck+hsm+snapshot+layout
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-107vm16.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 50
Lustre: DEBUG MARKER: trevis-107vm17.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 50
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=trace+inode+super+iotrace+malloc+cache+info+ioctl+neterror+net+warning+buffs+other+dentry+nettrace+page+dlmtrace+error+emerg+ha+rpctrace+vfstrace+reada+mmap+config+console+quota+sec+lfsck+hsm+snapshot+layout
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0
Link to test
parallel-scale test write_disjoint_tiny: write_disjoint_tiny
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: ib_core mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr joydev virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_blk serio_raw virtio_net net_failover failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb3b540753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 00000000085fc867 RBX: ffffdfb200217f00 RCX: 0000000000000200
RDX: 7ffffffff7a03798 RSI: ffff8ec8085fc000 RDI: ffff8ec813ac8000
RBP: 00007f9cc0ac8000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff3 R12: ffff8ec92898c640
R13: ffff8ec93bc40000 R14: ffffdfb2004eb200 R15: ffff8ec939f0b570
FS: 0000000000000000(0000) GS:ffff8ec93bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9cc0379024 CR3: 0000000107810006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.27.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 395 minutes (lustre-master_full-part-1_4595.1)
Autotest: Test running for 400 minutes (lustre-master_full-part-1_4595.1)
Autotest: Test running for 405 minutes (lustre-master_full-part-1_4595.1)
Link to test
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev pcspkr virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic virtio_net ata_piix net_failover crc32c_intel libata failover virtio_blk serio_raw
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-513.24.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffbbf580753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 00000000113ad825 RBX: fffff975c044eb40 RCX: 0000000000000200
RDX: 7fffffffeec527da RSI: ffffa094113ad000 RDI: ffffa0940e5ad000
RBP: 00005602a25ad000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 0000000000000000 R12: ffffa094047cdd68
R13: ffffa0947fc4a800 R14: fffff975c0396b40 R15: ffffa0940431a2b8
FS: 0000000000000000(0000) GS:ffffa094bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb66f573000 CR3: 000000007d610004 CR4: 00000000001706f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Link to test
sanity test 77l: preferred checksum type is remembered after reconnected
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: obdecho(OE) ptlrpc_gss(OE) osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover [last unloaded: lnet_selftest]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffb3bd80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000055ae7865 RBX: ffffeb784156b9c0 RCX: 0000000000000200
RDX: 7fffffffaa51879a RSI: ffff8cda15ae7000 RDI: ffff8cda5c3b6000
RBP: 000055cc40db6000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffff R12: ffff8cda4e37adb0
R13: ffff8cd9e0450000 R14: ffffeb784270ed80 R15: ffff8cd9c5bbd488
FS: 0000000000000000(0000) GS:ffff8cda7fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4d0a5fced0 CR3: 000000001de10006 CR4: 00000000001706f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to invalid, rc = 22
Lustre: DEBUG MARKER: set checksum type to invalid, rc = 22
Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to crc32, rc = 0
Lustre: DEBUG MARKER: set checksum type to crc32, rc = 0
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 0 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to adler, rc = 0
Lustre: DEBUG MARKER: set checksum type to adler, rc = 0
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Autotest: Test running for 120 minutes (lustre-reviews_review-ldiskfs-dne-arm_108974.57)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 17 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 17 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to crc32c, rc = 0
Lustre: DEBUG MARKER: set checksum type to crc32c, rc = 0
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to t10ip512, rc = 0
Lustre: DEBUG MARKER: set checksum type to t10ip512, rc = 0
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark set checksum type to t10ip4K, rc = 0
Lustre: DEBUG MARKER: set checksum type to t10ip4K, rc = 0
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state IDLE osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in IDLE state after 16 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: trevis-108vm7.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff000025090000.ost_server_uuid in FULL state after 0 sec
Link to test
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul i2c_piix4 ghash_clmulni_intel virtio_balloon joydev pcspkr sunrpc dm_mod ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net virtio_blk net_failover failover
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-513.5.1.el8_9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 80 44 40 00 9d 30 c0 e9 78 44 40 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 61 44 40 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff9ba940753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000025948867 RBX: fffff0c280965200 RCX: 0000000000000200
RDX: 7fffffffda6b7798 RSI: ffff88be65948000 RDI: ffff88beff1fb000
RBP: 000055dbf39fb000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffec R12: ffff88be45c1bfd8
R13: ffff88be5c47d000 R14: fffff0c282fc7ec0 R15: ffff88be44ee1e80
FS: 0000000000000000(0000) GS:ffff88beffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f45e8010de8 CR3: 0000000019e10005 CR4: 00000000001706f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L --------- - - 4.18.0-513.5.1.el8_9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Link to test
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34]
Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_mod sd_mod t10_pi sg iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sunrpc pcspkr joydev virtio_balloon i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata virtio_net net_failover virtio_blk failover serio_raw
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Not tainted 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa64700753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000013cb7867 RBX: ffffedbb004f2dc0 RCX: 0000000000000200
RDX: 7fffffffec348798 RSI: ffff97a513cb7000 RDI: ffff97a5424c4000
RBP: 00005645238c4000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff7 R12: ffff97a5061dd620
R13: ffff97a556260000 R14: ffffedbb01093100 R15: ffff97a5037d4000
FS: 0000000000000000(0000) GS:ffff97a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ec00718a0c CR3: 0000000054810004 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G L --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Link to test
sanity test 39k: write, utime, close, stat
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr i2c_piix4 intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon joydev pcspkr ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb74b80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000005b31a867 RBX: ffffe906016cc680 RCX: 0000000000000200
RDX: 7fffffffa4ce5798 RSI: ffff8af05b31a000 RDI: ffff8af04f37e000
RBP: 00005573f077e000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffec R12: ffff8af022c33bf0
R13: ffff8af029052800 R14: ffffe906013cdf80 R15: ffff8af02da88570
FS: 0000000000000000(0000) GS:ffff8af0bfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005573f1ce6038 CR3: 0000000026a10003 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 130 minutes (lustre-reviews_review-dne-zfs-part-4_108666.33)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_39k: @@@@@@ FAIL: mtime is lost on close: 1730302440, should be 1698679567
Lustre: DEBUG MARKER: sanity test_39k: @@@@@@ FAIL: mtime is lost on close: 1730302440, should be 1698679567
Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-2/2024-10-30/lustre-reviews_review-dne-zfs-part-4_108666_33_4a560301-b4bd-4ca3-9f01-ca4d8691e89b//sanity.test_39k.debug_log.$(hostname -s).1730302589.log;
Autotest: Test running for 135 minutes (lustre-reviews_review-dne-zfs-part-4_108666.33)
Link to test
sanity-compr test iozone: iozone
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34]
Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw(OE) intel_rapl_msr intel_rapl_common tls crct10dif_pclmul pci_hyperv_intf crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon pcspkr joydev sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_blk virtio_net net_failover failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 6e 41 00 9d 30 c0 e9 98 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffbe2d80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002aebd865 RBX: ffffe34fc0abaf40 RCX: 0000000000000200
RDX: 7fffffffd514279a RSI: ffff99bd6aebd000 RDI: ffff99bd6cee4000
RBP: 0000562030ae4000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff0 R12: ffff99bd6291e720
R13: ffff99bdffc52800 R14: ffffe34fc0b3b900 R15: ffff99bd44475e80
FS: 0000000000000000(0000) GS:ffff99bdffd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055555711c928 CR3: 0000000028210005 CR4: 00000000001706e0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 7189292kB available, using 5444512kB file size
Lustre: DEBUG MARKER: min OST has 7189292kB available, using 5444512kB file size
Autotest: Test running for 445 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 450 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 455 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 460 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 465 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 470 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 475 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 480 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 485 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 490 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 495 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 500 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 505 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Autotest: Test running for 510 minutes (lustre-b_es6_0_full-part-exa6_712.169)
Link to test
sanity-pcc test 1f: Test auto RW-PCC cache with non-root user
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:35]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon pcspkr joydev ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw failover virtio_blk
CPU: 1 PID: 35 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffb96040753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000001cf4c867 RBX: ffffdeef8073d300 RCX: 0000000000000200
RDX: 7fffffffe30b3798 RSI: ffff8c1a9cf4c000 RDI: ffff8c1acdd65000
RBP: 00007f9527d65000 R08: ffffdeef80c6a008 R09: ffffdeef805531c8
R10: 0000000000000000 R11: 0000000000000009 R12: ffff8c1ac83e3b28
R13: ffff8c1ab6670000 R14: ffffdeef81375940 R15: ffff8c1ad17f9570
FS: 0000000000000000(0000) GS:ffff8c1b3fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f952c6d4000 CR3: 0000000021810001 CR4: 00000000001706e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 35 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: zpool get all
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
Lustre: lustre-MDT0000: Not available for connect from 10.240.24.50@tcp (stopping)
LustreError: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
LustreError: Skipped 1 previous similar message
Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
LustreError: 60757:0:(ldlm_lockd.c:2575:ldlm_cancel_handler()) ldlm_cancel from 10.240.28.49@tcp arrived at 1727086208 with bad export cookie 4081145193744231493
LustreError: 60757:0:(ldlm_lockd.c:2575:ldlm_cancel_handler()) Skipped 4 previous similar messages
Lustre: 12409:0:(client.c:2363:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1727086195/real 1727086195] req@ffff8c1ad38ce3c0 x1810978681027840/t0(0) o400->MGC10.240.28.48@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1727086211 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
LustreError: MGC10.240.28.48@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
LustreError: lustre-MDT0000: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 113 previous similar messages
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3
Lustre: server umount lustre-MDT0002 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Link to test
sanityn test 16k: Parallel FSX and drop caches should not panic
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev pcspkr virtio_balloon ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata serio_raw virtio_net net_failover failover virtio_blk
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff9e7540753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000004cff7867 RBX: ffffee29c133fdc0 RCX: 0000000000000200
RDX: 7fffffffb3008798 RSI: ffff88f04cff7000 RDI: ffff88f059a33000
RBP: 0000558c59633000 R08: ffff88f0bfd00000 R09: ffffffff8c3c6880
R10: 0000000000000007 R11: 00000000fffffff2 R12: ffff88f047e1b198
R13: ffff88f02484a800 R14: ffffee29c1668cc0 R15: ffff88f005c28000
FS: 0000000000000000(0000) GS:ffff88f0bfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055d17cb05fb0 CR3: 0000000022e10002 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 230 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41)
Autotest: Test running for 235 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41)
Autotest: Test running for 240 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41)
Autotest: Test running for 245 minutes (lustre-reviews_review-dne-zfs-part-5_107746.41)
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#1 stuck for 24s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss intel_rapl_msr nfsv4 intel_rapl_common crct10dif_pclmul crc32_pclmul dns_resolver nfs lockd grace ghash_clmulni_intel fscache joydev pcspkr virtio_balloon i2c_piix4 sunrpc ata_generic ext4 mbcache jbd2 ata_piix libata virtio_net crc32c_intel net_failover serio_raw failover virtio_blk
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa43780753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000139303867 RBX: ffffd6a944e4c0c0 RCX: 0000000000000200
RDX: 7ffffffec6cfc798 RSI: ffff8bb0b9303000 RDI: ffff8bb03fcf9000
RBP: 00007f5da06f9000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff4 R12: ffff8bb0861c97c8
R13: ffff8bb0bbd88000 R14: ffffd6a942ff3e40 R15: ffff8bb083842740
FS: 0000000000000000(0000) GS:ffff8bb0bbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055e6f592f000 CR3: 0000000026010001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.16.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? syscall_return_via_sysret+0x6e/0x94
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_107315.29)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_107315.29)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_107315.29)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_107315.29)
Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_107315.29)
Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_107315.29)
Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_107315.29)
Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_107315.29)
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net net_failover crc32c_intel failover serio_raw virtio_blk
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb25640753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000014194867 RBX: ffffeec340506500 RCX: 0000000000000200
RDX: 7fffffffebe6b798 RSI: ffff99ad14194000 RDI: ffff99ad4986f000
RBP: 0000561bd1c6f000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff99ad03ead378
R13: ffff99ad5464d000 R14: ffffeec341261bc0 R15: ffff99ad05ea63a0
FS: 0000000000000000(0000) GS:ffff99adbfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005563520c03dc CR3: 0000000052c10003 CR4: 00000000000606f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 15 minutes (lustre-reviews_review-dne-part-9_106849.29)
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_106849.29)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_106849.29)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_106849.29)
Link to test
sanity-sec test 27a: test fileset in various nodemaps
watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [khugepaged:36]
Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 joydev virtio_balloon pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel virtio_blk serio_raw net_failover failover [last unloaded: dm_flakey]
CPU: 0 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff9c2c8075bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000030fc2867 RBX: ffffec7140c3f080 RCX: 0000000000000200
RDX: 7fffffffcf03d798 RSI: ffff8c6530fc2000 RDI: ffff8c65617d0000
RBP: 00007f99e25d0000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffffe R12: ffff8c653fc63e80
R13: ffff8c6536678000 R14: ffffec714185f400 R15: ffff8c653f8c2e80
FS: 0000000000000000(0000) GS:ffff8c65bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000557e0c6ff730 CR3: 0000000024810002 CR4: 00000000001706f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 36 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.fileset
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.fileset
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.*.identity_upcall=NONE
Lustre: 337649:0:(mdt_lproc.c:315:identity_upcall_store()) lustre-MDT0001: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS"
Lustre: 337649:0:(mdt_lproc.c:315:identity_upcall_store()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.active
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.default.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.admin_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.fileset
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
Link to test
sanity test 209: read-only open/close requests should be freed promptly
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:33]
Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 210: lfs getstripe does not break leases ============================================== 01:13:03 \(1723165983\)
Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw virtio_blk net_failover failover
CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 d0 6e 22 00 9d 30 c0 e9 c8 6e 22 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 b1 6e 22 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb5290074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000003ad41867 RBX: ffffdfab40eb5040 RCX: 0000000000000200
RDX: 0000000000000000 RSI: ffff97433ad41000 RDI: ffff97432f5ef000
RBP: 00007f34935ef000 R08: ffffdfab41364888 R09: ffff9743bffcf000
R10: 0000000000000002 R11: 0000000000000009 R12: ffff974304e6cf78
R13: ffff974380e50000 R14: ffffdfab40bd7bc0 R15: ffff9743482e2740
FS: 0000000000000000(0000) GS:ffff9743bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ab7c5add1c CR3: 000000007f410006 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8e4/0x1010
khugepaged+0xed0/0x11e0
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x10b/0x130
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
Lustre: 47087:0:(client.c:2325:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1723166009/real 1723166009] req@00000000dded327b x1806857619793280/t0(0) o400->MGC10.240.24.153@tcp@10.240.24.153@tcp:26/25 lens 224/224 e 0 to 1 dl 1723166018 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker.0'
Lustre: 47087:0:(client.c:2325:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LustreError: 166-1: MGC10.240.24.153@tcp: Connection to MGS (at 10.240.24.153@tcp) was lost; in progress operations using this service will fail
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: 47086:0:(client.c:2325:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1723165587/real 1723165587] req@0000000071f8c5a8 x1806857619776256/t0(0) o400->lustre-MDT0000-mdc-ffff9743050f2000@10.240.24.153@tcp:12/10 lens 224/224 e 0 to 1 dl 1723165630 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker.0'
Autotest: Test running for 205 minutes (lustre-b_es6_0_rolling-upgrade-client2_689.266)
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache virtio_balloon intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover virtio_blk serio_raw failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb312c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000001c697867 RBX: ffffedbdc071a5c0 RCX: 0000000000000200
RDX: 7fffffffe3968798 RSI: ffff9dcd5c697000 RDI: ffff9dcd7470a000
RBP: 0000563fd630a000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffffc R12: ffff9dcd45102850
R13: ffff9dcdd8448000 R14: ffffedbdc0d1c280 R15: ffff9dcd45c3fcb0
FS: 0000000000000000(0000) GS:ffff9dcdffd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555556d50928 CR3: 0000000096a10003 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_106337.29)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_106337.29)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_106337.29)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_106337.29)
Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_106337.29)
Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_106337.29)
Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_106337.29)
Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_106337.29)
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 ata_generic mbcache jbd2 ata_piix libata virtio_net crc32c_intel serio_raw net_failover virtio_blk failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa174c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002b087867 RBX: ffffce26c0ac21c0 RCX: 0000000000000200
RDX: 7fffffffd4f78798 RSI: ffff94fd2b087000 RDI: ffff94fd2708d000
RBP: 00005568e088d000 R08: ffff94fdbfd00000 R09: ffffffff8a5c5860
R10: 0000000000000007 R11: 00000000fffffffc R12: ffff94fd2236d468
R13: ffff94fd5464d000 R14: ffffce26c09c2340 R15: ffff94fd0605d2b8
FS: 0000000000000000(0000) GS:ffff94fdbfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9cb629a000 CR3: 0000000052c10004 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.8.1.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_106152.29)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_106152.29)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_106152.29)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_106152.29)
Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_106152.29)
Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_106152.29)
Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_106152.29)
Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_106152.29)
Lustre: 8603:0:(client.c:2362:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1721091491/real 0] req@ffff94fd0fafb0c0 x1804691444261888/t0(0) o400->lustre-MDT0001-mdc-ffff94fd03f5d800@10.240.23.50@tcp:12/10 lens 224/224 e 0 to 1 dl 1721091507 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: lustre-MDT0001-mdc-ffff94fd03f5d800: Connection to lustre-MDT0001 (at 10.240.23.50@tcp) was lost; in progress operations using this service will wait for recovery to complete
Link to test
replay-single test 135: Server failure in lock replay phase
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover virtio_blk failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffa94480753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002b0b7867 RBX: ffffe103c0ac2dc0 RCX: 0000000000000200
RDX: 7fffffffd4f48798 RSI: ffff94cbeb0b7000 RDI: ffff94cbd3b04000
RBP: 00007fea35b04000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffe9 R12: ffff94cc1fcde820
R13: ffff94cc28660000 R14: ffffe103c04ec100 R15: ffff94cc1c551570
FS: 0000000000000000(0000) GS:ffff94cc7fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fea341b2024 CR3: 0000000066c10003 CR4: 00000000000606e0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: sync; sync; sync
Autotest: Test running for 205 minutes (lustre-master_full-dne-zfs-part-1_4552.10)
Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-OST0000 notransno
Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-OST0000 readonly
LustreError: 342105:0:(osd_handler.c:698:osd_ro()) lustre-OST0000: *** setting device osd-zfs read-only ***
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
Link to test
sanity-pcc test 20: Auto attach works after the inode was once evicted from cache
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic crc32c_intel virtio_net ata_piix libata net_failover failover serio_raw virtio_blk
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffbae280753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000053d2d867 RBX: ffffe266c14f4b40 RCX: 0000000000000200
RDX: 7fffffffac2d2798 RSI: ffff9b0c93d2d000 RDI: ffff9b0c9c7f3000
RBP: 0000561ccdff3000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff4 R12: ffff9b0c79e7ef98
R13: ffff9b0c5a65d000 R14: ffffe266c171fcc0 R15: ffff9b0c6f52fae0
FS: 0000000000000000(0000) GS:ffff9b0cffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffda90d7224 CR3: 0000000018c10003 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param debug=-1 debug_mb=150
Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-pcc test_20: @@@@@@ FAIL: \/mnt\/lustre\/f20.sanity-pcc expected pcc state: readwrite, but got: none
Lustre: DEBUG MARKER: sanity-pcc test_20: @@@@@@ FAIL: /mnt/lustre/f20.sanity-pcc expected pcc state: readwrite, but got: none
Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2024-07-07/lustre-b_es-reviews_review-dne-part-7_19423_16_0fb3f3a1-5141-4af6-8a5e-eeb2d13c9cc1//sanity-pcc.test_20.debug_log.$(hostname -s).1720388886.log;
Autotest: Test running for 195 minutes (lustre-b_es-reviews_review-dne-part-7_19423.16)
Lustre: lustre-MDT0001: Client c497ef3c-da29-4a94-81f5-e8dd962f064b (at 10.240.39.85@tcp) reconnecting
Lustre: lustre-MDT0003: Received new MDS connection from 10.240.39.89@tcp, keep former export from same NID
Lustre: HSM agent c497ef3c-da29-4a94-81f5-e8dd962f064b already registered
Lustre: 386229:0:(service.c:2157:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 8s req@00000000b8b4df30 x1803946525602432/t0(0) o400->lustre-MDT0002-mdtlov_UUID@10.240.39.89@tcp:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'kworker.0'
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34]
Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr virtio_balloon joydev i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw virtio_blk failover
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa88680753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000022788867 RBX: ffffde970089e200 RCX: 0000000000000200
RDX: 7fffffffdd877798 RSI: ffff9af0a2788000 RDI: ffff9af0a2f1e000
RBP: 000055cd3111e000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff9af0842468f0
R13: ffff9af13fd48000 R14: ffffde97008bc780 R15: ffff9af09b1dd3a0
FS: 0000000000000000(0000) GS:ffff9af13fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555d9021ad2c CR3: 0000000009810002 CR4: 00000000000606e0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.el8_10.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 15 minutes (lustre-reviews_review-dne-part-9_105840.29)
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_105840.29)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_105840.29)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_105840.29)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_105840.29)
Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_105840.29)
Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_105840.29)
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:31]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw virtio_blk net_failover failover
CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffa9b600733d48 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000077703867 RBX: ffffdbae41b43640 RCX: 0000000000000200
RDX: 7fffffff888fc798 RSI: ffff9596f7703000 RDI: ffff9596ed0d9000
RBP: 0000555eab8d9000 R08: ffffdbae42480518 R09: ffff95973ffce000
R10: 00000000000305c0 R11: ffffffffffffffe8 R12: ffffdbae41ddc0c0
R13: ffff959727d536c8 R14: ffff9596b64adf00 R15: ffff95971357f2b8
FS: 0000000000000000(0000) GS:ffff95973fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000313000 CR3: 000000003880a004 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x6b6/0xf10
khugepaged+0xb5b/0x1150
? finish_wait+0x80/0x80
? collapse_huge_page+0xf10/0xf10
kthread+0x112/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x80
panic+0xe7/0x2a9
? __switch_to_asm+0x51/0x70
watchdog_timer_fn.cold.8+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 115 minutes (lustre-b2_15_full-part-1_94.100)
Autotest: Test running for 120 minutes (lustre-b2_15_full-part-1_94.100)
Autotest: Test running for 125 minutes (lustre-b2_15_full-part-1_94.100)
sanity-quota test 7a: Quota reintegration (global index)
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 virtio_balloon ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw net_failover virtio_blk failover
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff98cb00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000000fa06867 RBX: ffffeb28403e8180 RCX: 0000000000000200
RDX: 7ffffffff05f9798 RSI: ffff8d310fa06000 RDI: ffff8d3144da3000
RBP: 000055e6577a3000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff9 R12: ffff8d3142587d18
R13: ffff8d312d44d000 R14: ffffeb28411368c0 R15: ffff8d313b550e80
FS: 0000000000000000(0000) GS:ffff8d31bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ba98fa01ec CR3: 000000002ba10003 CR4: 00000000001706f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8f2/0x1020
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL -------- - - 4.18.0-553.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0
Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.enabled
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.enabled
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.enabled
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled
Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost1
Lustre: Failing over lustre-OST0000
Lustre: server umount lustre-OST0000 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: ! zpool list -H lustre-ost1 >/dev/null 2>&1 ||
LustreError: lustre-OST0000: not available for connect from 10.240.26.176@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: lustre-OST0000: not available for connect from 10.240.26.176@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled
LustreError: 57797:0:(qsd_reint.c:633:qqi_reint_delayed()) lustre-OST0001: Delaying reintegration for qtype:0 until pending updates are flushed.
LustreError: 57797:0:(qsd_reint.c:633:qqi_reint_delayed()) Skipped 5 previous similar messages
LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1
Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs;
LustreError: lustre-OST0000: not available for connect from 10.240.26.176@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 2 previous similar messages
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre -o localrecov lustre-ost1/ost1 /mnt/lustre-ost1
LustreError: lustre-OST0000: not available for connect from 10.240.26.186@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 6 previous similar messages
Lustre: lustre-OST0000: Imperative Recovery enabled, recovery window shrunk from 60-180 down to 60-180
Lustre: lustre-OST0000: in recovery but waiting for the first client to connect
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST0000-super.width=65536
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-OST0000: Recovery over after 0:09, of 2 clients 2 recovered and 0 were evicted.
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all
LustreError: 36287:0:(qsd_reint.c:633:qqi_reint_delayed()) lustre-OST0000: Delaying reintegration for qtype:0 until pending updates are flushed.
LustreError: 36287:0:(qsd_reint.c:633:qqi_reint_delayed()) Skipped 2 previous similar messages
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-ost1/ost1 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=+quota+trace
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.info |
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.info |
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475
Autotest: Test running for 40 minutes (lustre-reviews_review-zfs_105753.4)
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.info |
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.info |
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.info |
Lustre: DEBUG MARKER: /usr/sbin/lctl dl
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: onyx-80vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.info |
parallel-scale test write_append_truncate: write_append_truncate
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs intel_rapl_msr lockd intel_rapl_common grace fscache crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net virtio_blk net_failover failover [last unloaded: dm_flakey]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb31c80753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000036fb9867 RBX: ffffede680dbee40 RCX: 0000000000000200
RDX: 7fffffffc9046798 RSI: ffff983af6fb9000 RDI: ffff983ac087d000
RBP: 0000562be067d000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff3 R12: ffff983b184e53e8
R13: ffff983b6946d000 R14: ffffede680021f40 R15: ffff983ac3c29740
FS: 0000000000000000(0000) GS:ffff983b7fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b47cd31da0 CR3: 00000000a7a10002 CR4: 00000000001706f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 455 minutes (lustre-master_full-dne-part-1_4543.7)
Autotest: Test running for 460 minutes (lustre-master_full-dne-part-1_4543.7)
Autotest: Test running for 465 minutes (lustre-master_full-dne-part-1_4543.7)
sanity-quota test 18: MDS failover while writing, no watchdog triggered (b14840)
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: ib_core nfsv3 nfsd nfs_acl mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover failover serio_raw virtio_blk
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.10.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 7e 41 00 9d 30 c0 e9 98 7e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 7e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffac6a00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000001359e845 RBX: fffffa2c404d6780 RCX: 0000000000000200
RDX: 7fffffffeca617ba RSI: ffff9cc71359e000 RDI: ffff9cc73405a000
RBP: 0000561167a5a000 R08: fffffa2c40b55d88 R09: 0000000000000000
R10: 0000000000000007 R11: 00000000fffffffd R12: ffff9cc7295312d0
R13: ffff9cc766865000 R14: fffffa2c40d01680 R15: ffff9cc7357bebc8
FS: 0000000000000000(0000) GS:ffff9cc7bfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056548c7267d0 CR3: 0000000064e10002 CR4: 00000000000606e0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.10.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark User quota \(limit: 200\)
Lustre: DEBUG MARKER: User quota (limit: 200)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Write 100M \(buffered\) ...
Lustre: DEBUG MARKER: Write 100M (buffered) ...
Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Fail mds for 0 seconds
Lustre: DEBUG MARKER: Fail mds for 0 seconds
Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1717593672/real 1717593672] req@00000000c35ba9d6 x1801007592684736/t0(0) o400->lustre-MDT0000-mdc-ffff9cc7296c7800@10.240.30.158@tcp:12/10 lens 224/224 e 0 to 1 dl 1717593681 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection to lustre-MDT0000 (at 10.240.30.158@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.30.158@tcp: Connection to MGS (at 10.240.30.158@tcp) was lost; in progress operations using this service will fail
Lustre: Evicted from MGS (at 10.240.30.158@tcp) after server handle changed from 0x357653082f39c9cc to 0x357653082f494d50
Lustre: MGC10.240.30.158@tcp: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-130vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
LustreError: 167-0: lustre-MDT0000-mdc-ffff9cc7296c7800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
LustreError: Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp)
Lustre: DEBUG MARKER: onyx-130vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1717593784/real 1717593786] req@000000002e85f9e8 x1801007592687424/t0(0) o400->lustre-MDT0000-mdc-ffff9cc7296c7800@10.240.30.158@tcp:12/10 lens 224/224 e 0 to 1 dl 1717593793 ref 1 fl Rpc:RXNQ/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 3716:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection to lustre-MDT0000 (at 10.240.30.158@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp)
Autotest: Test running for 320 minutes (lustre-b_es6_0_full-part-2_650.120)
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-60vm4.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection to lustre-MDT0000 (at 10.240.30.158@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-MDT0000-mdc-ffff9cc7296c7800: Connection restored to 10.240.30.158@tcp (at 10.240.30.158@tcp)
Lustre: DEBUG MARKER: onyx-60vm4.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
sanity test 133f: Check reads/writes of client lustre proc files with bad area io
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34]
Modules linked in: dm_flakey obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr joydev i2c_piix4 virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: dm_flakey]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa43100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000076051867 RBX: ffffca4f41d81440 RCX: 0000000000000200
RDX: 7fffffff89fae798 RSI: ffff94f2b6051000 RDI: ffff94f27e2f5000
RBP: 000055796c4f5000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff8 R12: ffff94f26bc477a8
R13: ffff94f2b5252800 R14: ffffca4f40f8bd40 R15: ffff94f245728d98
FS: 0000000000000000(0000) GS:ffff94f2ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055d4db1322b8 CR3: 0000000072c10002 CR4: 00000000000606f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
Lustre: Failing over lustre-MDT0000
Lustre: lustre-MDT0000: Not available for connect from 10.240.23.65@tcp (stopping)
Lustre: lustre-MDT0000: Not available for connect from 10.240.23.64@tcp (stopping)
Lustre: lustre-MDT0000: Not available for connect from 10.240.23.66@tcp (stopping)
Lustre: lustre-MDT0000: Not available for connect from 10.240.23.65@tcp (stopping)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-MDT0000: Not available for connect from 10.240.23.65@tcp (stopping)
Lustre: Skipped 9 previous similar messages
LustreError: 137-5: lustre-MDT0000: not available for connect from 10.240.23.66@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Autotest: Test running for 165 minutes (lustre-reviews_review-ldiskfs_103807.37)
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1
Lustre: DEBUG MARKER: modprobe -r dm-flakey
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: test -b /dev/vg_Role_MDS/mdt1
Lustre: DEBUG MARKER: blockdev --getsz /dev/vg_Role_MDS/mdt1 2>/dev/null
Lustre: DEBUG MARKER: dmsetup create mds1_flakey --table "0 4194304 linear /dev/vg_Role_MDS/mdt1 0"
Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1
Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
Lustre: lustre-OST0002-osc-MDT0000: update sequence from 0x2c0000bd2 to 0x2c0000bd3
Lustre: lustre-OST0005-osc-MDT0000: update sequence from 0x380000bd2 to 0x380000bd3
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-36vm2: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-36vm1: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: lustre-OST0003-osc-MDT0000: update sequence from 0x300000bd2 to 0x300000bd3
Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x280000bd2 to 0x280000bd3
Lustre: DEBUG MARKER: onyx-36vm2: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-36vm1: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
LustreError: 682462:0:(osp_precreate.c:670:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -5
LustreError: 682462:0:(osp_precreate.c:1374:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -5
Lustre: lustre-MDT0000: Not available for connect from 10.240.23.66@tcp (stopping)
Lustre: Skipped 2 previous similar messages
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1
Lustre: DEBUG MARKER: modprobe -r dm-flakey
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds1' ' /proc/mounts);
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds1' ' /proc/mounts);
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: test -b /dev/vg_Role_MDS/mdt1
Lustre: DEBUG MARKER: blockdev --getsz /dev/vg_Role_MDS/mdt1 2>/dev/null
Lustre: DEBUG MARKER: dmsetup create mds1_flakey --table "0 4194304 linear /dev/vg_Role_MDS/mdt1 0"
Lustre: DEBUG MARKER: dmsetup mknodes >/dev/null 2>&1
Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: /sbin/lctl mark onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: onyx-36vm6.onyx.whamcloud.com: executing set_default_debug all all
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
Autotest: Test running for 170 minutes (lustre-reviews_review-ldiskfs_103807.37)
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon pcspkr i2c_piix4 joydev sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw virtio_blk net_failover failover
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa712c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000010029867 RBX: ffffe8cd40400a40 RCX: 0000000000000200
RDX: 7fffffffeffd6798 RSI: ffff940950029000 RDI: ffff9409787b4000
RBP: 000056482d5b4000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff940974be9da0
R13: ffff94098ee55000 R14: ffffe8cd40e1ed00 R15: ffff9409451e5828
FS: 0000000000000000(0000) GS:ffff9409ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9af5b47024 CR3: 000000004c810005 CR4: 00000000000606f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Autotest: Test running for 20 minutes (lustre-reviews_review-dne-part-9_103593.18)
Autotest: Test running for 25 minutes (lustre-reviews_review-dne-part-9_103593.18)
Autotest: Test running for 30 minutes (lustre-reviews_review-dne-part-9_103593.18)
Autotest: Test running for 35 minutes (lustre-reviews_review-dne-part-9_103593.18)
Autotest: Test running for 40 minutes (lustre-reviews_review-dne-part-9_103593.18)
Autotest: Test running for 45 minutes (lustre-reviews_review-dne-part-9_103593.18)
Autotest: Test running for 50 minutes (lustre-reviews_review-dne-part-9_103593.18)
Autotest: Test running for 55 minutes (lustre-reviews_review-dne-part-9_103593.18)
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon joydev pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw virtio_blk failover
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff9fd100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000001dde0867 RBX: fffff20ec0777800 RCX: 0000000000000200
RDX: 7fffffffe221f798 RSI: ffff89579dde0000 RDI: ffff895798767000
RBP: 000056070bf67000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffea R12: ffff895783e9db38
R13: ffff895812255000 R14: fffff20ec061d9c0 R15: ffff8957859ed910
FS: 0000000000000000(0000) GS:ffff89583fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056070bf193e8 CR3: 000000008fc10003 CR4: 00000000000606f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon pcspkr joydev sunrpc ext4 mbcache jbd2 ata_generic ata_piix virtio_net libata crc32c_intel net_failover failover virtio_blk serio_raw
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.18.1.el8_9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 cc cc cc cc 9d 30 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa67080753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000027570867 RBX: ffffebbf009d5c00 RCX: 0000000000000200
RDX: 7fffffffd8a8f798 RSI: ffff9477a7570000 RDI: ffff9477b93a4000
RBP: 000055bb0dba4000 R08: ffffebbf009d5848 R09: ffff94783ffcf000
R10: 0000000000000001 R11: ffffebbf00b50688 R12: ffff947783995d20
R13: ffff94780924d000 R14: ffffebbf00e4e900 R15: ffff9477855b2910
FS: 0000000000000000(0000) GS:ffff94783fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000558bdbccfa8c CR3: 0000000086c10003 CR4: 00000000001706f0
Call Trace:
<IRQ>
? watchdog_timer_fn.cold.10+0x46/0x9e
? watchdog+0x30/0x30
? __hrtimer_run_queues+0x101/0x280
? hrtimer_interrupt+0x100/0x220
? smp_apic_timer_interrupt+0x6a/0x130
? apic_timer_interrupt+0xf/0x20
</IRQ>
? copy_page+0x7/0x10
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.18.1.el8_9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x11/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover failover serio_raw virtio_blk
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 6e 41 00 9d 30 c0 e9 98 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffabef8074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000014a40867 RBX: ffffe20bc0529000 RCX: 0000000000000200
RDX: 7fffffffeb5bf798 RSI: ffff9dc0d4a40000 RDI: ffff9dc0ec2a5000
RBP: 0000557f40ea5000 R08: ffffe20bc0532e08 R09: ffff9dc17ffcf000
R10: 0000000000000000 R11: ffffe20bc053d188 R12: ffff9dc0e6080528
R13: ffff9dc17fc45000 R14: ffffe20bc0b0a940 R15: ffff9dc0e61fdd98
FS: 0000000000000000(0000) GS:ffff9dc17fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007eff93769100 CR3: 0000000029410005 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw net_failover virtio_blk failover [last unloaded: dm_flakey]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffb05a4074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000043ee4867 RBX: ffffeb0dc10fb900 RCX: 0000000000000200
RDX: 7fffffffbc11b798 RSI: ffff9bba03ee4000 RDI: ffff9bba13fc4000
RBP: 000055fd8a7c4000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff1 R12: ffff9bb9e7e6fe20
R13: ffff9bba40a72800 R14: ffffeb0dc14ff100 R15: ffff9bba13295e80
FS: 0000000000000000(0000) GS:ffff9bba7fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbbaf370d38 CR3: 000000007f010004 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
? mutex_lock+0xe/0x30
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa47180753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000028646865 RBX: fffffbcb80a19180 RCX: 0000000000000200
RDX: 7fffffffd79b979a RSI: ffff913528646000 RDI: ffff9135241f2000
RBP: 000055699e3f2000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff5 R12: ffff9135360daf90
R13: ffff91359b87d000 R14: fffffbcb80907c80 R15: ffff913505cd61d0
FS: 0000000000000000(0000) GS:ffff9135bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564dfb065018 CR3: 0000000099e10004 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
sanity test 17o: stat file with incompat LMA feature
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:34]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc dm_mod ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw virtio_blk net_failover failover [last unloaded: obdecho]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff9e35c0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000033766865 RBX: ffffd0dd80cdd980 RCX: 0000000000000200
RDX: 7fffffffcc89979a RSI: ffff91c1f3766000 RDI: ffff91c24fdad000
RBP: 0000561f7c9ad000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffef R12: ffff91c1dfdefd68
R13: ffff91c224052800 R14: ffffd0dd823f6b40 R15: ffff91c1c297bae0
FS: 0000000000000000(0000) GS:ffff91c27fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ab08a86368 CR3: 0000000062610004 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-95vm12.trevis.whamcloud.com: executing set_default_debug all all 4
Lustre: DEBUG MARKER: trevis-95vm12.trevis.whamcloud.com: executing set_default_debug all all 4
Lustre: lustre-OST0006: deleting orphan objects from 0x3c0000bd0:6 to 0x3c0000bd0:33
Lustre: lustre-OST0000: deleting orphan objects from 0x280000bd0:6 to 0x280000bd0:33
Lustre: lustre-OST0001: deleting orphan objects from 0x240000400:87846 to 0x240000400:87873
Lustre: lustre-OST0002: deleting orphan objects from 0x2c0000bd0:7 to 0x2c0000bd0:33
Lustre: lustre-OST0005: deleting orphan objects from 0x380000bd0:7 to 0x380000bd0:33
Lustre: lustre-OST0004: deleting orphan objects from 0x340000400:87943 to 0x340000400:87969
Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071801/real 1708071801] req@ffff91c22bca96c0 x1791006021629312/t0(0) o400->MGC10.240.43.223@tcp@10.240.43.223@tcp:26/25 lens 224/224 e 0 to 1 dl 1708071817 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0003: deleting orphan objects from 0x300000bd0:7 to 0x300000bd0:33
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.43.223@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to (at 10.240.43.223@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-MDT0000-lwp-OST0002: Connection restored to (at 10.240.43.223@tcp)
Lustre: Skipped 1 previous similar message
Lustre: Evicted from MGS (at 10.240.43.223@tcp) after server handle changed from 0xf9122bd54ccdc1e9 to 0xf9122bd54ccde7eb
Lustre: 66604:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071810/real 1708071810] req@ffff91c20ba15a00 x1791006021629888/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.43.223@tcp:12/10 lens 224/224 e 0 to 1 dl 1708071826 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 66604:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: 'ost_seq' is processing requests too slowly, client may timeout. Late by 6s, missed 1 early replies (reqs waiting=0 active=1, at_estimate=5, delay=11224ms)
Lustre: 1280748:0:(service.c:1397:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff91c1fd68b740 x1791024350124544/t0(0) o700->lustre-MDT0000-mdtlov_UUID@10.240.43.223@tcp:601/0 lens 264/248 e 0 to 0 dl 1708071831 ref 2 fl Interpret:/200/0 rc 0/0 job:'osp-pre-4-0.0' uid:0 gid:0
Lustre: lustre-OST0004: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting
Lustre: Skipped 2 previous similar messages
Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071843/real 1708071843] req@ffff91c1d20623c0 x1791006021633088/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.43.223@tcp:12/10 lens 224/224 e 0 to 1 dl 1708071859 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
Lustre: 66605:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.43.223@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail
LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5
Lustre: ll_ost_seq00_00: service thread pid 1280749 was inactive for 45.610 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 1280749, comm: ll_ost_seq00_00 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Jan 3 18:54:51 UTC 2024
Call Trace TBD:
[<0>] cv_wait_common+0xaf/0x130 [spl]
[<0>] txg_wait_synced_impl+0xc6/0x110 [zfs]
[<0>] txg_wait_synced+0xc/0x40 [zfs]
[<0>] osd_trans_stop+0x510/0x550 [osd_zfs]
[<0>] seq_store_update+0x2ff/0x9c0 [fid]
[<0>] seq_server_check_and_alloc_super+0x96/0x2f0 [fid]
[<0>] seq_server_alloc_meta+0x66/0x650 [fid]
[<0>] seq_handler+0x590/0x5a0 [fid]
[<0>] tgt_request_handle+0x3f4/0x19a0 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x3ca/0xbd0 [ptlrpc]
[<0>] ptlrpc_main+0xc90/0x15b0 [ptlrpc]
[<0>] kthread+0x134/0x150
[<0>] ret_from_fork+0x35/0x40
Lustre: lustre-OST0006: haven't heard from client 2f61269b-4f35-4d53-b08c-097cfab6ac41 (at 10.240.43.204@tcp) in 32 seconds. I think it's dead, and I am evicting it. exp ffff91c217125000, cur 1708071882 expire 1708071852 last 1708071850
Lustre: lustre-OST0005: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting
LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5
Lustre: lustre-OST0005: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting
Lustre: Skipped 2 previous similar messages
LNet: There was an unexpected network error while writing to 10.240.43.204: rc = -32
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to (at 10.240.43.223@tcp)
Lustre: Skipped 5 previous similar messages
Lustre: lustre-OST0000: Client 2f61269b-4f35-4d53-b08c-097cfab6ac41 (at 10.240.43.204@tcp) reconnecting
Lustre: Skipped 2 previous similar messages
Lustre: ll_ost00_001: service thread pid 1280741 was inactive for 79.388 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 1280741, comm: ll_ost00_001 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Jan 3 18:54:51 UTC 2024
Call Trace TBD:
[<0>] dbuf_read_done+0x19/0x100 [zfs]
[<0>] arc_read+0xb1e/0x1500 [zfs]
[<0>] dbuf_read_impl.constprop.32+0x26c/0x6b0 [zfs]
[<0>] dbuf_read+0x1b5/0x580 [zfs]
[<0>] dbuf_hold_impl+0x484/0x630 [zfs]
[<0>] dbuf_hold_level+0x2b/0x60 [zfs]
[<0>] dmu_tx_check_ioerr+0x32/0xd0 [zfs]
[<0>] dmu_tx_hold_zap_impl+0x70/0x80 [zfs]
[<0>] osd_declare_destroy+0x232/0x480 [osd_zfs]
[<0>] ofd_destroy+0x1ae/0xb50 [ofd]
[<0>] ofd_destroy_by_fid+0x25e/0x4a0 [ofd]
[<0>] ofd_orphans_destroy+0x248/0x910 [ofd]
[<0>] ofd_create_hdl+0x189a/0x19b0 [ofd]
[<0>] tgt_request_handle+0x3f4/0x19a0 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x3ca/0xbd0 [ptlrpc]
[<0>] ptlrpc_main+0xc90/0x15b0 [ptlrpc]
[<0>] kthread+0x134/0x150
[<0>] ret_from_fork+0x35/0x40
Lustre: ll_ost00_003: service thread pid 1281652 was inactive for 82.326 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 1281652, comm: ll_ost00_003 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Jan 3 18:54:51 UTC 2024
Call Trace TBD:
[<0>] dbuf_read_done+0x19/0x100 [zfs]
[<0>] arc_read+0xb1e/0x1500 [zfs]
[<0>] dbuf_read_impl.constprop.32+0x26c/0x6b0 [zfs]
[<0>] dbuf_read+0x1b5/0x580 [zfs]
[<0>] dbuf_hold_impl+0x484/0x630 [zfs]
[<0>] dbuf_hold_level+0x2b/0x60 [zfs]
[<0>] dmu_tx_check_ioerr+0x32/0xd0 [zfs]
[<0>] dmu_tx_hold_zap_impl+0x70/0x80 [zfs]
[<0>] osd_declare_destroy+0x232/0x480 [osd_zfs]
[<0>] ofd_destroy+0x1ae/0xb50 [ofd]
[<0>] ofd_destroy_by_fid+0x25e/0x4a0 [ofd]
[<0>] ofd_orphans_destroy+0x248/0x910 [ofd]
[<0>] ofd_create_hdl+0x189a/0x19b0 [ofd]
[<0>] tgt_request_handle+0x3f4/0x19a0 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x3ca/0xbd0 [ptlrpc]
[<0>] ptlrpc_main+0xc90/0x15b0 [ptlrpc]
[<0>] kthread+0x134/0x150
[<0>] ret_from_fork+0x35/0x40
Lustre: ll_ost00_006: service thread pid 1281656 was inactive for 89.030 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Lustre: lustre-OST0004: Client lustre-MDT0000-mdtlov_UUID (at 10.240.43.223@tcp) reconnecting
Lustre: Skipped 5 previous similar messages
Lustre: 1280739:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708071910/real 1708071910] req@ffff91c24a16c9c0 x1791006021637824/t0(0) o503->MGC10.240.43.223@tcp@10.240.43.223@tcp:26/25 lens 272/8416 e 0 to 1 dl 1708071928 ref 2 fl Rpc:XQr/200/ffffffff rc 0/-1 job:'ll_cfg_requeue.0' uid:0 gid:0
Lustre: 1280739:0:(client.c:2337:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail
LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5
Lustre: MGC10.240.43.223@tcp: Connection restored to (at 10.240.43.223@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-95vm10.trevis.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
LustreError: 166-1: MGC10.240.43.223@tcp: Connection to MGS (at 10.240.43.223@tcp) was lost; in progress operations using this service will fail
LustreError: 1280739:0:(mgc_request.c:619:do_requeue()) failed processing log: -5
Lustre: MGC10.240.43.223@tcp: Connection restored to (at 10.240.43.223@tcp)
Lustre: DEBUG MARKER: trevis-95vm10.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Link to test
sanity test 259: crash at delayed truncate
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:34]
Modules linked in: dm_flakey osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover failover virtio_blk [last unloaded: obdecho]
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb67100753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000085b46865 RBX: fffff7cf4216d180 RCX: 0000000000000200
RDX: 7fffffff7a4b979a RSI: ffff898b45b46000 RDI: ffff898adef00000
RBP: 000055a63b500000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffed R12: ffff898afda68800
R13: ffff898b37462800 R14: fffff7cf407bc000 R15: ffff898afd91e3a0
FS: 0000000000000000(0000) GS:ffff898b7fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3cf7d37008 CR3: 0000000075a10002 CR4: 00000000001706e0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G W OEL --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-*.*OST0000.kbytesfree
Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-*.*OST0000.kbytesfree
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x2301
Lustre: *** cfs_fail_loc=2301, val=0***
Lustre: Skipped 3 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-*.*OST0000.kbytesfree
Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost1
Lustre: Failing over lustre-OST0000
LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 2 previous similar messages
Lustre: server umount lustre-OST0000 complete
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 4 previous similar messages
Lustre: DEBUG MARKER: modprobe dm-flakey;
LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 4 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1
Lustre: DEBUG MARKER: modprobe dm-flakey;
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1
Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey 2>&1
Lustre: DEBUG MARKER: test -b /dev/mapper/ost1_flakey
Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey
LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 9 previous similar messages
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre -o localrecov /dev/mapper/ost1_flakey /mnt/lustre-ost1
LDISKFS-fs (dm-9): 1 truncate cleaned up
LDISKFS-fs (dm-9): recovery complete
LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.240.26.116@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 19 previous similar messages
Lustre: lustre-OST0000: Imperative Recovery not enabled, recovery window 60-180
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0000: in recovery but waiting for the first client to connect
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param seq.cli-lustre-OST0000-super.width=16384
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: lustre-OST0000: Recovery over after 0:03, of 5 clients 5 recovered and 0 were evicted.
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Link to test
hot-pools test 8: lamigo: start with debug (-b) command line option
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34]
Modules linked in: ib_core mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sunrpc pcspkr joydev virtio_balloon i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net serio_raw net_failover failover virtio_blk
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 a0 6e 41 00 9d 30 c0 e9 98 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 81 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffff9df68074bd18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002b419867 RBX: fffff887c0ad0640 RCX: 0000000000000200
RDX: 7fffffffd4be6798 RSI: ffff89f66b419000 RDI: ffff89f6799c6000
RBP: 000055ada3bc6000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffef R12: ffff89f666218e30
R13: ffff89f6ffc4d000 R14: fffff887c0e67180 R15: ffff89f6660c39f8
FS: 0000000000000000(0000) GS:ffff89f6ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555a7b084e58 CR3: 0000000050a10001 CR4: 00000000001706f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-72vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-72vm10.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-72vm3.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-72vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-72vm10.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-72vm9.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Link to test
sanityn test 43k: unlink vs create
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:34]
Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover failover virtio_blk [last unloaded: dm_flakey]
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa5ef00753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 0000000034b0d867 RBX: ffffce81c0d2c340 RCX: 0000000000000200
RDX: 7fffffffcb4f2798 RSI: ffff912574b0d000 RDI: ffff9125816ce000
RBP: 000055697e4ce000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffe8 R12: ffff91256ca0a670
R13: ffff9125ba24d000 R14: ffffce81c105b380 R15: ffff91254597a2b8
FS: 0000000000000000(0000) GS:ffff9125ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000558b87975f44 CR3: 0000000078810001 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 385063:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 385061:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 385061:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 385061:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4498
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 385063:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 385061:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 385063:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) Skipped 2 previous similar messages
LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) Skipped 2 previous similar messages
LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497
LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) Skipped 2 previous similar messages
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) Skipped 5 previous similar messages
LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) Skipped 5 previous similar messages
LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4498
LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 410181:0:(libcfs_fail.h:169:cfs_race()) Skipped 9 previous similar messages
LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) Skipped 9 previous similar messages
LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4497
LustreError: 410181:0:(libcfs_fail.h:178:cfs_race()) Skipped 9 previous similar messages
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 169 sleeping
LustreError: 388722:0:(libcfs_fail.h:169:cfs_race()) Skipped 21 previous similar messages
LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 169 waking
LustreError: 385062:0:(libcfs_fail.h:180:cfs_race()) Skipped 21 previous similar messages
LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 169 awake: rc=4498
LustreError: 388722:0:(libcfs_fail.h:178:cfs_race()) Skipped 21 previous similar messages
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x80000169 || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 16a sleeping
LustreError: 385062:0:(libcfs_fail.h:169:cfs_race()) Skipped 40 previous similar messages
LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 16a waking
LustreError: 388722:0:(libcfs_fail.h:180:cfs_race()) Skipped 40 previous similar messages
LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 16a awake: rc=4497
LustreError: 385062:0:(libcfs_fail.h:178:cfs_race()) Skipped 40 previous similar messages
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x0 || true
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0x8000016a || true
Lustre: 11314:0:(client.c:2321:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1700728104/real 1700728104] req@00000000d5582155 x1783327576143296/t0(0) o13->lustre-OST0003-osc-MDT0003@10.240.43.143@tcp:7/4 lens 224/368 e 0 to 1 dl 1700728111 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-3-3.0'
Lustre: 11314:0:(client.c:2321:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Lustre: lustre-OST0003-osc-MDT0003: Connection to lustre-OST0003 (at 10.240.43.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
LustreError: 166-1: MGC10.240.43.26@tcp: Connection to MGS (at 10.240.43.26@tcp) was lost; in progress operations using this service will fail
Link to test
obdfilter-survey test 1c: Object Storage Targets survey, big batch
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:33]
Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: dm_flakey]
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.19.2.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 c0 6e 22 00 9d 30 c0 e9 b8 6e 22 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 a1 6e 22 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa30b4074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002a6d9867 RBX: ffffc47c40a9b640 RCX: 0000000000000200
RDX: 0000000000000000 RSI: ffff8c6faa6d9000 RDI: ffff8c6fd4261000
RBP: 000055c38aa61000 R08: ffffffffffffffff R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff6 R12: ffff8c6f82cb7308
R13: ffff8c6f81cfa800 R14: ffffc47c41509840 R15: ffff8c6f833a9910
FS: 0000000000000000(0000) GS:ffff8c703fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f7d6d555ac CR3: 0000000075010005 CR4: 00000000000606e0
Call Trace:
collapse_huge_page+0x8e4/0x1010
? mutex_lock+0xe/0x30
khugepaged+0xed0/0x11e0
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x10b/0x130
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-425.19.2.el8_lustre.ddn17.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: 10904:0:(client.c:2321:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1699645028/real 1699645028] req@00000000a9eabe2f x1782190199880704/t0(0) o13->lustre-OST0001-osc-MDT0003@10.240.42.238@tcp:7/4 lens 224/368 e 0 to 1 dl 1699645035 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-1-3.0'
Lustre: 10904:0:(client.c:2321:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-OST0001-osc-MDT0003: Connection to lustre-OST0001 (at 10.240.42.238@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 3 previous similar messages
LustreError: 166-1: MGC10.240.42.241@tcp: Connection to MGS (at 10.240.42.241@tcp) was lost; in progress operations using this service will fail
LustreError: Skipped 4 previous similar messages
Link to test
sanity test 127c: test llite extent stats with regular
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:34]
Modules linked in: obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4 virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: llog_test]
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.21.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 90 6e 41 00 9d 30 c0 e9 88 6e 41 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 71 6e 41 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffaf7fc0753d18 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000004f864867 RBX: ffffd8f3c13e1900 RCX: 0000000000000200
RDX: 7fffffffb079b798 RSI: ffff98d7cf864000 RDI: ffff98d7e4aa6000
RBP: 000055a9896a6000 R08: 00000000000396d0 R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffffc R12: ffff98d7ea293530
R13: ffff98d795848000 R14: ffffd8f3c192a980 R15: ffff98d79ed59000
FS: 0000000000000000(0000) GS:ffff98d83fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a98a276000 CR3: 0000000013e10005 CR4: 00000000001706e0
Call Trace:
collapse_huge_page+0x8d7/0x1000
khugepaged+0xed9/0x11e0
? __schedule+0x2d9/0x870
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x134/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 34 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-477.21.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: 9981:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1695922608/real 1695922608] req@00000000068ea7bb x1778290835190784/t0(0) o13->lustre-OST0004-osc-MDT0000@10.240.29.153@tcp:7/4 lens 224/368 e 0 to 1 dl 1695922624 ref 1 fl Rpc:XQr/200/ffffffff rc 0/-1 job:'osp-pre-4-0.0' uid:0 gid:0
Lustre: lustre-OST0004-osc-MDT0000: Connection to lustre-OST0004 (at 10.240.29.153@tcp) was lost; in progress operations using this service will wait for recovery to complete
Link to test
replay-single test 70b: dbench 4mdts recovery; 2 clients
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:33]
Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel net_failover serio_raw failover virtio_blk
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-348.7.1.el8_5.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa42f4074bd20 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002d2d5825 RBX: ffffea8740b4b540 RCX: 0000000000000200
RDX: 0000000000000000 RSI: ffff9089ad2d5000 RDI: ffff9089b7ad5000
RBP: 0000557c626d5000 R08: ffffffffffffffe9 R09: 0000000000000088
R10: ffffffffffffffff R11: 00000000fffffffe R12: ffff9089829156a8
R13: ffff908a0f4617c0 R14: ffffea8740deb540 R15: ffff908985c67910
FS: 0000000000000000(0000) GS:ffff908a3fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8408f587bc CR3: 000000008de10006 CR4: 00000000000606e0
Call Trace:
collapse_huge_page+0x914/0xff0
khugepaged+0xecc/0x11d0
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x116/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-348.7.1.el8_5.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x80
panic+0xe7/0x2a9
? __switch_to_asm+0x51/0x70
watchdog_timer_fn.cold.9+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
Lustre: DEBUG MARKER:
Lustre: DEBUG MARKER: mount | grep /mnt/lustre' '
Lustre: DEBUG MARKER: set -x; MISSING_DBENCH_OK= PATH=/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests/:/usr/share/doc/dbench/loadfiles DBENCH_LIB=/usr/share/doc/dbench/loadfiles
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started rundbench load pid=213562 ...
Lustre: DEBUG MARKER: Started rundbench load pid=213562 ...
Lustre: DEBUG MARKER: killall -0 dbench
Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
Lustre: 7987:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1672887608/real 1672887608] req@000000007c21fefb x1754142100258752/t0(0) o400->MGC10.240.24.91@tcp@10.240.24.91@tcp:26/25 lens 224/224 e 0 to 1 dl 1672887616 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
LustreError: 166-1: MGC10.240.24.91@tcp: Connection to MGS (at 10.240.24.91@tcp) was lost; in progress operations using this service will fail
LustreError: Skipped 2 previous similar messages
Lustre: lustre-MDT0002-mdc-ffff908982aab000: Connection to lustre-MDT0002 (at 10.240.24.91@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70b fail mds1 1 times
Link to test
recovery-mds-scale test failover_mds: failover MDS
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:33]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sd_mod t10_pi sg iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel virtio_blk serio_raw net_failover failover
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-372.32.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 c0 fb 23 00 9d 30 c0 e9 b8 fb 23 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 a1 fb 23 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffa9bf8074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000000d04e867 RBX: ffffd5ad80341380 RCX: 0000000000000200
RDX: 0000000000000000 RSI: ffff91790d04e000 RDI: ffff917900284000
RBP: 000055cc59c84000 R08: ffffffffffffffff R09: 0000000000000011
R10: 0000000000000007 R11: 00000000fffffff7 R12: ffff91794df36420
R13: ffff91797446d000 R14: ffffd5ad8000a100 R15: ffff917905c2d9f8
FS: 0000000000000000(0000) GS:ffff9179bfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559bb46731ac CR3: 0000000072a10006 CR4: 00000000000606e0
Call Trace:
collapse_huge_page+0x8e4/0x1010
khugepaged+0xecf/0x11e0
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x10a/0x120
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-372.32.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x70
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on onyx-42vm8
Lustre: DEBUG MARKER: Started client load: dd on onyx-42vm8
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-42vm9
Lustre: DEBUG MARKER: Started client load: tar on onyx-42vm9
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956314/real 1670956314] req@0000000083751d84 x1752124496070976/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670956321 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0004: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 2 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0000: Received new MDS connection from 10.240.23.150@tcp, remove former export from same NID
Lustre: Skipped 2 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670956332/real 1670956350] req@00000000474257a5 x1752124496073536/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956376 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x2c450038ab2a1d7b to 0xc9cf0cbefe9ea54f
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 1 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 1 times, and counting...
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956314/real 1670956314] req@00000000125812a3 x1752124496071040/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956358 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 29 previous similar messages
Lustre: lustre-OST0005: deleting orphan objects from 0x0:49 to 0x0:65
Lustre: lustre-OST0002: deleting orphan objects from 0x0:49 to 0x0:65
Lustre: lustre-OST0006: deleting orphan objects from 0x0:49 to 0x0:65
Lustre: lustre-OST0001: deleting orphan objects from 0x0:49 to 0x0:65
Lustre: lustre-OST0004: deleting orphan objects from 0x0:48 to 0x0:65
Lustre: lustre-OST0003: deleting orphan objects from 0x0:48 to 0x0:65
Lustre: lustre-OST0000: deleting orphan objects from 0x0:49 to 0x0:65
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956319/real 1670956319] req@00000000be1be6fa x1752124496071552/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956363 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670956326/real 1670956326] req@000000002933c035 x1752124496072512/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670956370 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=62 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=62 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670957518/real 1670957518] req@00000000e46e732b x1752124496455872/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670957525 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 2 previous similar messages
Lustre: lustre-OST0002: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 2 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670957537/real 1670957556] req@000000005325c217 x1752124496458496/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957583 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670957542/real 1670957560] req@00000000ddc349d1 x1752124496458944/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957588 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: 36949:0:(service.c:2345:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/1s); client may timeout req@00000000592c448f x1752125968220800/t0(0) o8-><?>@<unknown>:0/0 lens 520/416 e 0 to 0 dl 1670957560 ref 1 fl Complete:/0/0 rc -114/-114 job:''
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670957518/real 1670957518] req@0000000071ac08ab x1752124496455936/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957564 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xc9cf0cbefe9ea54f to 0x3359c7c17f295d00
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 2 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 2 times, and counting...
Lustre: lustre-OST0004: deleting orphan objects from 0x0:600 to 0x0:705
Lustre: lustre-OST0003: deleting orphan objects from 0x0:720 to 0x0:833
Lustre: lustre-OST0000: deleting orphan objects from 0x0:636 to 0x0:705
Lustre: lustre-OST0002: deleting orphan objects from 0x0:632 to 0x0:705
Lustre: lustre-OST0005: deleting orphan objects from 0x0:578 to 0x0:641
Lustre: lustre-OST0001: deleting orphan objects from 0x0:680 to 0x0:769
Lustre: lustre-OST0006: deleting orphan objects from 0x0:823 to 0x0:961
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670957530/real 1670957530] req@00000000f157610b x1752124496457600/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670957576 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 44 previous similar messages
Lustre: lustre-OST0000: Export 00000000515a5654 already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0000: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: lustre-OST0001: Export 00000000756ae6ba already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0001: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: lustre-OST0002: Export 00000000780f6cbc already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0002: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: lustre-OST0004: Export 00000000934f6a39 already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0004: Export 00000000934f6a39 already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0004: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=1276 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=1276 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670958715/real 1670958715] req@000000000a5c192c x1752124496890688/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670958722 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 3 previous similar messages
Lustre: lustre-OST0004: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 3 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670958744/real 1670958754] req@00000000046c334d x1752124496894272/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670958788 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x3359c7c17f295d00 to 0x72c4ab34c3498060
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001: deleting orphan objects from 0x0:1429 to 0x0:1473
Lustre: lustre-OST0002: deleting orphan objects from 0x0:1667 to 0x0:1729
Lustre: lustre-OST0005: deleting orphan objects from 0x0:1360 to 0x0:1441
Lustre: lustre-OST0003: deleting orphan objects from 0x0:1509 to 0x0:1633
Lustre: lustre-OST0004: deleting orphan objects from 0x0:1341 to 0x0:1505
Lustre: lustre-OST0000: deleting orphan objects from 0x0:1269 to 0x0:1345
Lustre: lustre-OST0006: deleting orphan objects from 0x0:1604 to 0x0:1697
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670958715/real 1670958715] req@000000006a500430 x1752124496890816/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670958759 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 3 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 3 times, and counting...
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670958727/real 1670958727] req@00000000722f25f5 x1752124496892288/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670958771 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
connection1:0: detected conn error (1020)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=2470 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=2470 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670959921/real 1670959921] req@00000000d1c8ecc0 x1752124497328000/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670959928 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 2 previous similar messages
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0002: Export 0000000093c4dca7 already connecting from 10.240.23.149@tcp
Lustre: Skipped 1 previous similar message
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670959944/real 1670959963] req@0000000005137350 x1752124497331328/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670959990 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: lustre-MDT0000-lwp-OST0005: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 5 previous similar messages
Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0002: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670959921/real 1670959921] req@000000005e086b00 x1752124497328064/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670959967 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 4 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 4 times, and counting...
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x72c4ab34c3498060 to 0xd62eb1972448fc10
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: lustre-OST0002: deleting orphan objects from 0x0:2334 to 0x0:2465
Lustre: lustre-OST0004: deleting orphan objects from 0x0:2117 to 0x0:2209
Lustre: lustre-OST0001: deleting orphan objects from 0x0:2127 to 0x0:2209
Lustre: lustre-OST0000: deleting orphan objects from 0x0:2002 to 0x0:2081
Lustre: lustre-OST0003: deleting orphan objects from 0x0:2255 to 0x0:2369
Lustre: lustre-OST0006: deleting orphan objects from 0x0:2301 to 0x0:2433
Lustre: lustre-OST0005: deleting orphan objects from 0x0:2100 to 0x0:2177
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670959933/real 1670959933] req@000000002010db62 x1752124497329536/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670959979 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 51 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=3678 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=3678 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670961120/real 1670961120] req@000000000711473d x1752124497763264/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670961127 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 4 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0003: deleting orphan objects from 0x0:3132 to 0x0:3169
Lustre: lustre-OST0001: deleting orphan objects from 0x0:3000 to 0x0:3137
Lustre: lustre-OST0000: deleting orphan objects from 0x0:2846 to 0x0:2945
Lustre: lustre-OST0006: deleting orphan objects from 0x0:3299 to 0x0:3361
Lustre: lustre-OST0002: deleting orphan objects from 0x0:3241 to 0x0:3361
Lustre: lustre-OST0005: deleting orphan objects from 0x0:2943 to 0x0:3041
Lustre: lustre-OST0004: deleting orphan objects from 0x0:3006 to 0x0:3073
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670961144/real 1670961158] req@00000000a2b3621e x1752124497766784/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670961191 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0006: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xd62eb1972448fc10 to 0x20d5d1f7f970463d
Lustre: lustre-MDT0000-lwp-OST0001: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 5 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 5 times, and counting...
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670961120/real 1670961120] req@0000000079b128b7 x1752124497763328/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670961167 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670961127/real 1670961127] req@000000009d89b971 x1752124497764544/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670961174 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=4868 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=4868 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670962326/real 1670962326] req@00000000dd29ada6 x1752124498303488/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670962333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670962344/real 1670962363] req@00000000f1d2af20 x1752124498306112/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670962390 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 5 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670962349/real 1670962367] req@00000000a7771c63 x1752124498306560/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670962395 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 6 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 6 times, and counting...
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670962349/real 1670962376] req@00000000015da59a x1752124498306496/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670962395 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x20d5d1f7f970463d to 0x90e0347dd307f316
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: lustre-OST0000: deleting orphan objects from 0x0:3844 to 0x0:3937
Lustre: lustre-OST0001: deleting orphan objects from 0x0:3942 to 0x0:4129
Lustre: lustre-OST0005: deleting orphan objects from 0x0:3904 to 0x0:4033
Lustre: lustre-OST0004: deleting orphan objects from 0x0:3961 to 0x0:4065
Lustre: lustre-OST0003: deleting orphan objects from 0x0:4037 to 0x0:4161
Lustre: lustre-OST0002: deleting orphan objects from 0x0:4119 to 0x0:4289
Lustre: lustre-OST0006: deleting orphan objects from 0x0:4177 to 0x0:4353
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=6079 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=6079 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670963518/real 1670963518] req@000000009ec0395c x1752124498738240/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670963525 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 45 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670963536/real 1670963554] req@000000002d0d1c97 x1752124498740800/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670963580 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 10 previous similar messages
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670963541/real 1670963558] req@000000005937a913 x1752124498741248/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670963585 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 7 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 7 times, and counting...
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670963523/real 1670963523] req@00000000793b7852 x1752124498739136/t0(0) o400->lustre-MDT0000-lwp-OST0005@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670963567 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x90e0347dd307f316 to 0x5d59d660893718a7
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 8 previous similar messages
Lustre: lustre-OST0002: deleting orphan objects from 0x0:5156 to 0x0:5345
Lustre: lustre-OST0001: deleting orphan objects from 0x0:4919 to 0x0:4993
Lustre: lustre-OST0003: deleting orphan objects from 0x0:4968 to 0x0:5089
Lustre: lustre-OST0000: deleting orphan objects from 0x0:4675 to 0x0:4833
Lustre: lustre-OST0004: deleting orphan objects from 0x0:4868 to 0x0:5025
Lustre: lustre-OST0005: deleting orphan objects from 0x0:4804 to 0x0:4929
Lustre: lustre-OST0006: deleting orphan objects from 0x0:5130 to 0x0:5281
Lustre: ll_ost_io00_002: service thread pid 16621 was inactive for 62.657 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 16621, comm: ll_ost_io00_002 4.18.0-372.32.1.el8_lustre.x86_64 #1 SMP Thu Oct 27 18:54:42 UTC 2022
Call Trace TBD:
[<0>] cv_wait_common+0xaf/0x130 [spl]
[<0>] txg_wait_synced_impl+0xc6/0x110 [zfs]
[<0>] txg_wait_synced+0xc/0x40 [zfs]
[<0>] dmu_tx_wait+0x377/0x390 [zfs]
[<0>] dmu_tx_assign+0x157/0x470 [zfs]
[<0>] osd_trans_start+0x1b7/0x430 [osd_zfs]
[<0>] ofd_write_attr_set+0x11d/0x1070 [ofd]
[<0>] ofd_commitrw_write+0x205/0x1a70 [ofd]
[<0>] ofd_commitrw+0x5f0/0xd70 [ofd]
[<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
[<0>] tgt_brw_write+0x153f/0x1ad0 [ptlrpc]
[<0>] tgt_request_handle+0xc90/0x19c0 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[<0>] kthread+0x10a/0x120
[<0>] ret_from_fork+0x35/0x40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=7270 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=7270 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670964730/real 1670964730] req@00000000a2c96647 x1752124499072128/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670964737 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 54 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 5 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670964749/real 1670964768] req@000000000a86c8c0 x1752124499074752/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670964793 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 3 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670964754/real 1670964772] req@000000003f5caa9f x1752124499075136/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670964798 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 8 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 8 times, and counting...
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670964764/real 1670964780] req@0000000070fe4b84 x1752124499076288/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670964808 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x5d59d660893718a7 to 0x3ab30235b82c90f5
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001: deleting orphan objects from 0x0:5382 to 0x0:5441
Lustre: lustre-OST0000: deleting orphan objects from 0x0:5201 to 0x0:5281
Lustre: lustre-OST0002: deleting orphan objects from 0x0:5865 to 0x0:6049
Lustre: lustre-OST0006: deleting orphan objects from 0x0:5650 to 0x0:5729
Lustre: lustre-OST0005: deleting orphan objects from 0x0:5325 to 0x0:5377
Lustre: lustre-OST0004: deleting orphan objects from 0x0:5418 to 0x0:5473
Lustre: lustre-OST0003: deleting orphan objects from 0x0:5458 to 0x0:5537
Lustre: lustre-OST0006: Export 000000006ed2958b already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0006: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=8485 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=8485 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670965927/real 1670965927] req@00000000bae788b6 x1752124499506496/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670965934 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670965927/real 1670965927] req@00000000a6417e9c x1752124499506560/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670965938 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 5 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670965932/real 1670965932] req@000000000d5229a5 x1752124499507072/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670965943 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x3ab30235b82c90f5 to 0x1cd08a9d1032c7de
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: lustre-OST0003: deleting orphan objects from 0x0:6166 to 0x0:6241
Lustre: lustre-OST0004: deleting orphan objects from 0x0:6098 to 0x0:6177
Lustre: lustre-OST0002: deleting orphan objects from 0x0:6759 to 0x0:6945
Lustre: lustre-OST0001: deleting orphan objects from 0x0:6133 to 0x0:6273
Lustre: lustre-OST0000: deleting orphan objects from 0x0:5961 to 0x0:6145
Lustre: lustre-OST0005: deleting orphan objects from 0x0:6068 to 0x0:6145
Lustre: lustre-OST0006: deleting orphan objects from 0x0:6413 to 0x0:6593
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 9 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 9 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=9680 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=9680 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670967134/real 1670967134] req@000000002c0c3d68 x1752124499890560/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670967141 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 5 previous similar messages
Lustre: lustre-OST0001: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0005: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0004: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0000: Export 00000000e9d71d58 already connecting from 10.240.23.149@tcp
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670967157/real 1670967168] req@00000000ba54c862 x1752124499893568/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670967203 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x1cd08a9d1032c7de to 0xdd76e30dbe9412d8
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0000: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670967134/real 1670967134] req@00000000f34c6d2d x1752124499890624/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670967180 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
Lustre: lustre-OST0000: deleting orphan objects from 0x0:6766 to 0x0:6881
Lustre: lustre-OST0001: deleting orphan objects from 0x0:6920 to 0x0:7009
Lustre: lustre-OST0005: deleting orphan objects from 0x0:6764 to 0x0:6881
Lustre: lustre-OST0004: deleting orphan objects from 0x0:6797 to 0x0:6913
Lustre: lustre-OST0002: deleting orphan objects from 0x0:7558 to 0x0:7681
Lustre: lustre-OST0006: deleting orphan objects from 0x0:7233 to 0x0:7329
Lustre: lustre-OST0003: deleting orphan objects from 0x0:6851 to 0x0:6913
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670967140/real 1670967140] req@00000000d763aa97 x1752124499891136/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670967186 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 10 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 10 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=10898 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=10898 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968325/real 1670968325] req@000000001ac96f64 x1752124500429312/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670968332 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 27 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968325/real 1670968325] req@0000000015518c45 x1752124500429440/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670968333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968325/real 1670968325] req@0000000051ff97cd x1752124500429376/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670968333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670968330/real 1670968330] req@000000008da5aff1 x1752124500429952/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670968338 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xdd76e30dbe9412d8 to 0x7bf8ea38d6048ee1
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0003: deleting orphan objects from 0x0:7739 to 0x0:7777
Lustre: lustre-OST0002: deleting orphan objects from 0x0:8565 to 0x0:8705
Lustre: lustre-OST0001: deleting orphan objects from 0x0:7848 to 0x0:7937
Lustre: lustre-OST0000: deleting orphan objects from 0x0:7761 to 0x0:7841
Lustre: lustre-OST0006: deleting orphan objects from 0x0:8336 to 0x0:8417
Lustre: lustre-OST0005: deleting orphan objects from 0x0:7862 to 0x0:7969
Lustre: lustre-OST0004: deleting orphan objects from 0x0:7898 to 0x0:8001
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 11 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 11 times, and counting...
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: lustre-OST0003: Export 0000000042a16b27 already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0003: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: lustre-OST0003: Export 00000000ece6f824 already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0003: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=12075 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=12075 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969523/real 1670969523] req@00000000f12ca16d x1752124500966912/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670969530 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969523/real 1670969523] req@00000000725694ff x1752124500967360/t0(0) o400->lustre-MDT0000-lwp-OST0006@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670969531 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 2 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969528/real 1670969528] req@00000000da258bd4 x1752124500967872/t0(0) o400->lustre-MDT0000-lwp-OST0006@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670969536 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670969530/real 1670969530] req@00000000ca109534 x1752124500968064/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670969538 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0004: Export 000000007d9dcb9e already connecting from 10.240.23.149@tcp
Lustre: Skipped 1 previous similar message
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x7bf8ea38d6048ee1 to 0xcfa98585f718d913
Lustre: lustre-OST0004: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID
Lustre: Skipped 1 previous similar message
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0000: deleting orphan objects from 0x0:8783 to 0x0:8865
Lustre: lustre-OST0004: deleting orphan objects from 0x0:8881 to 0x0:8961
Lustre: lustre-OST0003: deleting orphan objects from 0x0:8495 to 0x0:8577
Lustre: lustre-OST0002: deleting orphan objects from 0x0:9651 to 0x0:9729
Lustre: lustre-OST0001: deleting orphan objects from 0x0:8790 to 0x0:8833
Lustre: lustre-OST0005: deleting orphan objects from 0x0:8799 to 0x0:8865
Lustre: lustre-OST0006: deleting orphan objects from 0x0:9300 to 0x0:9377
Lustre: lustre-MDT0000-lwp-OST0005: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 12 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 12 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=13287 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=13287 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670970730/real 1670970730] req@000000000be1824f x1752124501454208/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670970737 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Lustre: Skipped 10 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670970735/real 1670970735] req@000000000ac2f95f x1752124501454784/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670970742 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0006: deleting orphan objects from 0x0:10249 to 0x0:10337
Lustre: lustre-OST0005: deleting orphan objects from 0x0:9711 to 0x0:9761
Lustre: lustre-OST0004: deleting orphan objects from 0x0:9770 to 0x0:9825
Lustre: lustre-OST0003: deleting orphan objects from 0x0:9325 to 0x0:9377
Lustre: lustre-OST0002: deleting orphan objects from 0x0:10549 to 0x0:10625
Lustre: lustre-OST0001: deleting orphan objects from 0x0:9614 to 0x0:9793
Lustre: lustre-OST0000: deleting orphan objects from 0x0:9668 to 0x0:9761
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xcfa98585f718d913 to 0x47d893cda4104e2b
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 13 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 13 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=14480 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=14480 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670971922/real 1670971922] req@000000006d76acd4 x1752124501990784/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670971929 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670971940/real 1670971959] req@00000000e33304d6 x1752124501993536/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971986 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0004: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 2 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670971945/real 1670971963] req@0000000003fa5e40 x1752124501994304/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971991 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670971922/real 1670971922] req@000000001fee5a31 x1752124501990848/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971968 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670971945/real 1670971972] req@00000000085fcee2 x1752124501994240/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971991 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Lustre: lustre-MDT0000-lwp-OST0001: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x47d893cda4104e2b to 0x613ab883dcd550ef
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: lustre-OST0001: deleting orphan objects from 0x0:10766 to 0x0:10913
Lustre: lustre-OST0000: deleting orphan objects from 0x0:10727 to 0x0:10913
Lustre: lustre-OST0006: deleting orphan objects from 0x0:11315 to 0x0:11457
Lustre: lustre-OST0004: deleting orphan objects from 0x0:10792 to 0x0:10945
Lustre: lustre-OST0003: deleting orphan objects from 0x0:10327 to 0x0:10401
Lustre: lustre-OST0002: deleting orphan objects from 0x0:11588 to 0x0:11745
Lustre: lustre-OST0005: deleting orphan objects from 0x0:10727 to 0x0:10881
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 14 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 14 times, and counting...
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670971934/real 1670971934] req@000000006cb6e88b x1752124501992512/t0(0) o400->lustre-MDT0000-lwp-OST0003@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670971980 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=15683 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=15683 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670973124/real 1670973124] req@0000000015a5718d x1752124502632832/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670973131 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Skipped 2 previous similar messages
Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Skipped 1 previous similar message
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670973142/real 1670973158] req@00000000d3667447 x1752124502635456/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670973188 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 3 previous similar messages
Lustre: lustre-OST0003: Export 0000000098559874 already connecting from 10.240.23.150@tcp
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0003: Received new MDS connection from 10.240.23.150@tcp, remove former export from same NID
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670973147/real 1670973167] req@000000004ecfde8b x1752124502635840/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670973193 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 15 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 15 times, and counting...
Lustre: lustre-MDT0000-lwp-OST0005: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x613ab883dcd550ef to 0x73949b8b1e7497e6
Lustre: lustre-OST0000: deleting orphan objects from 0x0:12039 to 0x0:12225
Lustre: lustre-OST0003: deleting orphan objects from 0x0:11439 to 0x0:11521
Lustre: lustre-OST0001: deleting orphan objects from 0x0:12065 to 0x0:12097
Lustre: lustre-OST0005: deleting orphan objects from 0x0:11988 to 0x0:12065
Lustre: lustre-OST0006: deleting orphan objects from 0x0:12547 to 0x0:12641
Lustre: lustre-OST0004: deleting orphan objects from 0x0:12017 to 0x0:12129
Lustre: lustre-OST0002: deleting orphan objects from 0x0:12958 to 0x0:13185
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670973129/real 1670973129] req@000000005911a6a9 x1752124502633472/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670973175 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 27 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=16874 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=16874 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670974326/real 1670974326] req@000000002d4ed59b x1752124503017152/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670974333 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 27 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670974331/real 1670974331] req@000000002648476d x1752124503017728/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670974339 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 4 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0004: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x73949b8b1e7497e6 to 0x71520c8d34a640c9
Lustre: lustre-OST0002: deleting orphan objects from 0x0:13756 to 0x0:13825
Lustre: lustre-OST0005: deleting orphan objects from 0x0:12625 to 0x0:12705
Lustre: lustre-OST0000: deleting orphan objects from 0x0:12837 to 0x0:13025
Lustre: lustre-OST0001: deleting orphan objects from 0x0:12657 to 0x0:12737
Lustre: lustre-OST0003: deleting orphan objects from 0x0:12136 to 0x0:12321
Lustre: lustre-OST0004: deleting orphan objects from 0x0:12713 to 0x0:12801
Lustre: lustre-OST0006: deleting orphan objects from 0x0:13211 to 0x0:13281
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 16 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 16 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=18078 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=18078 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975528/real 1670975528] req@00000000ff299ae7 x1752124503555072/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670975535 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670975551/real 1670975566] req@00000000343cd71e x1752124503558080/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975595 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x71520c8d34a640c9 to 0xc7e5116db274ea8e
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 8 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 17 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 17 times, and counting...
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975528/real 1670975528] req@0000000011efd632 x1752124503555136/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975572 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975533/real 1670975533] req@000000009aa7600f x1752124503555712/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975577 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Lustre: lustre-OST0003: deleting orphan objects from 0x0:13284 to 0x0:13409
Lustre: lustre-OST0002: deleting orphan objects from 0x0:14736 to 0x0:14849
Lustre: lustre-OST0004: deleting orphan objects from 0x0:13716 to 0x0:13825
Lustre: lustre-OST0001: deleting orphan objects from 0x0:13646 to 0x0:13761
Lustre: lustre-OST0005: deleting orphan objects from 0x0:13663 to 0x0:13729
Lustre: lustre-OST0006: deleting orphan objects from 0x0:14190 to 0x0:14305
Lustre: lustre-OST0000: deleting orphan objects from 0x0:13937 to 0x0:14049
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975541/real 1670975541] req@000000006e4b0e09 x1752124503556672/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975585 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670975541/real 1670975541] req@00000000898b2c84 x1752124503556608/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670975585 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=19277 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=19277 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670976733/real 1670976733] req@000000003b58e812 x1752124503992384/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670976740 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670976757/real 1670976771] req@00000000c84ce29b x1752124503995968/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670976801 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 2 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xc7e5116db274ea8e to 0x2b7f3f160f5c0574
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670976733/real 1670976733] req@0000000049adc9e8 x1752124503992448/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670976777 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 18 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 18 times, and counting...
Lustre: lustre-OST0000: deleting orphan objects from 0x0:14776 to 0x0:14913
Lustre: lustre-OST0001: deleting orphan objects from 0x0:14505 to 0x0:14689
Lustre: lustre-OST0002: deleting orphan objects from 0x0:15587 to 0x0:15649
Lustre: lustre-OST0003: deleting orphan objects from 0x0:14147 to 0x0:14337
Lustre: lustre-OST0006: deleting orphan objects from 0x0:15023 to 0x0:15105
Lustre: lustre-OST0004: deleting orphan objects from 0x0:14504 to 0x0:14625
Lustre: lustre-OST0005: deleting orphan objects from 0x0:14463 to 0x0:14529
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670976745/real 1670976745] req@000000008f3c2279 x1752124503993920/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670976789 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=20486 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=20486 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670977934/real 1670977934] req@000000008e512e0c x1752124504463424/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670977941 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670977952/real 1670977965] req@00000000598f46e2 x1752124504465984/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670977998 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: lustre-MDT0000-lwp-OST0005: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 5 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0x2b7f3f160f5c0574 to 0xcb7fda250d948c28
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 19 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 19 times, and counting...
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670977934/real 1670977934] req@000000001c9c8956 x1752124504463552/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670977980 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
Lustre: lustre-OST0006: deleting orphan objects from 0x0:15805 to 0x0:15873
Lustre: lustre-OST0005: deleting orphan objects from 0x0:15294 to 0x0:15361
Lustre: lustre-OST0004: deleting orphan objects from 0x0:15374 to 0x0:15457
Lustre: lustre-OST0003: deleting orphan objects from 0x0:15114 to 0x0:15297
Lustre: lustre-OST0002: deleting orphan objects from 0x0:16442 to 0x0:16577
Lustre: lustre-OST0001: deleting orphan objects from 0x0:15553 to 0x0:15649
Lustre: lustre-OST0000: deleting orphan objects from 0x0:15710 to 0x0:15841
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670977946/real 1670977946] req@0000000049d48248 x1752124504464960/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670977992 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=21681 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=21681 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670979128/real 1670979128] req@000000000608e7be x1752124505000640/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670979135 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0001: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0006: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670979146/real 1670979163] req@000000008658bda6 x1752124505003200/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670979190 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670979151/real 1670979167] req@00000000822c813a x1752124505003712/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670979195 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 20 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 20 times, and counting...
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670979133/real 1670979133] req@0000000065f1bd72 x1752124505001216/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670979177 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xcb7fda250d948c28 to 0xf3e6320145a82e53
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: lustre-OST0004: deleting orphan objects from 0x0:16346 to 0x0:16385
Lustre: lustre-OST0003: deleting orphan objects from 0x0:16149 to 0x0:16225
Lustre: lustre-OST0002: deleting orphan objects from 0x0:17420 to 0x0:17505
Lustre: lustre-OST0001: deleting orphan objects from 0x0:16505 to 0x0:16577
Lustre: lustre-OST0000: deleting orphan objects from 0x0:16757 to 0x0:16833
Lustre: lustre-OST0006: deleting orphan objects from 0x0:16958 to 0x0:17313
Lustre: lustre-OST0005: deleting orphan objects from 0x0:16397 to 0x0:16481
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=22882 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=22882 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670980335/real 1670980335] req@000000007b1f7887 x1752124505507200/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670980342 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 54 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 7 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0000: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0004: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Skipped 1 previous similar message
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670980354/real 1670980370] req@00000000debfcc42 x1752124505510208/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670980398 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 9 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xf3e6320145a82e53 to 0x51496d62371084ed
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 5 previous similar messages
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670980335/real 1670980335] req@000000009743021d x1752124505507264/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670980379 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: lustre-OST0006: deleting orphan objects from 0x0:18075 to 0x0:18113
Lustre: lustre-OST0005: deleting orphan objects from 0x0:17273 to 0x0:17441
Lustre: lustre-OST0004: deleting orphan objects from 0x0:17137 to 0x0:17281
Lustre: lustre-OST0003: deleting orphan objects from 0x0:17108 to 0x0:17281
Lustre: lustre-OST0002: deleting orphan objects from 0x0:18328 to 0x0:18465
Lustre: lustre-OST0001: deleting orphan objects from 0x0:17555 to 0x0:17633
Lustre: lustre-OST0000: deleting orphan objects from 0x0:17522 to 0x0:17569
Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670980347/real 1670980347] req@0000000006c0d807 x1752124505508736/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670980391 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 21 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 21 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=24105 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=24105 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670981528/real 1670981528] req@00000000bb3ac64d x1752124505942528/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670981535 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0006: Export 0000000031b1e33a already connecting from 10.240.23.149@tcp
Lustre: lustre-OST0006: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Skipped 1 previous similar message
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670981551/real 1670981567] req@000000004e5f1da8 x1752124505945664/t0(0) o400->lustre-MDT0000-lwp-OST0002@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670981595 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 3 previous similar messages
Lustre: lustre-OST0006: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670981552/real 1670981572] req@00000000628678a3 x1752124505946112/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670981596 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 22 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 22 times, and counting...
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670981552/real 1670981583] req@000000008d6c1aa2 x1752124505946048/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670981596 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0x51496d62371084ed to 0xf280a8013c3b1c17
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0000: deleting orphan objects from 0x0:18430 to 0x0:18785
Lustre: lustre-OST0001: deleting orphan objects from 0x0:18400 to 0x0:18529
Lustre: lustre-OST0002: deleting orphan objects from 0x0:19375 to 0x0:19553
Lustre: lustre-OST0006: deleting orphan objects from 0x0:18820 to 0x0:18913
Lustre: lustre-OST0005: deleting orphan objects from 0x0:18192 to 0x0:18337
Lustre: lustre-OST0003: deleting orphan objects from 0x0:18059 to 0x0:18113
Lustre: lustre-OST0004: deleting orphan objects from 0x0:18047 to 0x0:18177
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=25286 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=25286 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670982745/real 1670982745] req@00000000e07b6f6e x1752124506482624/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670982752 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 4 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670982763/real 1670982782] req@00000000a3489656 x1752124506485248/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670982809 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670982768/real 1670982786] req@00000000868e7051 x1752124506485696/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670982814 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670982768/real 1670982795] req@00000000fbf2880a x1752124506485632/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670982814 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xf280a8013c3b1c17 to 0xa30b6b9752b5a750
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 23 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 23 times, and counting...
Lustre: lustre-OST0000: deleting orphan objects from 0x0:19723 to 0x0:19841
Lustre: lustre-OST0001: deleting orphan objects from 0x0:19392 to 0x0:19457
Lustre: lustre-OST0002: deleting orphan objects from 0x0:20521 to 0x0:20641
Lustre: lustre-OST0004: deleting orphan objects from 0x0:19019 to 0x0:19073
Lustre: lustre-OST0003: deleting orphan objects from 0x0:18949 to 0x0:19009
Lustre: lustre-OST0005: deleting orphan objects from 0x0:19178 to 0x0:19233
Lustre: lustre-OST0006: deleting orphan objects from 0x0:19765 to 0x0:19809
Lustre: lustre-OST0006: Export 00000000a3f31d99 already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0006: Export 00000000a3f31d99 already connecting from 10.240.23.192@tcp
Lustre: lustre-OST0006: Client b06f5f2d-9b16-4775-97bc-28084a0929d5 (at 10.240.23.192@tcp) reconnecting
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=26502 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=26502 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670983931/real 1670983931] req@00000000369cbb00 x1752124506967424/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.150@tcp:26/25 lens 224/224 e 0 to 1 dl 1670983938 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.150@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 3 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670983936/real 1670983936] req@00000000063d8a0d x1752124506968000/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670983944 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.149@tcp, removing former export from 10.240.23.150@tcp
Lustre: Skipped 8 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0002: Export 00000000c5a99897 already connecting from 10.240.23.149@tcp
Lustre: lustre-OST0002: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: Evicted from MGS (at 10.240.23.149@tcp) after server handle changed from 0xa30b6b9752b5a750 to 0xccf1f9ba37039d5f
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.149@tcp (at 10.240.23.149@tcp)
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0002: Received new MDS connection from 10.240.23.149@tcp, remove former export from same NID
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 24 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 24 times, and counting...
Lustre: lustre-OST0006: deleting orphan objects from 0x0:20707 to 0x0:20769
Lustre: lustre-OST0005: deleting orphan objects from 0x0:20110 to 0x0:20193
Lustre: lustre-OST0004: deleting orphan objects from 0x0:19929 to 0x0:20097
Lustre: lustre-OST0003: deleting orphan objects from 0x0:19751 to 0x0:19809
Lustre: lustre-OST0002: deleting orphan objects from 0x0:21531 to 0x0:21601
Lustre: lustre-OST0001: deleting orphan objects from 0x0:20303 to 0x0:20353
Lustre: lustre-OST0000: deleting orphan objects from 0x0:20711 to 0x0:20801
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=27686 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Check clients loads BEFORE failover -- failure NOT OK ELAPSED=27686 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: Wait mds1 recovery complete before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete \*.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: onyx-40vm6.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985215/real 1670985215] req@00000000f41150b5 x1752124507410624/t0(0) o400->MGC10.240.23.149@tcp@10.240.23.149@tcp:26/25 lens 224/224 e 0 to 1 dl 1670985222 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
LustreError: 166-1: MGC10.240.23.149@tcp: Connection to MGS (at 10.240.23.149@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-OST0000: Received MDS connection from 10.240.23.150@tcp, removing former export from 10.240.23.149@tcp
Lustre: Skipped 4 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-OST0003: Export 00000000a6a55e7f already connecting from 10.240.23.150@tcp
Lustre: lustre-OST0003: denying duplicate export for lustre-MDT0000-mdtlov_UUID: rc = -114
Lustre: lustre-OST0003: Received new MDS connection from 10.240.23.150@tcp, remove former export from same NID
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1670985238/real 1670985253] req@00000000e757c834 x1752124507413184/t0(0) o400->lustre-MDT0000-lwp-OST0000@10.240.23.149@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985284 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.23.149@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
Lustre: Evicted from MGS (at 10.240.23.150@tcp) after server handle changed from 0xccf1f9ba37039d5f to 0xe4b6b0058cd8841
Lustre: MGC10.240.23.149@tcp: Connection restored to 10.240.23.150@tcp (at 10.240.23.150@tcp)
Lustre: Skipped 7 previous similar messages
Lustre: DEBUG MARKER: onyx-40vm7.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985215/real 1670985215] req@00000000226bcee0 x1752124507410816/t0(0) o400->lustre-MDT0000-lwp-OST0002@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985261 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985215/real 1670985215] req@00000000166701ca x1752124507410752/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985261 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 failed over 25 times, and counting...
Lustre: DEBUG MARKER: mds1 failed over 25 times, and counting...
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985220/real 1670985220] req@0000000076d79b33 x1752124507411456/t0(0) o400->lustre-MDT0000-lwp-OST0004@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985266 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670985220/real 1670985220] req@0000000085e5ac9f x1752124507411264/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.240.23.150@tcp:12/10 lens 224/224 e 0 to 1 dl 1670985266 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 16508:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 16507:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-OST0005: deleting orphan objects from 0x0:20860 to 0x0:20897
Lustre: lustre-OST0004: deleting orphan objects from 0x0:20763 to 0x0:20833
Lustre: lustre-OST0003: deleting orphan objects from 0x0:20475 to 0x0:20545
Lustre: lustre-OST0000: deleting orphan objects from 0x0:21462 to 0x0:21505
Lustre: lustre-OST0001: deleting orphan objects from 0x0:21019 to 0x0:21089
Lustre: lustre-OST0006: deleting orphan objects from 0x0:21437 to 0x0:21505
Lustre: lustre-OST0002: deleting orphan objects from 0x0:22261 to 0x0:22305
Lustre: lustre-OST0003: Client lustre-MDT0000-mdtlov_UUID (at 10.240.23.150@tcp) reconnecting
Link to test
sanity-benchmark test iozone: iozone
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:33]
Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon joydev pcspkr ext4 ata_generic mbcache jbd2 ata_piix libata virtio_net crc32c_intel serio_raw virtio_blk net_failover failover
CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.3.1.el8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: 75 11 65 48 89 1e 65 48 89 4e 08 9d b0 01 e9 f0 6e 22 00 9d 30 c0 e9 e8 6e 22 00 90 90 90 90 90 90 90 90 66 90 b9 00 02 00 00 <f3> 48 a5 e9 d1 6e 22 00 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffac218074bd10 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000002ddd4867 RBX: ffffcca240b77500 RCX: 0000000000000200
RDX: 0000000000000000 RSI: ffff96526ddd4000 RDI: ffff965269da0000
RBP: 000055a1e43a0000 R08: ffffffffffffffff R09: 0000000000000011
R10: 0000000000000007 R11: 00000000ffffffee R12: ffff965268f56d00
R13: ffff9652a0e55000 R14: ffffcca240a76800 R15: ffff96524363bd98
FS: 0000000000000000(0000) GS:ffff9652ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056216f8d4008 CR3: 000000005f410006 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x8e4/0x1010
khugepaged+0xed0/0x11e0
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x10b/0x130
? set_kthread_struct+0x50/0x50
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-425.3.1.el8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x41/0x60
panic+0xe7/0x2ac
? __switch_to_asm+0x51/0x80
watchdog_timer_fn.cold.10+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x101/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1705984kB available, using 5448816kB file size
Lustre: DEBUG MARKER: min OST has 1705984kB available, using 5448816kB file size
Link to test
conf-sanity test 21b: start ost before mds, stop mds first
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [khugepaged:33]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc dm_mod intel_rapl_msr i2c_piix4 virtio_balloon joydev intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw virtio_blk net_failover failover
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OE --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffff9bc24074bd20 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000003d173865 RBX: ffffc5bc80f45cc0 RCX: 0000000000000200
RDX: 0000000000000000 RSI: ffff8a6cfd173000 RDI: ffff8a6cfe076000
RBP: 0000562787276000 R08: ffffffffffffffe7 R09: 0000000000000088
R10: ffffffffffffffff R11: 00000000fffffffa R12: ffff8a6cedd953b0
R13: ffff8a6d22470000 R14: ffffc5bc80f81d80 R15: ffff8a6cef502488
FS: 0000000000000000(0000) GS:ffff8a6d7fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5b96935f10 CR3: 0000000060e10002 CR4: 00000000000606e0
Call Trace:
collapse_huge_page+0x914/0xff0
khugepaged+0xecc/0x11d0
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x116/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: P OEL --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x80
panic+0xe7/0x2a9
? __switch_to_asm+0x51/0x70
watchdog_timer_fn.cold.9+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-39vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: onyx-39vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]\*.ost_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]\*.ost_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds2
Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs;
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt2/mdt2
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds2; mount -t lustre -o localrecov lustre-mdt2/mdt2 /mnt/lustre-mds2
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt2/mdt2 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt2/mdt2 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds4
Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs;
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt4/mdt4
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds4; mount -t lustre -o localrecov lustre-mdt4/mdt4 /mnt/lustre-mds4
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing set_default_debug -1 all 4
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt4/mdt4 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt4/mdt4 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0002-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-\*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm8.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: onyx-42vm9.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0003-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 40
Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_min
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 40
Lustre: DEBUG MARKER: onyx-37vm8.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0002.ost_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_min
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40
Lustre: DEBUG MARKER: onyx-37vm9.onyx.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0003.ost_server_uuid 40
Link to test
performance-sanity test 8: getattr large files
watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [khugepaged:33]
Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel virtio_balloon pcspkr joydev i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_blk [last unloaded: dm_flakey]
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0000:ffffb5964074bd20 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 000000003d187827 RBX: fffffaef40f461c0 RCX: 0000000000000200
RDX: 0000000000000000 RSI: ffff8a3a3d187000 RDI: ffff8a3a7d9a2000
RBP: 00007fe2b23a2000 R08: ffffffffffffffe9 R09: 0000000000000088
R10: ffffffffffffffff R11: 0000000000000000 R12: ffff8a3a38f88d10
R13: ffff8a3abfd65f00 R14: fffffaef41f66880 R15: ffff8a3a5771d570
FS: 0000000000000000(0000) GS:ffff8a3abfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe2b325f000 CR3: 000000000ca10006 CR4: 00000000001706e0
Call Trace:
collapse_huge_page+0x914/0xff0
khugepaged+0xecc/0x11d0
? finish_wait+0x80/0x80
? collapse_pte_mapped_thp+0x430/0x430
kthread+0x116/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 33 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x80
panic+0xe7/0x2a9
? __switch_to_asm+0x51/0x70
watchdog_timer_fn.cold.9+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh ======
Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh ======
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm15.onyx.whamcloud.com: executing check_config_client \/mnt\/lustre
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm16.onyx.whamcloud.com: executing check_config_client \/mnt\/lustre
Lustre: DEBUG MARKER: onyx-73vm15.onyx.whamcloud.com: executing check_config_client /mnt/lustre
Lustre: DEBUG MARKER: onyx-73vm16.onyx.whamcloud.com: executing check_config_client /mnt/lustre
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds2' ' /proc/mounts);
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-mds4' ' /proc/mounts);
Lustre: DEBUG MARKER: e2label /dev/mapper/mds2_flakey 2>/dev/null
Lustre: DEBUG MARKER: cat /proc/mounts
Lustre: DEBUG MARKER: e2label /dev/mapper/mds4_flakey 2>/dev/null
Lustre: DEBUG MARKER: cat /proc/mounts
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: [ -f /sys/module/mgc/parameters/mgc_requeue_timeout_min ] && echo 1 > /sys/module/mgc/parameters/mgc_requeue_timeout_min; exit 0
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm16.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm17.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-73vm16.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm18.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-73vm17.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-73vm18.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: onyx-73vm19.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param osd-ldiskfs.track_declares_assert=1 || true
Lustre: DEBUG MARKER: params=$(/usr/sbin/lctl get_param mdt.*.enable_remote_dir_gid);
&& param= ||
&& param="$params";
Lustre: DEBUG MARKER: params=$(/usr/sbin/lctl get_param mdt.*.enable_remote_dir_gid);
&& param= ||
&& param="$params";
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.*.enable_remote_dir_gid=-1
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh Test preparation: creating 598434 files.
Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh Test preparation: creating 598434 files.
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh ### 1 NODE STAT ###
Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-stat-large.sh ### 2 NODES STAT ###
Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh
Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774709/real 1661774709] req@000000007a3774d9 x1742478279705984/t0(0) o13->lustre-OST0007-osc-MDT0003@10.240.26.52@tcp:7/4 lens 224/368 e 0 to 1 dl 1661774719 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-7-3.0'
Lustre: lustre-OST0007-osc-MDT0003: Connection to lustre-OST0007 (at 10.240.26.52@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774715/real 1661774715] req@0000000084e9e95a x1742478279706176/t0(0) o13->lustre-OST0001-osc-MDT0003@10.240.26.52@tcp:7/4 lens 224/368 e 0 to 1 dl 1661774725 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-1-3.0'
Lustre: lustre-OST0001-osc-MDT0003: Connection to lustre-OST0001 (at 10.240.26.52@tcp) was lost; in progress operations using this service will wait for recovery to complete
LNetError: 10482:0:(socklnd.c:1531:ksocknal_destroy_conn()) Incomplete receive of lnet header from 12345-10.240.26.52@tcp, ip 10.240.26.52:1023, with error, protocol: 3.x.
Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774723/real 1661774723] req@00000000a7c123c5 x1742478279706304/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.240.26.53@tcp:24/4 lens 224/224 e 0 to 1 dl 1661774733 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.240.26.53@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
LustreError: 166-1: MGC10.240.26.53@tcp: Connection to MGS (at 10.240.26.53@tcp) was lost; in progress operations using this service will fail
Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774726/real 1661774726] req@000000000f212a47 x1742478279707968/t0(0) o400->lustre-MDT0002-osp-MDT0001@10.240.26.53@tcp:24/4 lens 224/224 e 0 to 1 dl 1661774736 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774731/real 1661774731] req@00000000fbf46214 x1742478279709376/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.240.26.53@tcp:24/4 lens 224/224 e 0 to 1 dl 1661774741 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
LNetError: 10479:0:(socklnd_cb.c:1779:ksocknal_recv_hello()) Error -104 reading HELLO from 10.240.26.52
LNetError: 11b-b: Connection to 10.240.26.52@tcp at host 10.240.26.52:7988 was reset: is it running a compatible version of Lustre and is 10.240.26.52@tcp one of its NIDs?
Lustre: lustre-MDT0001: Received new MDS connection from 10.240.26.53@tcp, keep former export from same NID
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0001: Client lustre-MDT0001-lwp-OST0000_UUID (at 10.240.26.52@tcp) reconnecting
Lustre: lustre-MDT0003: Client lustre-MDT0003-lwp-OST0000_UUID (at 10.240.26.52@tcp) reconnecting
Lustre: lustre-MDT0001: Client lustre-MDT0001-lwp-OST0001_UUID (at 10.240.26.52@tcp) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774746/real 1661774746] req@0000000091fa10c7 x1742478279715200/t0(0) o400->lustre-MDT0003-osp-MDT0001@0@lo:24/4 lens 224/224 e 0 to 1 dl 1661774757 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 10490:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 28 previous similar messages
Lustre: lustre-MDT0003-osp-MDT0001: Connection to lustre-MDT0003 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 15 previous similar messages
Lustre: 25229:0:(service.c:2156:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 8s req@00000000da6ac881 x1742478279715904/t0(0) o41->lustre-MDT0001-mdtlov_UUID@0@lo:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'osp-pre-3-1.0'
Lustre: mdt_out: This server is not able to keep up with request traffic (cpu-bound).
Lustre: 25229:0:(service.c:1612:ptlrpc_at_check_timed()) earlyQ=4 reqQ=0 recA=2, svcEst=2, delay=8216ms
Lustre: 25229:0:(service.c:1378:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)? req@000000009109ac18 x1742478279715840/t0(0) o41->lustre-MDT0003-mdtlov_UUID@0@lo:127/0 lens 224/0 e 0 to 0 dl 1661774757 ref 2 fl New:/0/ffffffff rc 0/-1 job:'osp-pre-1-3.0'
Lustre: lustre-MDT0000-osp-MDT0001: Connection restored to 10.240.26.53@tcp (at 10.240.26.53@tcp)
Lustre: lustre-MDT0003: Client lustre-MDT0003-lwp-OST0002_UUID (at 10.240.26.52@tcp) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: MGC10.240.26.53@tcp: Connection restored to 10.240.26.53@tcp (at 10.240.26.53@tcp)
Lustre: lustre-MDT0001: Received new MDS connection from 10.240.26.53@tcp, keep former export from same NID
Lustre: lustre-MDT0002-osp-MDT0003: Connection restored to 10.240.26.53@tcp (at 10.240.26.53@tcp)
Lustre: Skipped 14 previous similar messages
LustreError: 51051:0:(service.c:2289:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.240.26.53@tcp: deadline 20/1s ago req@000000008c60fa42 x1742478259713600/t0(0) o38->lustre-MDT0002-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 51051:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/2s); client may timeout req@000000008c60fa42 x1742478259713600/t0(0) o38->lustre-MDT0002-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-lwp-MDT0001: Connection to lustre-MDT0000 (at 10.240.26.53@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
LustreError: 51051:0:(service.c:2289:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.240.26.53@tcp: deadline 20/4s ago req@00000000886cba67 x1742478259721728/t0(0) o38->lustre-MDT0000-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 51051:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/4s); client may timeout req@00000000886cba67 x1742478259721728/t0(0) o38->lustre-MDT0000-mdtlov_UUID@10.240.26.53@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774766 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0003: Received new MDS connection from 0@lo, keep former export from same NID
Lustre: 25225:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/4s); client may timeout req@0000000027a924f0 x1742478057310272/t0(0) o38->lustre-MDT0003-lwp-OST0002_UUID@10.240.26.52@tcp:0/0 lens 520/416 e 0 to 0 dl 1661774771 ref 1 fl Complete:H/0/0 rc 0/0 job:'kworker/u4:4.0'
LustreError: 28602:0:(service.c:2289:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.240.26.52@tcp: deadline 20/4s ago req@0000000073732b89 x1742478057310336/t0(0) o38->lustre-MDT0001-lwp-OST0003_UUID@10.240.26.52@tcp:0/0 lens 520/0 e 0 to 0 dl 1661774771 ref 1 fl Interpret:H/0/ffffffff rc 0/-1 job:'kworker/u4:4.0'
Lustre: 25225:0:(service.c:2327:ptlrpc_server_handle_request()) Skipped 8 previous similar messages
LustreError: 28602:0:(service.c:2289:ptlrpc_server_handle_request()) Skipped 7 previous similar messages
Lustre: lustre-MDT0001: Client c712ec83-3787-4382-8825-3be32b653b63 (at 10.240.26.50@tcp) reconnecting
Lustre: lustre-MDT0001: Received new MDS connection from 0@lo, keep former export from same NID
Lustre: lustre-MDT0000-lwp-MDT0001: Connection restored to 10.240.26.53@tcp (at 10.240.26.53@tcp)
Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1661774726/real 1661774726] req@000000001ef74094 x1742478279708608/t0(0) o400->lustre-MDT0000-lwp-MDT0001@10.240.26.53@tcp:12/10 lens 224/224 e 0 to 1 dl 1661774773 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: Skipped 4 previous similar messages
Lustre: 10489:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Lustre: lustre-MDT0001: Export 00000000ed4f2d78 already connecting from 10.240.26.53@tcp
Lustre: lustre-MDT0001: Export 000000001c79cc3f already connecting from 10.240.26.52@tcp
Lustre: Skipped 4 previous similar messages
Lustre: lustre-MDT0003: Export 000000000e911a8c already connecting from 10.240.26.51@tcp
Lustre: Skipped 2 previous similar messages
Lustre: 25226:0:(service.c:2327:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/8s); client may timeout req@000000003272a9da x1742478057309952/t0(0) o38->lustre-MDT0003-lwp-OST0000_UUID@10.240.26.52@tcp:0/0 lens 520/416 e 0 to 0 dl 1661774771 ref 1 fl Complete:H/0/0 rc 0/0 job:'kworker/u4:4.0'
Lustre: 25226:0:(service.c:2327:ptlrpc_server_handle_request()) Skipped 7 previous similar messages
Lustre: lustre-MDT0001: Received new MDS connection from 10.240.26.53@tcp, keep former export from same NID
Lustre: Skipped 3 previous similar messages
Lustre: lustre-MDT0003-osp-MDT0001: Connection restored to (at 0@lo)
Lustre: Skipped 2 previous similar messages
Lustre: lustre-MDT0001: Client lustre-MDT0001-lwp-OST0001_UUID (at 10.240.26.52@tcp) reconnecting
Lustre: Skipped 22 previous similar messages
Link to test
recovery-mds-scale test failover_mds: failover MDS
watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [khugepaged:31]
Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover virtio_blk failover
CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OE --------- - - 4.18.0-240.22.1.el8_3.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:copy_page+0x7/0x10
Code: ff c3 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 66 90 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 44 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
RSP: 0018:ffffb5c780733d48 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
RAX: 00000000346ee867 RBX: ffffe3ff808cdd40 RCX: 0000000000000200
RDX: 7fffffffcb911798 RSI: ffff9cf5346ee000 RDI: ffff9cf523375000
RBP: 0000556d77f75000 R08: ffffe3ff80e1d418 R09: ffff9cf5bffd0000
R10: 00000000000305c0 R11: ffffffffffffffe8 R12: ffffe3ff80d1bb80
R13: ffff9cf53f5e9ba8 R14: ffff9cf5bfd54740 R15: ffff9cf5933b9ae0
FS: 0000000000000000(0000) GS:ffff9cf5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000587000 CR3: 000000005400a006 CR4: 00000000000606f0
Call Trace:
collapse_huge_page+0x6b6/0xf10
khugepaged+0xb5b/0x1150
? finish_wait+0x80/0x80
? collapse_huge_page+0xf10/0xf10
kthread+0x112/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x35/0x40
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 31 Comm: khugepaged Kdump: loaded Tainted: G OEL --------- - - 4.18.0-240.22.1.el8_3.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x80
panic+0xe7/0x2a9
? __switch_to_asm+0x51/0x70
watchdog_timer_fn.cold.8+0x85/0x9e
? watchdog+0x30/0x30
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x220
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:copy_page+0x7/0x10
Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bi
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on trevis-78vm3
Lustre: DEBUG MARKER: Started client load: dd on trevis-78vm3
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on trevis-78vm4
Lustre: DEBUG MARKER: Started client load: tar on trevis-78vm4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dbench on trevis-78vm5
Lustre: DEBUG MARKER: Started client load: dbench on trevis-78vm5
Lustre: DEBUG MARKER: cat /tmp/client-load.pid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641907486/real 1641907489] req@00000000c6d2f1fa x1721664603565184/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641907493 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xd9cf9bdad7eabcc7 to 0x2c0667baa92fef98
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000009fd8dce5 x1721664595309952/t4294967308(4294967308) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641907567 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 1 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000008a5c099e x1721664595310144/t4294967302(4294967302) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641907639 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641907466/real 1641907466] req@000000004175aae9 x1721664599615616/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641907473 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641907461/real 1641907461] req@00000000207787dc x1721664599397696/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641907468 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=177 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=177 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908681/real 1641908699] req@00000000d19744ea x1721664843220928/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908725 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908687/real 1641908703] req@00000000f17e3dff x1721664843639808/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908731 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908692/real 1641908707] req@0000000047f02f32 x1721664843647168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908736 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641908697/real 1641908714] req@000000002f64aea5 x1721664843979008/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908741 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000eb58fa83 x1721664813324544/t8590036363(8590036363) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641908834 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641908676/real 1641908676] req@0000000072114042 x1721664843109632/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641908720 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 1 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 1 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1384 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1384 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641909877/real 1641909895] req@000000004ce8ca2e x1721665108794048/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641909921 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641909856/real 1641909856] req@0000000094fa6add x1721665108785472/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641909900 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: 26040:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641909860/real 1641909860] req@00000000f7b9087f x1721665108793024/t0(0) o41->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/368 e 0 to 1 dl 1641909904 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'lfs.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641909867/real 1641909867] req@00000000409db90a x1721665108793536/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641909911 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 2 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 2 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=2547 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=2547 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641911083/real 1641911099] req@000000003002e1f6 x1721665315873216/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641911127 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641911098/real 1641911101] req@000000007e3b24ba x1721665318205952/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641911142 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641911072/real 1641911072] req@0000000037bacdd0 x1721665314697024/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641911116 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x2c0667baa92fef98 to 0xd9ce118587b743e9
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 2 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 2 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=3771 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=3771 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641912293/real 1641912310] req@0000000085aa58ab x1721665542839360/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641912337 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641912308/real 1641912311] req@000000007652bc10 x1721665544825472/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641912352 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641912287/real 1641912287] req@00000000d95fc45d x1721665541615872/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641912331 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xd9ce118587b743e9 to 0xdd61fd8df9f947bf
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 3 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 3 times, and counting...
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000ec1adff4 x1721665510059136/t17180064190(17180064190) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641912502 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=5033 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=5033 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913478/real 1641913494] req@0000000031b9d48f x1721665750135168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913522 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913483/real 1641913502] req@000000003688dcb8 x1721665750181056/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913527 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913489/real 1641913506] req@0000000034c73fab x1721665750842560/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913533 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641913494/real 1641913510] req@000000000b555843 x1721665753307136/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913538 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000076b44c7 x1721665732147776/t21474933494(21474933494) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641913666 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641913458/real 1641913458] req@0000000019fe68fe x1721665748463488/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641913502 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 4 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 4 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=6215 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=6215 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914683/real 1641914702] req@00000000d837a634 x1721666010161344/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914727 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914688/real 1641914706] req@000000004d2420f6 x1721666011103168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914732 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914693/real 1641914711] req@000000006b628bfe x1721666011110656/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914737 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641914698/real 1641914715] req@000000000cbb3c5c x1721666011128448/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914742 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641914678/real 1641914678] req@00000000110bf663 x1721666010153344/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641914722 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 5 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 5 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=7407 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=7407 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641915888/real 1641915907] req@00000000f3f2ddae x1721666178773568/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915932 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641915893/real 1641915911] req@00000000b4dafc3f x1721666179247936/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915937 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641915899/real 1641915915] req@000000004c3d765e x1721666179761664/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915943 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641915873/real 1641915873] req@00000000bc8623bc x1721666178289024/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641915917 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 6 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 6 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=8622 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=8622 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641917078/real 1641917094] req@00000000eb61a38f x1721666447615744/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917122 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641917083/real 1641917103] req@0000000092968078 x1721666447623168/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917127 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641917088/real 1641917107] req@0000000088928f67 x1721666448321280/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917132 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641917068/real 1641917068] req@00000000ebbd8e4e x1721666447298496/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641917112 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000033f8e376 x1721666414096960/t34359811852(34359811852) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641917266 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 7 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 7 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=9814 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=9814 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641918282/real 1641918299] req@0000000098236704 x1721666734036288/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918326 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641918288/real 1641918307] req@000000007efd3cef x1721666734301824/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918332 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641918298/real 1641918314] req@0000000087b6b5f1 x1721666736951424/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918342 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000080c5f413 x1721666707513600/t38654767736(38654767736) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641918457 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641918272/real 1641918272] req@00000000a9b732ee x1721666731912192/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641918316 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 8 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 8 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=11009 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=11009 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641919494/real 1641919511] req@00000000d364b87a x1721667001712640/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919502 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641919499/real 1641919515] req@000000000c58febb x1721667002386688/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919507 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641919504/real 1641919523] req@00000000de51ad0f x1721667003042112/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919512 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0xdd61fd8df9f947bf to 0xd5cb733245e0bbfb
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641919489/real 1641919489] req@0000000087b6b5f1 x1721667001704384/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641919497 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 3 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 3 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=12212 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=12212 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920704/real 1641920722] req@00000000ab7a5f22 x1721667276393216/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641920711 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920709/real 1641920727] req@00000000d3db9bd3 x1721667277290048/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920755 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920714/real 1641920731] req@0000000082c3aa73 x1721667280550208/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920760 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641920720/real 1641920739] req@000000002ef8c9f3 x1721667282582720/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920766 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641920699/real 1641920699] req@000000000d4d3db0 x1721667273177280/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641920745 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xd5cb733245e0bbfb to 0x61cff7c6f92a222f
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 4 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 4 times, and counting...
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=13470 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=13470 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641921868/real 1641921868] req@00000000c09f8d42 x1721667529674944/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641921875 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
LustreError: 52548:0:(lmv_obd.c:1262:lmv_statfs()) lustre-MDT0000-mdc-ffff9cf5557a4800: can't stat MDS #0: rc = -11
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641921873/real 1641921873] req@000000008089a10b x1721667529686720/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641921881 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x61cff7c6f92a222f to 0x6a8dbfe4dffe5d86
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 5 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 5 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=14578 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=14578 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923110/real 1641923127] req@000000008089a10b x1721667826652736/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641923117 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923115/real 1641923134] req@00000000ce3e7492 x1721667828640192/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923161 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923120/real 1641923138] req@00000000674a618f x1721667830246208/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923166 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641923125/real 1641923142] req@00000000faa952d1 x1721667831969024/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923171 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641923100/real 1641923100] req@000000007c32f4e6 x1721667824529664/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923146 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641923105/real 1641923105] req@000000005ce0674f x1721667825720000/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641923151 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0x6a8dbfe4dffe5d86 to 0x28cd6ee3b8635303
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 6 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 6 times, and counting...
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000029ee2b09 x1721667783457216/t47244859380(47244859380) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641923337 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=15869 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=15869 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641924290/real 1641924306] req@00000000c0004f41 x1721668084521408/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924334 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641924295/real 1641924315] req@0000000096416ba7 x1721668086979712/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924339 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641924305/real 1641924323] req@0000000039ca71e7 x1721668089829184/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924349 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000001931ba9f x1721668060114368/t51539679322(51539679322) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641924478 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641924285/real 1641924285] req@000000009e7e0a87 x1721668081339968/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641924329 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 9 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 9 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=17027 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=17027 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 60759:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641925464/real 1641925464] req@0000000096416ba7 x1721668427720064/t0(0) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/1152 e 0 to 1 dl 1641925472 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'rm.0'
Lustre: 60759:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641925471/real 1641925471] req@00000000688a388e x1721668427721024/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641925479 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x28cd6ee3b8635303 to 0x3cce1aafc4a35f99
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 7 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 7 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=18171 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=18171 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926709/real 1641926726] req@00000000a9f791cc x1721668705866688/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641926716 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926715/real 1641926730] req@000000002f460642 x1721668707760960/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926723 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926720/real 1641926738] req@0000000081437b8a x1721668709333184/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926728 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641926725/real 1641926743] req@00000000e2da85cf x1721668710645568/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926733 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641926689/real 1641926689] req@00000000f41ba24b x1721668700215744/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641926697 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0x3cce1aafc4a35f99 to 0x15ba538cc0056173
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000002e608d0f x1721668652652736/t38654782649(38654782649) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641926841 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 10 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 10 times, and counting...
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=19450 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=19450 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641927888/real 1641927892] req@000000006dfad8a4 x1721668949227712/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641927934 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641927877/real 1641927877] req@00000000fd61f3cb x1721668948092928/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641927923 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000934e0a33 x1721668925025728/t60129654331(60129654331) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641928099 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 11 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 11 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=20637 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=20637 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641929092/real 1641929110] req@000000009766d5b0 x1721669309189824/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641929138 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641929108/real 1641929111] req@000000002bb95a55 x1721669311119040/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641929154 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000934e0a33 x1721669283540416/t64424554980(64424554980) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641929204 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641929087/real 1641929087] req@0000000092cc6708 x1721669307531904/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641929133 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 12 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 12 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=21754 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=21754 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641930464/real 1641930464] req@0000000016c6bc01 x1721669670936640/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641930472 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1641930474/real 0] req@0000000028c3a584 x1721669671710464/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641930482 ref 2 fl Rpc:XNr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 13 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 13 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=23102 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=23102 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641931478/real 1641931478] req@00000000fc3e9eeb x1721669906480448/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641931485 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: 72069:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641931478/real 1641931478] req@00000000f0aede40 x1721669906481280/t0(0) o35->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:23/10 lens 392/624 e 0 to 1 dl 1641931485 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'dd.0'
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641931483/real 1641931483] req@00000000b303f4ba x1721669906482048/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641931491 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x15ba538cc0056173 to 0x30c19e657b56c98a
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 8 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 8 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=24189 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=24189 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641932722/real 1641932738] req@0000000074bf8636 x1721670157398656/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641932766 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641932738/real 1641932741] req@000000008324bf3d x1721670159077248/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641932782 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641932702/real 1641932702] req@000000002351ccaf x1721670153540800/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641932746 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0x30c19e657b56c98a to 0xe8ea3230671f3e6e
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000b20c946a x1721670113932928/t47244700810(47244700810) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641932870 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 9 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 9 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=25433 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=25433 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641933902/real 1641933919] req@000000002351ccaf x1721670408059584/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641933910 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641933917/real 1641933920] req@00000000332edb71 x1721670409194112/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641933925 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000fc3e9eeb x1721670384152000/t77309482894(77309482894) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641933984 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641933897/real 1641933897] req@00000000f2173751 x1721670406085376/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641933905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 14 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 14 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=26571 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=26571 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641935098/real 1641935115] req@00000000ee4b96e7 x1721670691929984/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935144 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641935113/real 1641935116] req@0000000048971edb x1721670692936384/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935159 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000007aa69edf x1721670680872640/t81604452937(81604452937) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641935209 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 15 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 15 times, and counting...
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641935083/real 1641935083] req@00000000819367e5 x1721670691523520/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935129 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641935093/real 1641935093] req@00000000d09005e0 x1721670691874560/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641935139 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=27761 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=27761 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936305/real 1641936323] req@00000000bc77343d x1721671041002304/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936313 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936310/real 1641936327] req@00000000a101f6de x1721671041968000/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936318 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936316/real 1641936331] req@00000000ba53cb5c x1721671042863424/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936324 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641936321/real 1641936337] req@0000000037bb6f6b x1721671044140032/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936329 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0xe8ea3230671f3e6e to 0x89e752e48866fbd8
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000902a45de x1721671013056640/t51539805408(51539805408) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641936438 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641936290/real 1641936290] req@00000000f6b90054 x1721671039508416/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641936298 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 10 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 10 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=29029 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=29029 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937520/real 1641937539] req@00000000a7007eec x1721671346967808/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641937527 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937525/real 1641937542] req@00000000e989cf6d x1721671347854336/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937533 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937530/real 1641937546] req@0000000008c3f8cd x1721671348861376/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937538 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641937535/real 1641937552] req@00000000768df730 x1721671349650944/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937543 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0x89e752e48866fbd8 to 0x7c968768543a351c
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000099c2ad2a x1721671311053120/t55834665810(55834665810) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 648/600 e 0 to 0 dl 1641937645 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641937515/real 1641937515] req@00000000c29f8e3f x1721671345343104/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641937523 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 16 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 16 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=30262 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=30262 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641938704/real 1641938719] req@0000000053740272 x1721671704752832/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641938712 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641938719/real 1641938722] req@00000000902a45de x1721671709414656/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641938727 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x7c968768543a351c to 0xb9fcf8380bb0a672
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000099b6b4ad x1721671696766592/t60129597602(60129597602) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641938868 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641938683/real 1641938683] req@000000008202ccd3 x1721671699893888/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641938691 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 11 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 11 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=31427 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=31427 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641939893/real 1641939893] req@000000000848fa73 x1721672072210176/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641939900 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641939893/real 1641939893] req@00000000b1bf3629 x1721672072210240/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641939901 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641939899/real 1641939899] req@000000009eaf20ad x1721672072210432/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641939907 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xb9fcf8380bb0a672 to 0x2869ec10b340e9b3
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000678ab890 x1721672027449408/t64424575609(64424575609) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 608/600 e 0 to 0 dl 1641940026 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 17 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 17 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=32615 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=32615 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641941105/real 1641941123] req@00000000c44efa57 x1721672394969536/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641941151 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@000000009f4098ad x1721672393424256/t94489318772(94489318772) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641941225 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641941095/real 1641941095] req@00000000d02e1b64 x1721672394954304/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641941141 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 18 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 18 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=33777 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=33777 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 96733:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641942288/real 1641942288] req@00000000fee45b25 x1721672803164480/t0(0) o35->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:23/10 lens 392/624 e 0 to 1 dl 1641942295 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'dd.0'
Lustre: 96733:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641942290/real 1641942290] req@00000000a10aeceb x1721672803165312/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641942336 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641942295/real 1641942295] req@0000000082f46ed6 x1721672803165568/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641942341 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 19 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 19 times, and counting...
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000781ee55e x1721672703195712/t98784297415(98784297415) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641942413 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=34943 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=34943 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943529/real 1641943547] req@00000000e28bc600 x1721673181356992/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943573 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943534/real 1641943551] req@0000000025825f90 x1721673181608704/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943578 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943539/real 1641943555] req@00000000d3c205e0 x1721673182036416/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943583 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641943545/real 1641943561] req@0000000044cf5f33 x1721673182173504/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943589 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641943514/real 1641943514] req@0000000072791521 x1721673179968320/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641943558 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000cf93a0d1 x1721673106339392/t103079271407(103079271407) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641943635 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 20 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 20 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=36184 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=36184 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1641944738/real 0] req@00000000aaf6af78 x1721673430074304/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641944745 ref 2 fl Rpc:XNr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641944738/real 1641944755] req@000000004a07fb0d x1721673430074368/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944784 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641944744/real 1641944759] req@00000000412b9686 x1721673430623616/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944790 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641944718/real 1641944718] req@00000000dc575401 x1721673424520640/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944764 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641944728/real 1641944728] req@00000000d011aadf x1721673428338816/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641944774 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x2869ec10b340e9b3 to 0xb30e277e7821f89b
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 12 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 12 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=37422 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=37422 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641945955/real 1641945958] req@000000004be3d4e0 x1721673659613184/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641946001 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641945935/real 1641945935] req@000000002071aac8 x1721673654736384/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641945981 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641945929/real 1641945929] req@00000000aa49c1ef x1721673653445696/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641945975 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xb30e277e7821f89b to 0xbf5b38fddc4dd777
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 13 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 13 times, and counting...
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=38652 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=38652 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641947117/real 1641947135] req@00000000a97a2781 x1721673829299584/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947161 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641947122/real 1641947139] req@00000000fa7aecb0 x1721673829722432/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947166 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641947127/real 1641947146] req@00000000a37f3a15 x1721673829768192/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947171 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000243e21ee x1721673813906688/t111669187631(111669187631) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641947243 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 21 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 21 times, and counting...
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641947112/real 1641947112] req@0000000061c4c4c9 x1721673829291584/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641947156 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=39787 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=39787 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948327/real 1641948342] req@00000000914b72f4 x1721674006028800/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948371 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948332/real 1641948350] req@00000000f492794c x1721674006084224/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948376 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948337/real 1641948354] req@000000002d750631 x1721674006340736/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948381 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641948342/real 1641948358] req@00000000d62fe959 x1721674006349248/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948386 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000375937ab x1721673981731200/t115964239485(115964239485) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641948442 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641948321/real 1641948321] req@000000002864288b x1721674006020480/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641948365 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 22 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 22 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=40988 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=40988 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641949546/real 1641949550] req@00000000f492794c x1721674247782976/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641949590 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000a37f3a15 x1721674221628864/t120259150764(120259150764) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641949675 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 23 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 23 times, and counting...
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641949536/real 1641949536] req@00000000fad0c7f3 x1721674247165440/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641949580 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=42221 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=42221 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950728/real 1641950748] req@00000000b51ebf25 x1721674522467776/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641950735 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950733/real 1641950752] req@00000000aacc57fa x1721674523822144/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950741 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950738/real 1641950755] req@000000006a4f78e8 x1721674524402560/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950746 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641950743/real 1641950759] req@00000000957542c6 x1721674525328128/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950751 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000603e1b8b x1721674500216512/t77309801101(77309801101) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 648/600 e 0 to 0 dl 1641950832 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'df.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641950713/real 1641950713] req@00000000ccfd96b0 x1721674517642432/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950721 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641950723/real 1641950723] req@000000000b663dd6 x1721674520721920/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641950731 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 14 times, and counting...
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0xbf5b38fddc4dd777 to 0xe35e4df03c609009
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: mds1 has failed over 14 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=43420 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=43420 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641951930/real 1641951947] req@00000000595fa783 x1721674779246144/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641951937 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641951945/real 1641951948] req@0000000036f58690 x1721674782792896/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641951991 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xe35e4df03c609009 to 0x813117028d1b09ea
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@00000000ec947795 x1721674758516096/t81604486357(81604486357) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 576/600 e 0 to 0 dl 1641952080 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641951919/real 1641951919] req@00000000a9e49b40 x1721674776412800/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641951965 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 15 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 15 times, and counting...
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000087835faf x1721674758515584/t124554160968(124554160968) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 608/600 e 0 to 0 dl 1641952131 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'lfs.0'
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=44654 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=44654 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953140/real 1641953158] req@0000000047c592da x1721675041788352/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953148 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953145/real 1641953162] req@00000000aee522e5 x1721675042637376/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953153 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953150/real 1641953167] req@00000000fca5e081 x1721675044588480/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953158 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641953155/real 1641953172] req@000000000ea49979 x1721675045509056/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953163 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 24 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 24 times, and counting...
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641953130/real 1641953130] req@00000000380bb76f x1721675039164416/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641953138 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=45800 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=45800 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641954323/real 1641954339] req@0000000008fb39ac x1721675289991232/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.144@tcp:26/25 lens 224/224 e 0 to 1 dl 1641954330 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641954328/real 1641954343] req@00000000a37f3a15 x1721675292029632/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641954336 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641954333/real 1641954351] req@000000000ddc4cbd x1721675292529344/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641954341 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: Evicted from MGS (at 10.240.42.143@tcp) after server handle changed from 0x813117028d1b09ea to 0xed24f2b7899fd13c
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000071d7a2d2 x1721675285146624/t85899532328(85899532328) o101->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641954452 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641954318/real 1641954318] req@00000000a734bdbe x1721675287693120/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641954326 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 16 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 16 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=47044 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=47044 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641955520/real 1641955520] req@00000000af84ce89 x1721675539534912/t0(0) o400->MGC10.240.42.143@tcp@10.240.42.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1641955527 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.143@tcp) was lost; in progress operations using this service will fail
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641955544/real 1641955562] req@0000000024960aeb x1721675539537472/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955590 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641955520/real 1641955520] req@000000001e8a4178 x1721675539535040/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955566 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641955554/real 1641955570] req@000000003af6518b x1721675539537856/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955600 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641955533/real 1641955533] req@00000000e962f5bc x1721675539537088/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641955579 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: Evicted from MGS (at 10.240.42.144@tcp) after server handle changed from 0xed24f2b7899fd13c to 0x1753d2d5422b7f62
Lustre: MGC10.240.42.143@tcp: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.144@tcp (at 10.240.42.144@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 17 times, and counting...
Lustre: DEBUG MARKER: mds1 has failed over 17 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=48242 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=48242 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds2 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0001.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0001-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds2
Lustre: DEBUG MARKER: Starting failover on mds2
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641956742/real 1641956758] req@00000000784f52c3 x1721675862288704/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956788 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection to lustre-MDT0001 (at 10.240.42.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641956747/real 1641956767] req@00000000703807ed x1721675863299968/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956793 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641956752/real 1641956771] req@0000000013bc9dcb x1721675864695232/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956798 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
LustreError: 8296:0:(client.c:3180:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@0000000043e35ee8 x1721675808011008/t137439011322(137439011322) o101->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 576/600 e 0 to 0 dl 1641956862 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:'dd.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641956732/real 1641956732] req@00000000d53778aa x1721675859098432/t0(0) o400->lustre-MDT0001-mdc-ffff9cf5557a4800@10.240.42.143@tcp:12/10 lens 224/224 e 0 to 1 dl 1641956778 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Lustre: lustre-MDT0001-mdc-ffff9cf5557a4800: Connection restored to 10.240.42.143@tcp (at 10.240.42.143@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: trevis-78vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 has failed over 25 times, and counting...
Lustre: DEBUG MARKER: mds2 has failed over 25 times, and counting...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=49410 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=49410 DURATION=86400 PERIOD=1200
Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: trevis-78vm9.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL\|IDLE state before next failover
Lustre: DEBUG MARKER: Checking clients are in FULL|IDLE state before next failover
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL\|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm4.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm5.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: trevis-78vm3.trevis.whamcloud.com: executing wait_import_state_mount FULL|IDLE mdc.lustre-MDT0000-mdc-*.mds_server_uuid
Lustre: DEBUG MARKER: lctl get_param -n at_max
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
Lustre: DEBUG MARKER: Starting failover on mds1
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641957943/real 1641957959] req@000000002e072981 x1721676134565824/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957989 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: lustre-MDT0000-mdc-ffff9cf5557a4800: Connection to lustre-MDT0000 (at 10.240.42.144@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 166-1: MGC10.240.42.143@tcp: Connection to MGS (at 10.240.42.144@tcp) was lost; in progress operations using this service will fail
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1641957948/real 1641957968] req@0000000018025cea x1721676135709632/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957994 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 8297:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641957922/real 1641957922] req@000000000e75c9bc x1721676131893184/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957968 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1641957938/real 1641957938] req@0000000013bc9dcb x1721676134153664/t0(0) o400->lustre-MDT0000-mdc-ffff9cf5557a4800@10.240.42.144@tcp:12/10 lens 224/224 e 0 to 1 dl 1641957984 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
Lustre: 8298:0:(client.c:2290:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Link to test