post-mortem: f24 boot fails; need help.


William

24 May 2017

10:38 a.m.

Good morning,

The "f24 boot fails; need help" problem set me back a week. I'm still catching up. I seriously believe it would be foolish for me to just forget it. I should for the benefit of others try to get at the real cause and possible prevention.

A few hours before the failure, I received and looked at an e-mail that I'm almost certain was at least a spoof, and possibly malicious. I know it contained html and links. I did *** not *** click any of the links. I looked at it, and deleted it. It was viewed in Thunderbird only. The message's "From" ended with "yahoo.com". My question: It is highly improbable that that message had anything to do with the boot failure. Am I correct?

Also a few hours before the failure, I did some web browsing using Firefox with NoScript and uBlock Origin. As best as I recall, the "riskiest" sites that I visited were finance.yahoo.com (and a few of its sub-pages; I clicked no ads and no ad links) and indeed.com (and possibly a posting or two). My question: It is highly improbable that my web browsing had anything to do with the boot failure. Am I correct?

thanks, Bill.


Sam Varshavchik

24 May

12:17 p.m.

William writes:

...

A few hours before the failure, I received and looked at an e-mail that I'm almost certain was at least a spoof, and possibly malicious. I know it contained html and links. I did *** not *** click any of the links. I looked at it, and deleted it. It was viewed in Thunderbird only. The message's "From" ended with "yahoo.com". My question: It is highly improbable that that message had anything to do with the boot failure. Am I correct?

You are correct.

...

Also a few hours before the failure, I did some web browsing using Firefox with NoScript and uBlock Origin. As best as I recall, the "riskiest" sites that I visited were finance.yahoo.com (and a few of its sub-pages; I clicked no ads and no ad links) and indeed.com (and possibly a posting or two). My question: It is highly improbable that my web browsing had anything to do with the boot failure. Am I correct?

You are correct, again.

Rick Stevens

12:24 p.m.

On 05/24/2017 08:38 AM, William wrote:

...

Good morning,

The "f24 boot fails; need help" problem set me back a week. I'm still catching up. I seriously believe it would be foolish for me to just forget it. I should for the benefit of others try to get at the real cause and possible prevention.

A few hours before the failure, I received and looked at an e-mail that I'm almost certain was at least a spoof, and possibly malicious. I know it contained html and links. I did *** not *** click any of the links. I looked at it, and deleted it. It was viewed in Thunderbird only. The message's "From" ended with "yahoo.com". My question: It is highly improbable that that message had anything to do with the boot failure. Am I correct?

Also a few hours before the failure, I did some web browsing using Firefox with NoScript and uBlock Origin. As best as I recall, the "riskiest" sites that I visited were finance.yahoo.com (and a few of its sub-pages; I clicked no ads and no ad links) and indeed.com (and possibly a posting or two). My question: It is highly improbable that my web browsing had anything to do with the boot failure. Am I correct?

It is unlikely that you got infected by the email--especially if you didn't click on any links and you have "Show remote content" turned off in the mail client.

The browser is probably safe as well. I don't use Firefox (I use Chrome) and I have Adblock, uBlock, uMatrix and Privacy Badger enabled on it.

----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, AllDigital    ricks@alldigital.com -
- AIM/Skype: therps2        ICQ: 226437340           Yahoo: origrps2 -
-                                                                    -
-           Vegetarian: Old Indian word for "lousy hunter"           -
----------------------------------------------------------------------

William Mattison

11:20 p.m.

Thank-you Sam and Rick.

For the next 2 questions, I'm not looking for numerical answers. Qualitative probability terms on a scale going from "highly improbable" to "almost certainly" would be great.

The clock (and the CMOS battery) got some attention while trying to fix the boot problem. I have not yet replaced the battery, but I'm not seeing any problems. What is the likelihood that the battery or the clock caused the boot failure?

The boot failure occurred right after doing my weekly "dnf upgrade". What is the likelihood that the "dnf upgrade" (or one of the patches installed by it) caused the problem?

thanks, Bill.

Joe Zeff

25 May

1:40 a.m.

On 05/24/2017 09:20 PM, William Mattison wrote:

...

The clock (and the CMOS battery) got some attention while trying to fix the boot problem. I have not yet replaced the battery, but I'm not seeing any problems. What is the likelihood that the battery or the clock caused the boot failure?

If the battery's weak enough to mess up the CMOS, it's possible. However, long before that, your hardware clock will start to run slow. (This is actually a built-in feature. It's intended to let you know that it's time to change the battery.) If you can go into the CMOS setup before it tries to boot, see if everything looks right, and that the clock is right. If it's slow, turn things off without correcting it, and try again in a few hours. If it's farther behind, change the battery and see if that helps. I don't know if it's still true, but the Print Screen key used to work there, and if so, you can use it to get a printout of your settings to be used later if needed.

Rick Stevens

2:47 p.m.

On 05/24/2017 11:40 PM, Joe Zeff wrote:

...

On 05/24/2017 09:20 PM, William Mattison wrote:

...
The clock (and the CMOS battery) got some attention while trying to fix the boot problem. I have not yet replaced the battery, but I'm not seeing any problems. What is the likelihood that the battery or the clock caused the boot failure?

If the battery's weak enough to mess up the CMOS, it's possible. However, long before that, your hardware clock will start to run slow. (This is actually a built-in feature. It's intended to let you know that it's time to change the battery.) If you can go into the CMOS setup before it tries to boot, see if everything looks right, and that the clock is right. If it's slow, turn things off without correcting it, and try again in a few hours. If it's farther behind, change the battery and see if that helps. I don't know if it's still true, but the Print Screen key used to work there, and if so, you can use it to get a printout of your settings to be used later if needed.

I agree with Joe. I'd imagine the battery would only cause issues if the BIOS got messed up somehow. A slow clock wouldn't necessarily cause a boot issue--but you might get a lot of weird "file date is in the future" errors caused by the OS looking at the clock (which is slow) and comparing it against file dates (which were set using the correct time). This can also cause strangeness with LDAP and Kerberos authentication or Samba operations as they're time-sensitive.

Otherwise, with a weak battery the BIOS will usually revert to default settings which are generally considered conservative and "safe". If it only managed to partially set up the defaults, weird stuff can happen. Note that if you had modified the boot order from the BIOS-based defaults, then you'd usually see the boot stall at the device selection point. If you set some other things (disk caches, memory timings, wait states and other items which gamers tend to mess with) then it could cause additional funkiness.

Tim

26 May

6:52 a.m.

On Thu, 2017-05-25 at 12:47 -0700, Rick Stevens wrote:

...

Otherwise, with a weak battery the BIOS will usually revert to default settings which are generally considered conservative and "safe".

I'm not so sure that's the case. In many PCs, the BIOS clock, BIOS memory, and perhaps other BIOS hardware, are powered solely by the battery (even when the computer is running off mains power). So, with failing power you could have all manner of random things happen. Digital circuits don't work well when not fully powered.

If it had completely failed, then I might expect default settings to be adopted at power up - assuming that the computer would power up with a dead BIOS battery.

Some BIOSes, though, use an EEPROM as non-volatile memory, rather than just low-power RAM with a battery to keep it working, which makes losing the settings very unlikely. A friend of mine had a PC with a three-way switch to decide which BIOS settings to use when booting up, and if I recall correctly, two of them were stored in EEPROM. It was designed as a geek's motherboard: you could use the feature to have turbo settings, stable settings, and experimental settings, and always be able to boot up by flipping the switch if you'd changed something in a bad way.

If you believe your BIOS settings may have been scrambled, it may be a good idea to select the reset-to-defaults option, save, and then go back and set any personal options, so that every BIOS setting gets rewritten.

I'm still not convinced by the cargo-cult idea that the BIOS clock is actually designed to run slow, rather than that simply being a common side-effect. I've certainly had a motherboard where that effect did not happen.

Joe Zeff

1:40 p.m.

On 05/26/2017 04:52 AM, Tim wrote:

...

I'm still not convinced by the cargo-cult idea that the BIOS clock is actually designed to run slow, rather than that simply being a common side-effect. I've certainly had a motherboard where that effect did not happen.

I've had several slow-clock issues over the decades solved by changing the battery, and I've never had it not work. Just because you don't believe it doesn't make it "cargo-cult."

William Mattison

29 May

9:42 p.m.

Good evening,

Hardware problems have seriously tied me up for about a week now. My apologies for my silence on this topic. The hardware issue is not really fixed yet. I likely will be forced off-line again for several days to a few weeks. If I'm not responding, assume that that's what's happening.

The fix on Thursday, May 18 did not last. This past Thursday, my workstation again failed to boot. This time, it dropped me into an emergency shell, not the dracut shell. This time, the log file was almost twice as long. But it reported fsck failures again, this time on sda7 rather than sda6. So I tried what my friend did, but with "/dev/sda7" instead of "/dev/sda6" as the command parameter. I spent 30-45 minutes doing nothing but rapidly hitting the 'y' key before the command finally completed. (Apparently, hundreds of i-nodes were corrupted this time.) Then the workstation successfully booted.

I think I spent a week trying to get into BIOS. But I wasn't seeing a BIOS screen before the grub menu showed up. I think it was when I shut down and started up a different way that I finally saw the BIOS screen. I quickly changed the time for the BIOS screen from 2 seconds to 8 seconds. As suggested in this discussion, I checked the voltages and the clock. The voltages looked fine. The clock was about 5 seconds slow compared to my "atomic" clock. I adjusted that. This morning, the clock seemed barely noticeably slow compared to that atomic clock, but by less than a second. So I'm agreeing with your suspicions that the battery is getting low.

This morning, I tried to replace the battery. Most of the motherboard (ASUS Sabertooth Z77, bought in early 2013) is covered by a hard, dark gray plastic cover. The battery should be under that, below the graphics card socket. I could not find a way of getting that cover off. Neither the user's guide nor the support dvd provided any clues. The ASUS web site GUI for submitting a support request did not work. Any ideas?

If I have to replace the motherboard, will I have to re-install Fedora and windows-7 (it's a dual-boot system)?

I find it odd that this problem:
* did not seem to affect windows-7 (yet?).
* happened only immediately after doing my weekly Fedora patches ("dnf upgrade").
* did not occur for a week between the first and second occurrences.
* would corrupt so many i-nodes the second time.

Once the battery gets low enough, I'll have no access to the internet or this list. How can I get help if I need it? My problems will be beyond what my local IT friends can handle.

Thank-you for your help so far. Bill.

Joe Zeff

9:56 p.m.

On 05/29/2017 07:42 PM, William Mattison wrote:

...

The clock was about 5 seconds slow compared to my "atomic" clock. I adjusted that. This morning, the clock seemed barely noticeably slow compared to that atomic clock, but by less than a second. So I'm agreeing with your suspicions that the battery is getting low.

If your battery is getting low, it's just barely starting. Usually, when it becomes an issue, you see a change of minutes per day, not one or two seconds. Still, changing it can't hurt. However, your hard disk issues are making me wonder whether either the disk or the controller is at fault. It's clearly a hardware issue to me, but there are still several possibilities for just what's gone bad.
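To put a number on "minutes per day, not one or two seconds", an observed offset can be scaled to a daily drift rate. This is only a minimal sketch: the function name drift_per_day and the 12-hour interval in the example are assumptions made for illustration, not measurements reported in this thread.

```shell
#!/bin/sh
# drift_per_day OFFSET_SECONDS ELAPSED_HOURS
# Scale a clock offset observed after ELAPSED_HOURS to seconds per 24 hours.
# (Hypothetical helper, not part of any Fedora package.)
drift_per_day() {
    LC_ALL=C awk -v off="$1" -v hrs="$2" \
        'BEGIN { printf "%.1f\n", off * 24 / hrs }'
}

# Assumed reading: 1 second slow, 12 hours after the clock was last set.
drift_per_day 1 12
```

A result of a couple of seconds per day would be far below the minutes-per-day drift described above.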

Tim

30 May

12:44 p.m.

On Tue, 2017-05-30 at 02:42 +0000, William Mattison wrote:

...

The fix on Thursday, May 18 did not last. This past Thursday, my workstation again failed to boot. This time, it dropped me into an emergency shell, not the dracut shell. This time, the log file was almost twice as long. But it reported fsck failures again, this time on sda7 rather than sda6. So I tried what my friend did, but with "/dev/sda7" instead of "/dev/sda6" as the command parameter. I spent 30-45 minutes doing nothing but rapidly hitting the 'y' key before the command finally completed. (Apparently, hundreds of i-nodes were corrupted this time.) Then the workstation successfully booted.

I think I spent a week trying to get into BIOS. But I wasn't seeing a BIOS screen before the grub menu showed up. I think it was when I shut down and started up a different way that I finally saw the BIOS screen. I quickly changed the time for the BIOS screen from 2 seconds to 8 seconds. As suggested in this discussion, I checked the voltages and the clock. The voltages looked fine. The clock was about 5 seconds slow compared to my "atomic" clock. I adjusted that. This morning, the clock seemed barely noticeably slow compared to that atomic clock, but by less than a second. So I'm agreeing with your suspicions that the battery is getting low.

Actually, I wouldn't call the BIOS clock being 5 seconds off much to worry about (with regard to the battery). They're not particularly accurate to begin with, about on a par with a cheap wristwatch. However, if your battery is a few years old, you may as well replace it now that you're in the mood to do so. They do have a finite lifespan.

If the BIOS voltage monitors say the voltages are fine, they probably are. Though they're not always super accurate, either. Software that lets you read these values when the OS is running needs to modify them with correction factors.

Since you talk about many file system errors, and difficulty booting, I'm inclined to point the finger at the main power supply. If it's not up to the task of powering everything, or is randomly glitching, that could cause all sorts of instabilities.

Though, as you're taking things apart, it may well be a good idea to unplug everything and reconnect it, just to exercise the connections (cards, RAM, cables, etc). Cards have a habit of walking out due to thermal changes, or mechanical stress when moving a flimsy case around. Clean any exposed slots (e.g. unused PCIe slots).

I just did this simple search, and there are even videos of how to change the battery, right at the top. Though I don't think much of one person's crude "cut through the cover" technique. https://www.google.com.au/search?q=ASUS+Sabertooth+Z77+bios+battery

This seems more sensible: https://www.youtube.com/watch?v=aSTTR_WVtx0 long video, but he's done it by 4 minutes in.

Perhaps ASUS thinks that by the time the battery is crapping out, you'll have reached the stage of wanting to buy a newer PC.


William Mattison

11:09 p.m.

I wasn't fully convinced these problems are due to the battery. That's why I listed the four things I found "odd". On the other hand, I recall hearing and reading that the output of lithium batteries is almost flat (better than any other type of battery), but then very quickly drops (faster than any other type of battery) as it reaches end-of-life.

Back to diagnosing the real cause of the problems...

Is there a Fedora command that I can use to check the hard drive (not the file systems) for bad blocks, sectors, tracks, etc? Is there a Fedora command that I can use to check the controller?

Both problems occurred immediately after doing a "dnf upgrade". What is that telling us? Does "dnf upgrade" access the hard drive or the controller in a way that normal daily use does not? Is there something different about the first boot after a "dnf upgrade" vs other boots? I shut down every night, and boot up every morning.

When I bought the system 4+ years ago, I bought separate parts. This is a DIY desktop. I was advised to buy more power supply than needed. I did so. So unless the power supply is failing, I would think it's not a good candidate for the cause of the two problems. There have been no problems until this month, and I've been doing weekly patches since I got the system in 2013.

I was/am not in the mood to change the battery! Since I've already bought the new one and have no other use for it, and since the old one is 4+ years old, I plan to change the battery either Friday or Saturday. But you know what they say: "If you want to make God laugh, tell Him your plans!". I did watch the youtube that Tim provided. I don't recall seeing screws on the underside of the motherboard. I'll look again Friday or Saturday (God willing!).

Thanks, Bill.

Rick Leir

31 May

2:13 a.m.

Bill,

Power supplies can fail at any time, and they are less reliable than any other parts in my PCs.

PCs are more reliable if you leave them on, configured to go into sleep mode when left unused (this statement will spark a discussion).

Most spinning disk drives these days support S.M.A.R.T. monitoring through smartd and smartctl.

http://www.linuxjournal.com/magazine/monitoring-hard-disks-smart

An exception would be hardware RAID (shown below); its manufacturer would supply management tools.

With smartctl, expect the failure counts to be non-zero: all disks have errors which get remapped. The error counts can be alarming even though the disk is fine for normal use. If you see any sudden increase in errors, you will want to save some good backups Right Soon Now.

Spinning disks often fail gradually over days or weeks. SSDs can suddenly drop completely, with no remedy.
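One way to watch for that kind of sudden increase is to pull a few raw counters out of the attribute table that "smartctl -A" prints. The following is a sketch under assumptions: the helper name smart_key_attrs is made up, the attribute names are the ones smartctl commonly reports for ATA disks (they vary by vendor), and the sample lines are hypothetical rather than taken from anyone's disk in this thread.

```shell
#!/bin/sh
# smart_key_attrs: read "smartctl -A" output on stdin and print
# "ATTRIBUTE_NAME RAW_VALUE" for three sector-health counters.
# Assumes the usual 10-column ATA attribute table, where column 2 is the
# attribute name and column 10 is the raw value.
smart_key_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2, $10
    }'
}

# Hypothetical sample input, for illustration only.
printf '%s\n' \
    '  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0' \
    '  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       13813' \
    '197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8' |
    smart_key_attrs
```

On a real system the input would come from something like "smartctl -A /dev/sda | smart_key_attrs" run as root; saving the result with a date each week makes a jump in the counts easy to spot.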

cheers -- Rick

--RAID--

=== START OF INFORMATION SECTION ===
Vendor: HP
Product: LOGICAL VOLUME
Revision: 3.66
User Capacity: 1,200,186,941,440 bytes [1.20 TB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Logical Unit id: 0x600508b1001c4a3abee7e559b116e419
Serial number: 50014380145ECE10
Device type: disk
Local Time is: Wed May 31 01:58:57 2017 CDT
SMART support is: Unavailable - device lacks SMART capability.

--SSD--

=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 850 EVO 250GB
Serial Number: S21NNS999999
..
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
..

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F3
Device Model: SAMSUNG HD103SJ
Serial Number: S2QPJ9KB99999
..
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
..

----

On 2017-05-31 12:09 AM, William Mattison wrote:

...

I wasn't fully convinced these problems are due to the battery. That's why I listed the four things I found "odd". On the other hand, I recall hearing and reading that the output of lithium batteries is almost flat (better than any other type of battery), but then very quickly drops (faster than any other type of battery) as it reaches end-of-life.

Back to diagnosing the real cause of the problems...

Is there a Fedora command that I can use to check the hard drive (not the file systems) for bad blocks, sectors, tracks, etc? Is there a Fedora command that I can use to check the controller?

Both problems occurred immediately after doing a "dnf upgrade". What is that telling us? Does "dnf upgrade" access the hard drive or the controller in a way that normal daily use does not? Is there something different about the first boot after a "dnf upgrade" vs other boots? I shut down every night, and boot up every morning.

When I bought the system 4+ years ago, I bought separate parts. This is a DIY desktop. I was advised to buy more power supply than needed. I did so. So unless the power supply is failing, I would think it's not a good candidate for the cause of the two problems. There have been no problems until this month, and I've been doing weekly patches since I got the system in 2013.

I was/am not in the mood to change the battery! Since I've already bought the new one and have no other use for it, and since the old one is 4+ years old, I plan to change the battery either Friday or Saturday. But you know what they say: "If you want to make God laugh, tell Him your plans!". I did watch the youtube that Tim provided. I don't recall seeing screws on the underside of the motherboard. I'll look again Friday or Saturday (God willing!).

Thanks, Bill. _______________________________________________ users mailing list -- users@lists.fedoraproject.org To unsubscribe send an email to users-leave@lists.fedoraproject.org

William Mattison

8:53 p.m.

I did "smartctl --all /dev/sda > smartctl_out.txt". I got over 200 lines of output. The most recent error reported in the output file is this one:

===============
Error 66 occurred at disk power-on lifetime: 13741 hours (572 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      00:05:43.747  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:05:43.746  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 a0 00      00:05:43.746  FLUSH CACHE EXT
  ef 10 02 00 00 00 a0 00      00:05:43.746  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      00:05:43.745  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
===============

I can't really make heads or tails of this. I also notice in my system e-mail these 2 messages, both on Thursday, May 25:
(1st message) Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
(2nd message, 1 minute later) Device: /dev/sda [SAT], 8 Offline uncorrectable sectors

I also tried "smartctl -t short /dev/sda", followed later by "smartctl -l selftest /dev/sda". The result:
===============
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13813         -
===============
If I understand the "--all" output correctly, the long test ("-t long") would take about 4 hours, so I'm not trying that until later this week.
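Once several self-tests have accumulated, the log can be filtered so that only entries that did not finish cleanly stand out. A minimal sketch: the helper name failed_selftests is an assumption, and it simply treats any log row not marked "Completed without error" as worth a look (that includes tests still in progress).

```shell
#!/bin/sh
# failed_selftests: read "smartctl -l selftest" output on stdin and print
# every log row (rows start with "#") whose status is not
# "Completed without error". Prints nothing when all rows are clean.
failed_selftests() {
    grep '^#' | grep -v 'Completed without error' || true
}

# The single row posted above produces no output, i.e. nothing to report.
printf '%s\n' \
    'SMART Self-test log structure revision number 1' \
    '# 1  Short offline       Completed without error       00%     13813         -' |
    failed_selftests
```

After a long test started with "smartctl -t long /dev/sda" has finished, the same filter could be applied as "smartctl -l selftest /dev/sda | failed_selftests".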

What else from the "smartctl" output should I post here? What other "smartctl" functionality should I try or use?

Thank-you. Bill.

Joseph Loo

6:49 a.m.

On 05/30/2017 09:09 PM, William Mattison wrote:

...

I wasn't fully convinced these problems are due to the battery. That's why I listed the four things I found "odd". On the other hand, I recall hearing and reading that the output of lithium batteries is almost flat (better than any other type of battery), but then very quickly drops (faster than any other type of battery) as it reaches end-of-life.

Back to diagnosing the real cause of the problems...

Is there a Fedora command that I can use to check the hard drive (not the file systems) for bad blocks, sectors, tracks, etc? Is there a Fedora command that I can use to check the controller?

Both problems occurred immediately after doing a "dnf upgrade". What is that telling us? Does "dnf upgrade" access the hard drive or the controller in a way that normal daily use does not? Is there something different about the first boot after a "dnf upgrade" vs other boots? I shut down every night, and boot up every morning.

When I bought the system 4+ years ago, I bought separate parts. This is a DIY desktop. I was advised to buy more power supply than needed. I did so. So unless the power supply is failing, I would think it's not a good candidate for the cause of the two problems. There have been no problems until this month, and I've been doing weekly patches since I got the system in 2013.

I was/am not in the mood to change the battery! Since I've already bought the new one and have no other use for it, and since the old one is 4+ years old, I plan to change the battery either Friday or Saturday. But you know what they say: "If you want to make God laugh, tell Him your plans!". I did watch the youtube that Tim provided. I don't recall seeing screws on the underside of the motherboard. I'll look again Friday or Saturday (God willing!).

Thanks, Bill.

Have you tried badblocks? If you are not careful it will wipe your disk completely. This will do a sector by sector scan.

-- Joseph Loo jloo@acm.org

William Mattison

2 Jun

8:32 p.m.

I tried badblocks last night. I didn't realize how long it would take. After over 3 hours, I had to abort it to do something else.

This morning, I retried it, this time with options to show its progress. It took between 3 1/2 and 3 3/4 hours. Here are the results:
===============
bash.3[~]: badblocks -s -v /dev/sda
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
bash.4[~]:
===============
I don't think this completely rules out the hard drive as the villain, but it's now less of a suspect. Am I correct in guessing that the non-destructive read-write option (option "-n") would take over twice as long (7 1/2 or more hours)?
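The run time above can be turned into a rough throughput figure, which helps when guessing how long other badblocks modes might take. This is a sketch under assumptions: the helper name badblocks_rate is made up, the block size is taken to be the badblocks default of 1024 bytes (no "-b" option was given), and 3.625 hours is just the midpoint of the reported range.

```shell
#!/bin/sh
# badblocks_rate LAST_BLOCK HOURS [BLOCK_SIZE]
# Print the average rate in MB/s (10^6 bytes) for one full pass, given the
# last block number that badblocks reported and the elapsed hours.
# BLOCK_SIZE defaults to 1024 bytes, the badblocks default.
badblocks_rate() {
    LC_ALL=C awk -v last="$1" -v hrs="$2" -v bs="${3:-1024}" \
        'BEGIN { printf "%.0f\n", (last + 1) * bs / (hrs * 3600) / 1000000 }'
}

# Blocks 0 to 1953514583, assumed midpoint of "3 1/2 to 3 3/4 hours".
badblocks_rate 1953514583 3.625
```

Under those assumptions the read-only pass averaged roughly 150 MB/s over about 2 TB, so any mode that reads or writes the whole disk more than once should be expected to cost a similar amount of time per additional pass.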

Thanks, Bill.

Joseph Loo

10:04 p.m.

On 06/02/2017 06:32 PM, William Mattison wrote:

...

I tried badblocks last night. I didn't realize how long it would take. After over 3 hours, I had to abort it to do something else.

This morning, I retried it, this time with options to show its progress. It took between 3 1/2 and 3 3/4 hours. Here are the results:

bash.3[~]: badblocks -s -v /dev/sda
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
bash.4[~]:
===============
I don't think this completely rules out the hard drive as the villain, but it's now less of a suspect. Am I correct in guessing that the non-destructive read-write option (option "-n") would take over twice as long (7 1/2 or more hours)?

Thanks, Bill.

It would take about 4 times longer. I believe it reads and writes with 4 different patterns. It will wipe the disk completely.

-- Joseph Loo jloo@acm.org

William Mattison

3 Jun

11:17 p.m.

According to the man page, the "-n" option is non-destructive; the "-w" option is what you described.

Regardless, it's too long.

Bill.

Louis Lagendijk

4:37 a.m.

On Sat, 2017-06-03 at 01:32 +0000, William Mattison wrote:

...

I tried badblocks last night. I didn't realize how long it would take. After over 3 hours, I had to abort it to do something else.

This morning, I retried it, this time with options to show its progress. It took between 3 1/2 and 3 3/4 hours. Here are the results:
===============
bash.3[~]: badblocks -s -v /dev/sda
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
bash.4[~]:
===============
I don't think this completely rules out the hard drive as the villain, but it's now less of a suspect. Am I correct in guessing that the non-destructive read-write option (option "-n") would take over twice as long (7 1/2 or more hours)?

Thanks, Bill.

I have not followed this thread closely, but did you check the hard disk cable? Cables are often a source of problems...

Louis

William Mattison

11:23 p.m.

While changing the motherboard battery yesterday (Friday), most cables were disconnected and then later re-connected. That included the hard drive connection to the motherboard. I also disconnected and reconnected both the power cable and the data cable where they plug in to the hard drive itself. The hard drive was removed, vacuumed, and put back into its place.

No problems since, but I haven't done a "dnf upgrade" since then.

thanks, Bill.

Tim

5 Jun

12:50 a.m.

Allegedly, on or about 03 June 2017, Louis Lagendijk sent:

...

I have not followed this thread closely, but did you check the hard disk cable? Cables are often a source of problems...

That's certainly true. Not just badly plugged in leads, but also ill-fitting connectors, and bent and folded SATA leads (there should be no abrupt angles in the SATA cable).

Tim

31 May

11:22 a.m.

Allegedly, on or about 31 May 2017, William Mattison sent:

...

I recall hearing and reading that the output of lithium batteries is almost flat (better than any other type of battery), but then very quickly drops (faster than any other type of battery) as it reaches end-of-life.

I can't say that I'm familiar with their discharge pattern, but I have read that an in-use lifespan of three years is considered normal. So, you're at the time it might be worth replacing, even if it's not the cause of current problems. At the very least, you stop this being a potential problem in another year or so.

...

Is there a Fedora command that I can use to check the hard drive (not the file systems) for bad blocks, sectors, tracks, etc? Is there a Fedora command that I can use to check the controller?

Look up S.M.A.R.T., though be aware that some controllers may not co-operate, but that tends to be things like outboard USB interfaces, or RAID. Ordinary hard drives plugged straight into the motherboard are likely to be checkable. It's the hard drive, itself, that checks its health and produces the stats; smartctl just gives you an interface.

...

Both problems occurred immediately after doing a "dnf upgrade". What is that telling us?

That you ought to try rebooting using a previous kernel, and see if problems persist.

There are two red flags about problems after doing an update:

1. That a new kernel has changed hardware drivers, or created other incompatibilities.

2. That your hard drive had some bad spots that hadn't been used before, but as you filled it up with more files (the recent downloads and installs), you hit the problem area.

Those are the two things that immediately jump to mind.

Yes, an update can be more stressful than other PC activities, for *some* users. But for other users, they're always subjecting their PC to a heavy workload, so a prolonged update session is nothing different from normal use.

...

Does "dnf upgrade" access the hard drive or the controller in a way that normal daily use does not?

I would say not. It's just files in and out, under the control of some program, onto the storage system in the usual way.

...

When I bought the system 4+ years ago, I bought separate parts. This is a DIY desktop. I was advised to buy more power supply than needed. I did so. So unless the power supply is failing, I would think it's not a good candidate for the cause of the two problems. There have been no problems until this month, and I've been doing weekly patches since I got the system in 2013.

Power supplies do fail, sometimes gradually, sometimes spontaneously combusting, sometimes just randomly glitching. It can be complete coincidence that some technical failure happens at the same time as you did something you considered more special than it merely sitting there.

I agree with the concept of getting bigger than you think you need, but it's hard to work out the criteria. Few devices specify their power requirements at all, or specify them adequately. For example, a graphics card may say it needs a 100 watt power supply. That claim may be bogus, it may be an overestimate so that you buy an adequate supply, or it may be accurate. It doesn't specify how many watts the card requires from the different supplies in your PC (12 volt, 5 volt, 3.3 volt, etc). So it could require a lot from the 12 volt supply and less from the 5 volt, and your power supply could be inadequate in one of those areas.

Then there are the power supply specs. Do they list the power it can continuously supply, or the momentary higher peaks that it can supply? And there's a similar question with the devices: do a graphics card's power requirements specify continuous draw or momentary peaks?

The momentary peaks, as something suddenly needs more power, as it turns on, or changes modes, etc., can be the kind of thing that causes enough trouble to make a system unstable.

If you have a simple system, e.g. motherboard, graphics card, hard drive, optical drive, it's not too hard to ensure you put in a sufficiently beefy supply. If you have a PC loaded with gadgets, it's harder to estimate the requirements.

But what type of power supply did you put in? Did you match the wattage your supplier said you needed, did you overcompensate by an extra 100 watts? Did you get some generic Chinese thing, or something that had a reputation?

As an opposing example: I stripped apart a friend's Mac, it has a ridiculously beefy power supply, with large fat bus bars that bolt to the motherboard, rather than those multi-pin molex connectors you see on the average PC. And that system is designed as a whole, so the manufacturer ought to know the full system specs, as opposed to a PC assembled from multiple different vendors who never collaborated.

...

I did watch the YouTube video that Tim provided. I don't recall seeing screws on the underside of the motherboard. I'll look again Friday or Saturday (God willing!).

I could see them on one of the videos, quite small silver ones, underneath the motherboard (you had to completely remove the board). But maybe they've switched to black ones, that need careful inspection to find.

I agree with the comments that ASUS made a prize design goof by burying the CMOS battery with that plating. I understand the value of covering the whole board (forcing cooling across it, making it harder for accidentally dropped things to land on exposed conductors, etc), but they should have left a way to easily access the battery.

William Mattison

9:47 p.m.

...

Look up S.M.A.R.T., though be aware that some controllers may not co-operate, but that tends to be things like outboard USB interfaces, or RAID. Ordinary hard drives plugged straight into the motherboard are likely to be checkable. It's the hard drive, itself, that checks its health and produces the stats; smartctl just gives you an interface.

Please see my reply to Rick.

...

That you ought to try rebooting using a previous kernel, and see if problems persist.

I did, and the problem showed up with all three of the latest f24 versions available in the grub menu.

...

Yes, an update can be more stressful than other PC activities, for *some* users. But for other users, they're always subjecting their PC to a heavy workload, so a prolonged update session is nothing different from normal use.

I don't understand what you're saying here. Both weekly patches went very quickly (I wish windows-7 were like that!) and with no errors reported in the output.

...

But what type of power supply did you put in? Did you match the wattage your supplier said you needed, did you overcompensate by an extra 100 watts? Did you get some generic Chinese thing, or something that had a reputation?

I did not figure out that part for myself. I got advice from a friend with decades of experience working for IBM's high performance division, and then for Cray research. The power supply is a Thermaltake TR2 600W. The system also has a Core i7-3770K @ 3.5GHz x 8, 16 GB memory, GeForce GTX 660 graphics card, an ASUS Xonar Essence STX audio card, a 2 TB hard drive, 2 blu-ray drives, keyboard, trackball, web cam (rarely plugged in), two 27-inch Dell monitors, and 2 small speakers. It's no gaming system, but a rather high-powered programming workstation by 2013 standards.

Thank-you, Bill.

Tim

1 Jun

6:21 a.m.

Tim:

...

Yes, an update can be more stressful than other PC activities, for *some* users. But for other users, they're always subjecting their PC to a heavy workload, so a prolonged update session is nothing different from normal use.

William Mattison:

...

I don't understand what you're saying here. Both weekly patches went very quickly (I wish windows-7 were like that!) and with no errors reported in the output.

You mentioned that the problems happened straight after doing a dnf update, and wondered if *that* process could have been the cause of the problems. I was pointing out that an update is no different than any other medium-duty processing the computer might do (a bit of heavy thinking when it processes dependencies, idling along as new files get downloaded, a bit of slightly heavy thinking as the packages are decompressed for a few moments before they get saved to disc).

...

But what type of power supply did you put in? Did you...

...

I did not figure out that part for myself. I got advice from a friend with decades of experience working for IBM's high performance division, and then for Cray research. The power supply is a Thermaltake TR2 600W. The system also has a Core i7-3770K @ 3.5GHz x 8, 16 GB memory, GeForce GTX 660 graphics card, an ASUS Xonar Essence STX audio card, a 2 TB hard drive, 2 blu-ray drives, keyboard, trackball, web cam (rarely plugged in), two 27-inch Dell monitors, and 2 small speakers. It's no gaming system, but a rather high-powered programming workstation by 2013 standards.

I would have thought 600 watts is more than sufficient for a general PC. If you look at what gamers do to their boxes (with their high end graphics cards and virtually a CPU farm in a box), it's staggering the amount of power that some PCs (allegedly) use. I'm sure they don't really use all that, but the short term peaks as things fire up, change modes, etc., can be a heck of a lot higher than their nominal power usage - those transients can trip up cheap and nasty supplies.

Thermaltake TR2 600W specs: maximum output capability 600 watts (no surprise, considering the model name, and I think they've got a good reputation).

ASUS Sabertooth Z77: looks nice, but I see no power specs on their site. Though I see a review of that board with your processor that suggests up to 183 watts normally, add another 100 watts if overclocked.

GeForce GTX 660 specs:
Maximum power used by the card: 140 watts
Minimum system power supply recommendation: 450 watts

Hmm, yeah, love their thinking there. Well, I suppose they're estimating the likely power requirement of the rest of your system.

ASUS Xonar Essence STX: looks nice, a card without those wonky 3.5 mm jacks, and designed for sound quality. The kind of thing I might have gone for if I were buying new parts. No power specs, but I wouldn't think it's a major power hog.

Hard drives: under 10 watts
Blu-ray drives: about 30 watts

Yes, sounds like a 600 watt supply should be fine. And your friend obviously has the background to figure that out, too.
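The estimate above can be written out as simple arithmetic. This is only a rough sketch using the figures quoted in this message. It assumes the "about 30 watts" applies to each of the two Blu-ray drives and that the 183 watt review figure covers the motherboard and CPU together. It leaves out the sound card, fans, and USB devices, because no figures were given for them. All names are made up for the example.

```python
# Rough power budget using only the figures quoted in this message.
# Assumptions (not from any datasheet): the Blu-ray figure is per drive,
# and the 183 W review figure covers motherboard and CPU together.
PSU_CAPACITY_W = 600  # Thermaltake TR2 600W, maximum output

estimated_draw_w = {
    "motherboard + CPU (review figure, not overclocked)": 183,
    "GeForce GTX 660 (maximum)": 140,
    "hard drive (upper bound)": 10,
    "Blu-ray drive 1 (approximate)": 30,
    "Blu-ray drive 2 (approximate)": 30,
}


def power_budget(draws_w, capacity_w):
    """Return (total draw, remaining headroom, fraction of capacity used)."""
    total = sum(draws_w.values())
    return total, capacity_w - total, total / capacity_w


total_w, headroom_w, load_fraction = power_budget(estimated_draw_w, PSU_CAPACITY_W)
print(f"Estimated draw: {total_w} W of {PSU_CAPACITY_W} W ({load_fraction:.0%})")
print(f"Headroom for unlisted parts and transient peaks: {headroom_w} W")
```

With these numbers the listed parts add up to 393 watts, which leaves roughly 200 watts for the unlisted parts and for short-term peaks.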

So, if it's a power problem, that might be down to a fault rather than being an insufficient supply, in general.

Noting your other messages about hard drive errors, it may be that the drive itself is failing. Unrecoverable errors don't sound good, and have never boded well for the couple of drives I had with them. Though some people say that they can carry on using a drive with such bad sectors, if there aren't many of them, and they're not increasing. Faults with "unrecoverable, uncorrectable, unreadable" types of errors are a big red flag.

The simple test is to try and write to the entire drive (which is easiest to do when wiping the entire contents, rather than filling up the space of an in-use drive), and see if that changes the error condition.

For example, if the drive couldn't read something left behind by an interrupted write (a system crash, power failure, etc.) on an undamaged portion of the disk, but can wipe and re-use that area, that suggests the drive is okay and the error was caused externally.

But if it can't write and read those sectors, with a fresh attempt, that points the finger at the drive being at fault.

There are long and short SMART self-tests that look for these kinds of faults, though they read the drive rather than rewrite it. If you can afford to wipe the drive and test it, that may be the best way forward. Your drive manufacturer's site probably has a self-booting disc image that you can burn to test your drive (Seagate and Western Digital, at least, did when I last did this in the dim and distant past).

If you can't do that, my suggestion is to buy a new hard drive, install a fresh OS onto it, and test drive your PC for a week or so.

--
[tim@localhost ~]$ uname -rsvp
Linux 3.9.10-100.fc17.x86_64 #1 SMP Sun Jul 14 01:31:27 UTC 2013 x86_64
(always current details of the computer that I'm writing this email on)

Boilerplate: All mail to my mailbox is automatically deleted, there is no point trying to privately email me, I only get to see the messages posted to the mailing list.

The mindset of software designers: You know that feature that you, and many thousands of other users, found useful? We removed it, because we didn't like it. We also hard-coded the default settings that you keep customising.

William Mattison

2 Jun

9:11 p.m.

Well, the battery has been replaced this afternoon. It took between 2 and 2 1/2 hours. The system seems to be functioning ok so far, but I haven't yet booted up in windows-7, and I haven't yet tried a "dnf upgrade".

Before I took the system apart, I checked the CMOS clock and the voltages reported by the motherboard in the UEFI BIOS display:

* CPU voltage varied, but was 0.98 +/- less than 0.01 volts.

* "3.3V Voltage" was 3.392 volts.

* "5V Voltage" was 5.040 volts.

* "12V Voltage" was 12.096 volts.

The CMOS clock seemed slightly slow compared to my "atomic" clock, but by less than 1 second. I gather none of the voltages displayed was the battery's voltage; I could not find a battery state indication in any of the BIOS displays. After the battery change was done, I checked the old battery with a battery tester. It was well in the "green range".

I agree with the criticisms about ASUS making the battery so difficult to access on this motherboard. I also found the USB 3.0 connector to be a problem. The pins were too crowded, too close to the socket wall, and too easily bent. I had to straighten out two of them, and it was difficult. Getting the plug into the socket took very careful and delicate alignment.

I hope to try the smartctl long test on the hard drive tomorrow.

thanks, Bill.

William Mattison

3 Jun

11:26 p.m.

The smartctl long test took about 4 hours (I think!). I wish it would notify me when it was actually finished! As best as I could tell (by using "smartctl -l error /dev/sda"), it found no problems.
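One possible way to get a notification is to poll the drive until it stops reporting a running self-test. The following is a minimal sketch, not a tested tool. It assumes an ATA drive whose "smartctl -c" output contains the phrase "Self-test routine in progress" while a test is running; other drive types or smartctl versions may word this differently. The function names are invented for the example.

```python
import subprocess
import time

# Assumption: on ATA drives, "smartctl -c" reports a running self-test with
# this phrase; other drive types or smartctl versions may word it differently.
IN_PROGRESS_MARKER = "Self-test routine in progress"


def read_capabilities(device):
    """Return the text printed by 'smartctl -c DEVICE' (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-c", device],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def self_test_running(capabilities_text):
    """True while the capabilities text says a self-test is still running."""
    return IN_PROGRESS_MARKER in capabilities_text


def wait_for_self_test(device, read=read_capabilities, sleep=time.sleep,
                       poll_seconds=300):
    """Block until no self-test is reported as running; return the poll count."""
    polls = 1
    while self_test_running(read(device)):
        sleep(poll_seconds)
        polls += 1
    return polls
```

After starting the test with "smartctl -t long /dev/sda", calling wait_for_self_test("/dev/sda") as root should return once the drive no longer reports a test in progress. The outcome of the test itself is recorded in the self-test log, which "smartctl -l selftest /dev/sda" prints; "smartctl -l error" shows the drive's separate error log.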

thanks, Bill.

Tim

5 Jun

12:57 a.m.

Allegedly, on or about 03 June 2017, William Mattison sent:

...

Before I took the system apart, I checked the CMOS clock and the voltages reported by the motherboard in the UEFI BIOS display:

CPU voltage varied, but was 0.98 +/- less than 0.01 volts.

"3.3V Voltage" was 3.392 volts.

"5V Voltage" was 5.040 volts.

"12V Voltage" was 12.096 volts.

Nothing to worry about with those fractional differences.
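To put a number on "fractional differences", the readings can be compared with the +/-5% tolerance that the ATX specification allows on the positive supply rails. This is a minimal sketch, assuming that tolerance applies here and that the BIOS readings are accurate; the function and variable names are invented for the example.

```python
# Compare the BIOS voltage readings quoted above with a +/-5% tolerance.
# Assumption: the ATX +/-5% limit for the 3.3 V, 5 V, and 12 V rails applies.
TOLERANCE = 0.05

readings_v = {  # nominal volts -> measured volts (from the BIOS display)
    3.3: 3.392,
    5.0: 5.040,
    12.0: 12.096,
}


def deviation(nominal_v, measured_v):
    """Signed deviation from nominal, as a fraction (0.01 means +1%)."""
    return (measured_v - nominal_v) / nominal_v


def within_tolerance(nominal_v, measured_v, tolerance=TOLERANCE):
    """True if the measured voltage is within the allowed band."""
    return abs(deviation(nominal_v, measured_v)) <= tolerance


for nominal_v, measured_v in readings_v.items():
    status = "ok" if within_tolerance(nominal_v, measured_v) else "OUT OF RANGE"
    print(f"{nominal_v:>4} V rail: {measured_v:.3f} V "
          f"({deviation(nominal_v, measured_v):+.1%}) {status}")
```

All three readings fall inside the band; the largest deviation is a little under 3% on the 3.3 volt rail.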

...

The CMOS clock seemed slightly slow compared to my "atomic" clock, but by less than 1 second.

Again, probably nothing to worry about. You may simply have a less accurate clock than other motherboards. They're comparable to a cheap wristwatch. Or it could be that the synchronisation routines on your installation aren't regularly poking it into sync.

...

I gather none of the voltages displayed was the battery's voltage; I could not find a battery state indication in any of the BIOS displays.

If it was going to be anywhere, it'd be the same place as the other voltages. BIOSes often have some kind of "system health" page where various voltages, temperatures, and fan speed monitors show their results. They're usually all lumped together.

...

After the battery change was done, I checked the old battery with a battery tester. It was well in the "green range".

Probably okay. I'm never too trusting of those testers, though. For one thing, a battery needs to be tested under a load, but it's hard to make a simple tester that produces a suitable load for all batteries. Though at least you get some kind of indication between good and flat.

William Mattison

8 Jun

9:51 p.m.

I did my weekly patches this afternoon, and this time the system booted up fine. So I'm back to what caused the problems.

* Motherboard battery? Quite unlikely, but not 100% certain. Battery replaced anyway.

* Hard drive? Somewhat unlikely. Two 4-hour non-destructive disk checks found no issues. System cleaned; cables dis- and re-connected; hard drive removed and put back in; no kinky cables seen. Destructive testing and replacing the hard drive are not options for me at this time. Circumstances suggest such would be over-kill.

* Somehow caused by the "dnf upgrade"? I can't assess this. After the second failure (May 25), I backed up all user data, and then upgraded from f24 to f25. I did not see any problems. This afternoon's patches were f25; the failures were f24. So I can no longer test whether f24 patching is at fault. But if it were, I'd be surprised if I were the only person to be hit by it. So my leaning is that it wasn't the patching that caused the problems.

* Power supply? Somewhat unlikely. I know of no way to test this. But Tim's analysis and other circumstances suggest it's not worth pursuing this possibility any further.

Two questions:

1. Are there any other theories I should consider?

2. Should I submit a bugzilla? (If yes, against what?)

thanks, Bill.

Louis Lagendijk

9 Jun

12:06 p.m.

On Fri, 2017-06-09 at 02:51 +0000, William Mattison wrote:

...

I did my weekly patches this afternoon, and this time the system booted up fine. So I'm back to what caused the problems.

Motherboard battery? Quite unlikely, but not 100% certain. Battery replaced anyway.

Hard drive? Somewhat unlikely. Two 4-hour non-destructive disk checks found no issues. System cleaned; cables dis- and re-connected; hard drive removed and put back in; no kinky cables seen. Destructive testing and replacing the hard drive are not options for me at this time. Circumstances suggest such would be over-kill.

Somehow caused by the "dnf upgrade"? I can't assess this. After the second failure (May 25), I backed up all user data, and then upgraded from f24 to f25. I did not see any problems. This afternoon's patches were f25; the failures were f24. So I can no longer test whether f24 patching is at fault. But if it were, I'd be surprised if I were the only person to be hit by it. So my leaning is that it wasn't the patching that caused the problems.

Power supply? Somewhat unlikely. I know of no way to test this. But Tim's analysis and other circumstances suggest it's not worth pursuing this possibility any further.

Two questions:

Are there any other theories I should consider?

As I said before: hard disk cable. I have seen SATA cables fail. Or, rather than the cable itself, it may have been a bad contact that was repaired when you re-seated the cable.

...

Should I submit a bugzilla? (If yes, against what?)

No, this will not help if you don't know how to reproduce the fault, especially as this quite possibly was a hardware error solved by re-seating the cable.

...

Louis

William Mattison

11:24 p.m.

I think you're probably right on both counts. I thought so before my Thursday night post, but really thought it best to check with the experts.

thanks, Bill.

Tim

3:13 p.m.

Allegedly, on or about 09 June 2017, William Mattison sent:

...

Hard drive? Somewhat unlikely. Two 4-hour non-destructive disk checks found no issues. System cleaned; cables dis- and re-connected; hard drive removed and put back in; no kinky cables seen. Destructive testing and replacing the hard drive are not options for me at this time. Circumstances suggest such would be over-kill.

Still possible. The unrecoverable errors may be on some part of the drive that you're not reading files from, or they may not. Those errors could have been caused by the drive (surface faults, firmware faults), or external factors (fixed by reseating a cable, caused by some random glitch that hasn't happened again, etc).

...

Somehow caused by the "dnf upgrade"?

I'd be surprised if a changed version of some software caused that kind of error. Not surprised if the action of writing data (any file, no matter what it was) to new areas of a drive could find a previously undetected fault.

In the past, we used to format and check drives before installing, to discover these little nasties. These days, mostly thanks to drives being huge and checks taking forever, the checking step gets omitted (it's no longer an option in the installer).

...

Power supply? Somewhat unlikely. I know of no way to test this. But Tim's analysis and other circumstances suggest it's not worth pursuing this possibility any further.

Still possible to be a power supply problem. Power supplies can go bad. They can work normally under certain loads, then fail as loads increase (e.g. heavier CPU work). They can randomly glitch; switch-mode power supplies are hardly the most reliable design.

A perfectly fine power supply could be glitched by external factors, such as mains power brown-outs, other equipment starting up (fridges, air-conditioners).

You're really only going to find the true cause if you can make it happen again.

Going from what I remember of the thread, the main problem was due to some unrecoverable read error on the drive. At some stage the errors were probably in an area of the drive that was read at boot-up. Sometimes a drive can eventually recover from them (it does try over and over), but usually it would have recovered straight away, if it could. If the drive still has those bad sections, and they can't be cleared by rewriting to the drive, the drive is probably the issue.

If you have a spare drive, I'd put it in and give the old one a thorough test.

William Mattison

11:19 p.m.

It's believed that the main problems were i-node problems identified by "fsck" during boot. The first time, they were on sda6; the second time, they were on sda7.

A few follow-up questions about the hard drive... I used the long but non-destructive test options of both "badblocks" and "smartctl". They each scan the entire hard drive, right? If "dnf upgrade" were writing to new areas of the disk, and those areas were bad,

1. those writes would have failed, and in turn have caused the "dnf upgrade" to fail, right?

2. the "smartctl" and "badblocks" tests would have found and reported those bad areas, right?

I only have one system, and only one hard drive, and no money to buy.

The hardware work was done between fixing the second occurrence of the problem and the upgrade from f24 to f25. That upgrade would have done a lot more disk writing (and reading?) than did the two f24 weekly patches that preceded the boot failures. This suggests - merely suggests - to me that the hardware work (re-doing cable connections, cleaning, etc.) fixed the problem rather than it being a problem within the hard drive.

thanks, Bill.

William

13 Jun

10:43 a.m.

New subject: post-mortem: f24 boot fails; need help. [CLOSED]

Good morning,

I'm closing this thread. I think we've done what we realistically can to determine what caused the fsck errors that led to the two boot failures. I certainly learned a few things along the way. I've also discovered a few things I should be doing that I wasn't doing before.

I thank each of you who tried to help for your time and effort; you were a good help. Bill.

On 05/24/2017 09:38 AM, William wrote:

...

Good morning,

The "f24 boot fails; need help" problem set me back a week. I'm still catching up. I seriously believe it would be foolish for me to just forget it. I should for the benefit of others try to get at the real cause and possible prevention.

A few hours before the failure, I received and looked at an e-mail that I'm almost certain was at least a spoof, and possibly malicious. I know it contained html and links. I did *** not *** click any of the links. I looked at it, and deleted it. It was viewed in Thunderbird only. The message's "From" ended with "yahoo.com". My question: It is highly improbable that that message had anything to do with the boot failure. Am I correct?

Also a few hours before the failure, I did some web browsing using Firefox with NoScript and uBlock Origin. As best as I recall, the "riskiest" sites that I visited were finance.yahoo.com (and a few of its sub-pages, I clicked no ads, no ad links) and indeed.com (possibly and a posting or two). My question: It is highly improbable that my web browsing had anything to do with the boot failure. Am I correct?

thanks, Bill.


users@lists.fedoraproject.org


participants (9)

Joe Zeff
Joseph Loo
Louis Lagendijk
Rick Leir
Rick Stevens
Sam Varshavchik
Tim
William
William Mattison