---
title: "Hardware issues summer investigations write-up"
date: 2019-07-28
url: hardware-issues-summer-investigations-write-up
layout: post
category: Articles
image: /img/blog/hardware-issues-summer-investigations-write-up_1.jpg
description: "When you look very hard for a weird issue that you are not even the cause of..."
---

[![A missing blog post image](/img/blog/hardware-issues-summer-investigations-write-up_1.jpg)](/img/blog/hardware-issues-summer-investigations-write-up_1.jpg)

### Introduction

It's summer, the [heat is real](https://www.pressenza.com/2019/07/france-records-hottest-temperature-ever-in-european-heat-wave/), and IT is having a hard time.
I've heard that many of us lost some hard disks (**DO BACKUPS**), but this post is about graphics cards.

As you may already know, I've got [a pretty old desktop setup](https://mysetup.co/setups/1348948207-Home-Setup), but ecology is also about not replacing, every 6 months, tools that still work :sweat_smile:

Let's debrief the past few weeks, which I spent failing to understand a very weird issue that occurred more and more often, but **not systematically**.

### Symptoms

The symptoms were rather simple: **sometimes**, the screen just went black and the PC hard rebooted.

On rare occasions, some graphical glitches showed up just before that.
A few weird artifacts at first, then more and more of them until the screen was unreadable, and at that very moment it went dark.

I quickly understood that it happened mostly when the PC was working hard, typically when a LoL game was loading (enjoy uncapped FPS on splash screens in 2019), or when loading a bloated Web application.
As you have surely guessed, that cost me MANY LP (League Points) in ranked lately, so I got demoted hard. But that's a detail.

### Diagnostic

#### Hard drives

As one of the hard disks (RAID) was making a slightly weird noise, I suspected it first.
So I installed [Smartmontools](https://www.smartmontools.org/) and ran some quick checks, at least to verify the S.M.A.R.T. data.

Everything looked OK, so I moved on to memory.
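
That kind of quick check can be scripted. Here is a minimal Python sketch, assuming `smartctl` (shipped with Smartmontools) is on the `PATH` and that the drive is an ATA one printing the usual `SMART overall-health self-assessment test result` line. The helper names and the `/dev/sda` path in the closing comment are illustrative assumptions, not the exact commands used for this post:

```python
import re
import subprocess
from typing import Optional

# Verdict line printed by `smartctl -H` for ATA drives.
# Assumption: other drive types may word this line differently.
HEALTH_LINE = re.compile(r"SMART overall-health self-assessment test result:\s*(\w+)")


def parse_health(report: str) -> Optional[str]:
    """Return the overall health verdict (e.g. "PASSED") found in a smartctl report."""
    match = HEALTH_LINE.search(report)
    return match.group(1) if match else None


def check_drive(device: str) -> Optional[str]:
    """Run `smartctl -H` on a device (usually needs admin rights) and parse the verdict."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True, check=False
    )
    return parse_health(result.stdout)

# Example (hypothetical device path, adapt it to your system):
# check_drive("/dev/sda")
```

A `None` result only means the verdict line was not found, so it is worth reading the raw report before drawing any conclusion.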

#### Memory

The unpredictable, hard-to-reproduce nature of the problem made me think of faulty memory sticks.
For this, I used the incredible [Memtest](https://www.memtest86.com/) tool, but some false positives cost me a few hours of extensive testing.

As their latest version is compatible with UEFI only, I had to fall back on their previous "stable" one, which is pretty buggy (why on Earth would someone remove regular legacy BIOS support?? :thinking:).

So, once I had manually removed the 10th test from the suite, I concluded that it **was not** a memory-related issue. That was not really good news, as I had run out of leads...

### Thanks _Windows_

> Are you mad?

Well, it may look like it, but I'm not.
_Windows_ (for once) helped me a lot on this: after each crash/reboot, its "Find and repair errors" tool indicated what had happened, and it was... BSODs!

> Whut? Didn't you see them? It's pretty explicit...

Actually, the system never displayed any blue screen, maybe 'cause **it had been about the graphics card since the beginning**?

The _Windows_ reporting tool indicated two kinds of BSOD causes:

* BCCode : 0x116 \[VIDEO_TDR_ERROR\]

* BCCode : 0x0C5 \[DRIVER_CORRUPTED_EXPOOL\] (If I remember correctly)

When the `0x116` popped up, I decided to focus my investigations on the graphics card.
I landed on [this page](https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x116---video-tdr-error), and I have to admit that the new _Micro$oft_ documentation is pretty clear and straightforward (when their website is not down, though).
So it looks like the GPU couldn't do its job in time for whatever reason (disclaimer: I'm not a hardware expert at all), and the kernel ended up raising those fatal errors.
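
To keep track of what such codes mean while digging through crash reports, a tiny lookup can help. This is only a sketch: the table holds just the two codes quoted above (with the names as written in this post), and `describe_bug_check` is an illustrative helper name, not a Windows API:

```python
# Only the two bug check codes mentioned in this post, named as quoted above.
BUG_CHECKS = {
    0x116: "VIDEO_TDR_ERROR",
    0x0C5: "DRIVER_CORRUPTED_EXPOOL",
}


def describe_bug_check(raw_code: str) -> str:
    """Turn a hexadecimal code such as "0x116" or "c5" into a readable label."""
    code = int(raw_code, 16)  # base 16 accepts an optional "0x" prefix
    name = BUG_CHECKS.get(code, "unknown to this table")
    return f"0x{code:03X} [{name}]"
```

Any other code falls back to the "unknown to this table" label rather than guessing a name.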

### Not thanks to AMD

I (currently) own a [Sapphire](http://www.sapphiretech.com/) [Radeon HD 7770](https://en.wikipedia.org/wiki/Radeon_HD_7000_Series), GHz Edition (it's actually over-clocked, from 1GHz to 1.15GHz).
Its chipset was later re-branded as part of the [R7 200 Series](https://en.wikipedia.org/wiki/AMD_Radeon_Rx_200_series). Pretty old device, isn't it? :sunglasses:

This hardware is actually driven by an AMD program, installed as `AMD Settings` on your disk, which replaces the old `ATI CCC` (`Catalyst Control Center`).
And, what a coincidence, [a brand new version (Adrenalin) has been available for a few weeks now](https://www.amd.com/fr/technologies/radeon-software)!

The client is pretty cool, the information is relatively clear, and once you know that the advanced hardware tweaks are under `Gaming > Global Settings > Global OverDrive`, you're good to go.

> By the way: if you wanna disable the [Enhanced Sync](https://www.amd.com/en/technologies/radeon-software-enhancedsync) feature ~~that may break your in-game experience~~, it's under `Gaming > Global Settings > Wait for Vertical Refresh` (a pretty well-hidden option, in my honest opinion) :wink:

Looking at the graphs there, I quickly got it: the GPU was not being cooled as it should, and its temperature could easily reach \~70°C in game :roll_eyes:

[![A missing blog post image](/img/blog/hardware-issues-summer-investigations-write-up_2.png)](/img/blog/hardware-issues-summer-investigations-write-up_2.png)

At this point, I wondered whether the lack of cooling could be the source of the issues, and... it definitely looks like it.

So my final _workaround_ was to lower the GPU and GPU memory clock frequencies back to 1000MHz, as well as unlocking the fan speed control, which allowed me to **drastically raise it** (I manually turned it up to \~70%).
Still, how the hell is a graphics card that is sold as over-clockable already over-clocked by default? :fearful:

> "Power to the users!" :punch:

Something else I discovered lately: when your AMD driver crashes (yeah, that happens too...), you may encounter something like _"Default settings have been restored [...] due to [...] WattMan [...] crash"_ at the next reboot.
This also means that **your previous hardware settings** have been reset.
So... I'd advise you to export your settings to JSON (`Preferences > Export Settings...`), so that you can import them back as soon as needed, before running a resource-hungry program.
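
To check whether a crash silently reset anything, two exports can be compared. A minimal Python sketch, assuming only that the exported files are valid JSON. It does not rely on any particular key name, and the file names in the closing comment are made up:

```python
import json


def flatten(node, prefix=""):
    """Flatten nested JSON objects/arrays into {"path.to.key": value} pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def diff_settings(before, after):
    """Return {path: (old, new)} for every value that differs between two exports."""
    old, new = flatten(before), flatten(after)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(old.keys() | new.keys())
        if old.get(path) != new.get(path)
    }


def diff_files(before_path, after_path):
    """Load two exported JSON files and compare their contents."""
    with open(before_path, encoding="utf-8") as f_before:
        before = json.load(f_before)
    with open(after_path, encoding="utf-8") as f_after:
        after = json.load(f_after)
    return diff_settings(before, after)

# Example (hypothetical file names):
# diff_files("settings-backup.json", "settings-after-crash.json")
```

An empty result means both exports hold the same values, so nothing needs to be imported back.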

### Addendum (for a little laugh)

Before getting to the conclusion, one question for you: do you know of any software in 2019 that, once its self-updating process is finished, opens your browser to load a specific (metrics) page?

`https://subscriptions.amd.com/driverinstalled/index.html?pt=radeon&VID1=XXXX&DID1=YYYY&PID1=AMD Radeon Graphics Processor&SSVID1=ZZZZ&SSID1=TTTT&os=YOUR_OS&osbit=64&cpu=YOUR_CPU_MODEL_PARTLY_URL_ENCODED`

That's a pretty strange way of performing analytics, isn't it? :joy:
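
To see what such a URL actually carries, its query string can be unpacked with Python's standard library. A small sketch, using the placeholder URL above as-is (the `XXXX`-style values are the redactions from this post, not real identifiers):

```python
from urllib.parse import parse_qsl, urlsplit

# Placeholder URL from above; the upper-case values are redactions, not real data.
URL = (
    "https://subscriptions.amd.com/driverinstalled/index.html"
    "?pt=radeon&VID1=XXXX&DID1=YYYY&PID1=AMD Radeon Graphics Processor"
    "&SSVID1=ZZZZ&SSID1=TTTT&os=YOUR_OS&osbit=64&cpu=YOUR_CPU_MODEL_PARTLY_URL_ENCODED"
)

# Split off the query string and decode it into a {name: value} mapping.
params = dict(parse_qsl(urlsplit(URL).query))

# One parameter per line, to make the reported fields easy to read.
for name, value in params.items():
    print(f"{name:>6} = {value}")
```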

I even wonder why they would need _jQuery_ and almost four thousand lines of third-party JavaScript to display such content :thinking:

[![A missing blog post image](/img/blog/hardware-issues-summer-investigations-write-up_3.png)](/img/blog/hardware-issues-summer-investigations-write-up_3.png)

### Conclusion

**TL;DR**: Don't trust AMD's Radeon Software, which is apparently unable to determine the GPU fan speed required for proper cooling.

I'm looking forward to reading any feedback that you would like to share with the world :earth_africa:

Bye :wave:

> Post header image made by [Artiom Vallat](https://unsplash.com/@virussinside) and shared on [Unsplash](https://unsplash.com/photos/1uBCLmu5BqA).