How To Check Your Hard Drive Health With S.M.A.R.T.
Article By alexander
Note: This SMART hard drive health monitoring tutorial applies only to physical servers, not to VPS (Virtual Private Servers). Physical servers have dedicated physical hardware.
Introduction to SMART
If you're seeing strange behavior from your server, chances are it has to do with your hard drives. You may be experiencing any of the following symptoms:
- software doesn't work properly
- you notice lots of errors in the logs
- database tables get corrupted too often
- unexpectedly high system load
- loss of network traffic
... just to name a few. When that happens, one of the first things to do is check the health of your hard drives. There are many ways to test disk health in Linux, but this tutorial uses the most common one, SMART.
S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), often written as SMART, is a monitoring system built into hard disk drives (HDDs) and solid-state drives (SSDs). SMART detects and reports indicators of the health and reliability of your drives, with the intent of warning you before a drive fails. We can use smartmontools to monitor our storage systems using SMART. It is supported on all modern operating systems (Linux, FreeBSD, Windows). smartmontools contains two utility programs, smartctl and smartd. In many cases, these utilities will provide advance warning of disk degradation and failure. smartmontools detects problems with:
- temperature
- mechanical components
- damage (physical and logical errors)
- electrical components
Install It
yum -y install smartmontools
Run It
To see how many hard drives are in your system, you can run:
blkid
[root@jupiter]# blkid
/dev/sda3: LABEL="SWAP-sda3" TYPE="swap"
/dev/sda2: UUID="42a78336-f545-423e-8c79-7b50e024a5b4" TYPE="ext3"
/dev/sda1: UUID="16624408-84c1-460c-8663-09d4860e6f21" TYPE="ext3"
/dev/md0: UUID="16624408-84c1-460c-8663-09d4860e6f21" TYPE="ext3"
/dev/md1: UUID="42a78336-f545-423e-8c79-7b50e024a5b4" TYPE="ext3"
or
export LANG=en_EN; fdisk -l 2>/dev/null| egrep 'Disk /dev/[sh]d'
[root@jupiter]# export LANG=en_EN; fdisk -l 2>/dev/null| egrep 'Disk /dev/[sh]d'
Disk /dev/sda: 250.0 GB, 250059350016 bytes
As you can see, this server has only one hard drive, /dev/sda.
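On a server with several drives it can help to turn the fdisk output into a plain list of device names. The following is a minimal sketch, assuming the "Disk /dev/sdX: ..." line format shown above; list_disks is a hypothetical helper name, not part of fdisk or smartmontools.

```shell
# Sketch: print the whole-disk device names found in `fdisk -l` output.
# list_disks is a hypothetical helper; it reads fdisk output on stdin.
list_disks() {
    # Lines of interest look like: "Disk /dev/sda: 250.0 GB, 250059350016 bytes"
    # Field 2 is "/dev/sda:", so strip the trailing colon before printing.
    awk '$1 == "Disk" && $2 ~ /^\/dev\/[sh]d[a-z]+:$/ {
        sub(/:$/, "", $2)
        print $2
    }'
}

# Usage on a real server (needs root):
#   fdisk -l 2>/dev/null | list_disks
```

Software RAID devices such as /dev/md0 are deliberately skipped, because SMART data comes from the physical drives underneath them.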
Enable SMART
To see if your hard drive supports SMART, run:
smartctl -i /dev/sda
Enable it if it is disabled:
smartctl -s on /dev/sda
If SMART is disabled, smartctl commands report it with a message like this:
[root@proxy]# smartctl -A /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net
SMART Disabled. Use option -s with argument 'on' to enable it.
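The check-then-enable step can be scripted. This is a rough sketch, assuming the "SMART support is: ..." and "SMART Disabled." lines look the way they do in the outputs in this article; smart_state is a hypothetical helper name.

```shell
# Sketch: classify SMART support from smartctl output read on stdin.
# smart_state is a hypothetical helper; it prints enabled, disabled, or unknown.
smart_state() {
    awk '
        /^SMART support is:[ \t]*Enabled/  { state = "enabled" }
        /^SMART support is:[ \t]*Disabled/ { state = "disabled" }
        /^SMART Disabled/                  { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Usage on a real server (needs root):
#   if [ "$(smartctl -i /dev/sda | smart_state)" = disabled ]; then
#       smartctl -s on /dev/sda
#   fi
```

An "unknown" result usually means the output did not contain either line, for example because the device does not support SMART at all, so it is worth reading the full smartctl -i output in that case.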
Get All The Disk Info.
To get all the information about your hard disks, run:
smartctl -a /dev/sda
To show short output (only the attribute values), run:
smartctl -A /dev/sda
NOTE: Both commands produce long output. The most important things to look for are the attributes that show the "health" of the drive:
- 5 Reallocated_Sector_Ct
- 196 Reallocated_Event_Count
- 197 Current_Pending_Sector
Each vendor has its own recommendations for these values. Detailed information about all attributes can be found in the Wikipedia article on S.M.A.R.T. attributes.
Good values for these attributes are 0. #5 and #196 may have nonzero values (for example, 55) and the disk will still work fine; just monitor them to make sure the values don't increase. (HINT: check them before and after running a long test, or from time to time using a monitoring system such as Zabbix or Nagios.) For most hosting companies, a value of 25 is enough for them to replace the disk. #197 is the most critical value: if it shows 1 or more, you should replace the disk as soon as possible! Let's look at a real example below.
[root@jupiter]# smartctl -a /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD251HJ
Serial Number:    S13QJ90S205235
Firmware Version: 1AC01113
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Sep 16 00:06:01 2015 EEST
==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 248) Self-test routine in progress...
                                        80% of test remaining.
Total time to complete Offline
data collection:                 (3297) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  56) minutes.
Conveyance self-test routine
recommended polling time:        (   7) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   094   094   011    Pre-fail  Always       -       2730
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       65
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9587
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       15404
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       65
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2
184 Unknown_Attribute       0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   080   069   000    Old_age   Always       -       20 (Lifetime Min/Max 19/23)
194 Temperature_Celsius     0x0022   079   067   000    Old_age   Always       -       21 (Lifetime Min/Max 19/25)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       17016028
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       6
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 80%     15404         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
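The rules of thumb given earlier for attributes 5, 196, and 197 can be sketched as a small filter over the attribute table. This is only an illustration under assumptions: it treats the tenth column as RAW_VALUE, as in the table above, and check_reallocation with its OK/WATCH/CRITICAL labels is a hypothetical helper, not a smartmontools feature.

```shell
# Sketch: apply the article's rules of thumb to attributes 5, 196 and 197.
# check_reallocation is a hypothetical helper; it reads smartctl output on stdin.
check_reallocation() {
    awk '
        # Attribute rows have at least 10 columns; this also skips the short
        # "5 0 0 Not_testing" row of the selective self-test table.
        NF >= 10 && ($1 == 5 || $1 == 196 || $1 == 197) {
            raw = $10 + 0
            if ($1 == 197 && raw > 0) verdict = "CRITICAL"  # pending sectors: replace
            else if (raw > 0)         verdict = "WATCH"     # nonzero: monitor for growth
            else                      verdict = "OK"
            printf "%s %s raw=%d %s\n", $1, $2, raw, verdict
        }
    '
}

# Usage on a real server (needs root):
#   smartctl -A /dev/sda | check_reallocation
```

For the healthy disk above, all three lines would end in OK. A WATCH line is not a failure by itself; it means the value should be compared against later runs.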
Run A "long test"
There are a few options for testing, but we will use the "long test" because it's the most useful. Run:
smartctl -t long /dev/sda
Note: The test time depends on the size of your disk. Progress is shown as the percentage of the test remaining, counting down from 100% to 0%.
[root@jupiter]# smartctl -t long /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 56 minutes for test to complete.
Test will complete after Wed Sep 16 00:58:04 2015
Use smartctl -X to abort test.
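To follow the countdown without reading the whole report each time, the "% of test remaining" line can be extracted. A minimal sketch, assuming that line appears on its own line as in real smartctl output; selftest_remaining is a hypothetical helper name.

```shell
# Sketch: print the percentage of a running self-test that is still remaining.
# selftest_remaining is a hypothetical helper; it reads smartctl output on stdin
# and prints nothing when no test is in progress.
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*$/\1/p'
}

# Usage on a real server (needs root): wait until the test is no longer running.
#   while [ -n "$(smartctl -a /dev/sda | selftest_remaining)" ]; do
#       sleep 300
#   done
```

Polling every few minutes is enough here, since the drive only updates the remaining percentage in steps of 10%.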
While a test is "in progress", you can follow its status in the smartctl -a output shown earlier. After the test has finished, check the "Status" column of the self-test log and the "Attributes" table for the information you need. Below is an example of a normal disk:
[root@jupiter]# smartctl -a /dev/sda
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   094   094   011    Pre-fail  Always       -       2730
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       65
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9587
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       15540
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       65
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2
184 Unknown_Attribute       0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   076   069   000    Old_age   Always       -       24 (Lifetime Min/Max 19/25)
194 Temperature_Celsius     0x0022   075   067   000    Old_age   Always       -       25 (Lifetime Min/Max 19/26)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       118939301
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       6
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     15408         -
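Checking the "Status" of the most recent test can also be scripted. The sketch below assumes the newest log entry is the line starting with "# 1", as in the log above; last_selftest_status and its passed/running/failed labels are hypothetical.

```shell
# Sketch: summarize the newest entry of the SMART self-test log.
# last_selftest_status is a hypothetical helper; it reads smartctl output on
# stdin and prints passed, running, or failed (nothing if the log is empty).
last_selftest_status() {
    awk '/^# *1 / {
        if (/Completed without error/) print "passed"
        else if (/in progress/)        print "running"
        else                           print "failed"
        exit
    }'
}

# Usage on a real server (needs root):
#   smartctl -a /dev/sda | last_selftest_status
```

Anything other than "Completed without error" or "in progress" is reported as failed here, which is a cautious simplification; an aborted or interrupted test would also land in that bucket, so read the full log line before drawing conclusions.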
Hints
The following command will display all the important values in a short list:
smartctl -a -d sat /dev/sda | egrep '^(Serial|Device M| *5 |196|197|User|# [1-9])|[1-9]0%'
[root@jupiter]# smartctl -a -d sat /dev/sda | egrep '^(Serial|Device M| *5 |196|197|User|# [1-9])|[1-9]0%'
Device Model:     SAMSUNG HD251HJ
Serial Number:    S13QJ90S205235
User Capacity:    250,059,350,016 bytes
                                        80% of test remaining.
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
# 1  Extended offline    Self-test routine in progress 80%     15404         -
If the long test finds a bad sector, the self-test log reports a read failure like this:
...
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     15849         151732779   <<---- (this one)
# 2  Extended offline    Completed without error       00%     13679         -
...
To confirm that the hard drive should be replaced, take note of the LBA_of_first_error value and try to read that sector directly:
dd if=/dev/sdN of=/dev/null skip=SECTOR count=1
dd if=/dev/sda of=/dev/null skip=151732779 count=1
dd: /dev/sda: Input/output error
0+0 records in
0+0 records out
Note: Given the information above, any additional load on the disk may damage it further! This information is enough to have your hosting provider or support staff replace the disk!
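The sector number for the dd check can be pulled out of the self-test log instead of being copied by hand. This is a sketch under assumptions: it takes the last column of a log entry as LBA_of_first_error, as in the log format above, and it relies on dd's default 512-byte block size matching the drive's logical sector size. first_error_lba is a hypothetical helper name.

```shell
# Sketch: print the LBA_of_first_error of the newest failed self-test.
# first_error_lba is a hypothetical helper; it reads smartctl output on stdin
# and prints nothing when no log entry reports an error LBA.
first_error_lba() {
    # Log entries start with "# N"; a healthy entry ends in "-", a failed
    # one ends in the sector number.
    awk '/^# *[0-9]+ / && $NF ~ /^[0-9]+$/ { print $NF; exit }'
}

# Usage on a real server (needs root). A single one-sector read adds very
# little load, but keep the note above in mind on a failing disk.
#   lba=$(smartctl -a /dev/sda | first_error_lba)
#   [ -n "$lba" ] && dd if=/dev/sda of=/dev/null skip="$lba" count=1
```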
Sometimes the "long test" may hang for a long time (for example, at 10%). In this case, you may want to check the disk for problems another way.
You can use a utility like dd, but remember that this test will take a long time. dd reads your drive block by block, looking for errors.
A helpful one-line command (set DISK to your device name, for example sda):
DISK=sda; dd if=/dev/$DISK of=/dev/null bs=128K > /root/dd_$DISK.log 2>&1 &
When it has finished (check with ps aux | grep 'dd '), just look at the results:
cat /root/dd_*.log
Conclusion
This is not everything there is to know about SMART, but after reading this tutorial you have seen the most common situations with real-world examples. You now have enough knowledge to check the health of your disks. More detailed information can be found in the Wikipedia article on SMART and on the smartmontools website. To keep your data safe, use software RAID and always make backups!
Tags: education, disk, HDD, health, smart, smartctl
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.