Note: This SMART Hard Drive Health Monitoring tutorial applies only to physical servers, not to VPS (Virtual Private Servers). Physical servers have dedicated physical hardware.


Introduction to SMART

If you’re seeing strange behavior from your server, chances are it has to do with your hard drives. You may be experiencing any of the following symptoms:

  • software doesn’t work properly
  • lots of errors in the logs
  • database tables get corrupted too often
  • unexpectedly high system load
  • loss of network traffic

… Just to name a few.

When that happens, one of the first things to do is check the health of your hard drives. There are many ways to test disk health in Linux, but for this tutorial we will use the most common one, SMART.

S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), often written as SMART, is a monitoring system built into hard disk drives (HDDs) and solid-state drives (SSDs). SMART detects and reports indicators of the health and reliability of your drives, with the intent of notifying you before a drive fails.

We can use smartmontools to monitor our storage systems using SMART. It is supported on all modern operating systems, such as Linux, FreeBSD and Windows.

smartmontools contains two utility programs, smartctl and smartd. In many cases, these utilities provide advance warning of disk degradation and failure.

Smartmontools detects problems with:

  • temperature
  • mechanics
  • damage (physical and logical errors)
  • electrical parts

Now that you’ve learned what smartmontools does, let’s get it installed and see how to use it. Here are the most common and useful smartmontools commands.


Install It

yum -y install smartmontools

Run It

To see how many hard drives are in your system you can run:

blkid

[root@jupiter]# blkid
 /dev/sda3: LABEL="SWAP-sda3" TYPE="swap"
 /dev/sda2: UUID="42a78336-f545-423e-8c79-7b50e024a5b4" TYPE="ext3"
 /dev/sda1: UUID="16624408-84c1-460c-8663-09d4860e6f21" TYPE="ext3"
 /dev/md0: UUID="16624408-84c1-460c-8663-09d4860e6f21" TYPE="ext3"
 /dev/md1: UUID="42a78336-f545-423e-8c79-7b50e024a5b4" TYPE="ext3"

or

export LANG=en_EN; fdisk -l 2>/dev/null| egrep 'Disk /dev/[sh]d'

[root@jupiter]# export LANG=en_EN; fdisk -l 2>/dev/null| egrep 'Disk /dev/[sh]d'
Disk /dev/sda: 250.0 GB, 250059350016 bytes

As you can see, this server has only one hard drive, named /dev/sda.
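On a server with several drives it can help to reduce the fdisk output to a plain list of device paths. The following is a minimal sketch, not part of smartmontools: the function name list_disks is made up for illustration, and it assumes the "Disk /dev/sdX: ..." line format shown above.

```shell
# Print the device path of every whole disk found in `fdisk -l` output.
# The output is read from stdin, so the function also works on saved text.
# list_disks is a hypothetical helper name, not an existing tool.
list_disks() {
    # Lines of interest look like: "Disk /dev/sda: 250.0 GB, 250059350016 bytes"
    awk '$1 == "Disk" && $2 ~ /^\/dev\/[sh]d[a-z]+:$/ {
        sub(/:$/, "", $2)   # strip the trailing colon from the device name
        print $2
    }'
}

# Possible usage on a live system (needs root):
#   LC_ALL=C fdisk -l 2>/dev/null | list_disks
```

Each printed path can then be passed to the smartctl commands in the rest of this tutorial.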


Enable SMART

To see if your hard drive supports SMART, run:

smartctl -i /dev/sda

If SMART is disabled, enable it:

smartctl -s on /dev/sda

While SMART is disabled, smartctl commands print something like this:

[root@proxy]# smartctl -A /dev/sda
 smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
 Home page is http://smartmontools.sourceforge.net

 SMART Disabled. Use option -s with argument 'on' to enable it.
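If you want to script this check, one option is to look for the "SMART support is:" lines that smartctl -i prints. This is only a sketch: the helper name smart_state is hypothetical, and it assumes the wording of those lines matches the smartctl output shown in this tutorial.

```shell
# Report whether SMART is enabled, based on `smartctl -i` output on stdin.
# Prints "enabled", "disabled", or "unknown" (no matching line was found).
# smart_state is a hypothetical helper name used for illustration.
smart_state() {
    awk '
        /^[[:space:]]*SMART support is:[[:space:]]+Enabled/  { state = "enabled" }
        /^[[:space:]]*SMART support is:[[:space:]]+Disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Possible usage (needs root and a real drive):
#   if [ "$(smartctl -i /dev/sda | smart_state)" = "disabled" ]; then
#       smartctl -s on /dev/sda
#   fi
```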

Get All The Disk Info

To get all the information about your hard disks, run:

smartctl -a /dev/sda

To show shorter output (attribute values only), run:

smartctl -A /dev/sda

NOTE: Both commands produce long output, so the notes below point out the most important things to look for. You should also know about the attributes that show the “health” of some of the drive’s parameters.

  • Each vendor has its own recommendations for these values
  • Detailed information about all the attributes can be found on Wikipedia, in the article about S.M.A.R.T. attributes

You can use the rules below. (I’ve verified them on a real server.)

The most important ones are:

  • 5 Reallocated_Sector_Ct
  • 196 Reallocated_Event_Count
  • 197 Current_Pending_Sector

A good raw value for each of these attributes is 0.

#5 and #196 may have nonzero values (for example 55) and the disk will still work fine; just monitor them to ensure that the values don’t increase. (HINT: check them before and after running a long test, or from time to time using a monitoring system like Zabbix or Nagios.) For most hosting companies, a value of 25 is enough for them to replace the disk.

#197 is the most critical value. If it shows 1 or more, you should replace the disk as soon as possible!
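These rules of thumb can be applied automatically to smartctl -A output. The sketch below is one way to do it and makes a few assumptions: the function name check_smart_attrs and the status labels are invented here, RAW_VALUE is taken to be the tenth column as in the example output in this tutorial, and the limit of 25 is simply the figure quoted above, not a vendor threshold.

```shell
# Apply the rules of thumb above to `smartctl -A` output read from stdin.
# Prints one line per watched attribute (5, 196, 197) and returns non-zero
# if any of them is not OK. check_smart_attrs is a hypothetical helper name.
check_smart_attrs() {
    awk -v limit=25 '
        NF >= 10 && ($1 == 5 || $1 == 196 || $1 == 197) {
            raw = $10 + 0                     # RAW_VALUE is the 10th column
            if ($1 == 197 && raw > 0)   status = "CRITICAL"  # pending sectors
            else if (raw >= limit)      status = "REPLACE"   # limit from the text
            else if (raw > 0)           status = "WATCH"     # nonzero, keep monitoring
            else                        status = "OK"
            if (status != "OK") bad = 1
            printf "%s %s raw=%d %s\n", $1, $2, raw, status
        }
        END { exit bad + 0 }
    '
}

# Possible usage (needs root and a real drive):
#   smartctl -A /dev/sda | check_smart_attrs || echo "disk needs attention"
```

Because the function returns a non-zero status when something needs attention, it could be called from a cron job or from a monitoring system such as Zabbix or Nagios.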

Let’s look at a real example below. Pay attention to the attributes described above and to the self-test status.

 [root@jupiter]# smartctl -a /dev/sda
 smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
 Home page is http://smartmontools.sourceforge.net/

 === START OF INFORMATION SECTION ===
 Device Model: SAMSUNG HD251HJ
 Serial Number: S13QJ90S205235
 Firmware Version: 1AC01113
 User Capacity: 250,059,350,016 bytes
 Device is: In smartctl database [for details use: -P show]
 ATA Version is: 8
 ATA Standard is: ATA-8-ACS revision 3b
 Local Time is: Wed Sep 16 00:06:01 2015 EEST

 ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
 General SMART Values:
 Offline data collection status: (0x00) Offline data collection activity
 was never started.
 Auto Offline Data Collection: Disabled.
 Self-test execution status: ( 248) Self-test routine in progress...
 80% of test remaining.
 Total time to complete Offline
 data collection: (3297) seconds.
 Offline data collection
 capabilities: (0x7b) SMART execute Offline immediate.
 Auto Offline data collection on/off support.
 Suspend Offline collection upon new
 command.
 Offline surface scan supported.
 Self-test supported.
 Conveyance Self-test supported.
 Selective Self-test supported.
 SMART capabilities: (0x0003) Saves SMART data before entering
 power-saving mode.
 Supports SMART auto save timer.
 Error logging capability: (0x01) Error logging supported.
 General Purpose Logging supported.
 Short self-test routine
 recommended polling time: ( 2) minutes.
 Extended self-test routine
 recommended polling time: ( 56) minutes.
 Conveyance self-test routine
 recommended polling time: ( 7) minutes.
 SCT capabilities: (0x003f) SCT Status supported.
 SCT Feature Control supported.
 SCT Data Table supported.

 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
 3 Spin_Up_Time 0x0007 094 094 011 Pre-fail Always - 2730
 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 65
 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
 7 Seek_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
 8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 9587
 9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 15404
 10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0
 11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0
 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 65
 13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age Always - 0
 183 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 2
 184 Unknown_Attribute 0x0033 100 100 000 Pre-fail Always - 0
 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
 188 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
 190 Airflow_Temperature_Cel 0x0022 080 069 000 Old_age Always - 20 (Lifetime Min/Max 19/23)
 194 Temperature_Celsius 0x0022 079 067 000 Old_age Always - 21 (Lifetime Min/Max 19/25)
 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 17016028
 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 6
 200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 0
 201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0

 SMART Error Log Version: 1
 No Errors Logged

 SMART Self-test log structure revision number 1
 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
 # 1 Extended offline Self-test routine in progress 80% 15404 -
 
 SMART Selective self-test log data structure revision number 1
 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
 1 0 0 Not_testing
 2 0 0 Not_testing
 3 0 0 Not_testing
 4 0 0 Not_testing
 5 0 0 Not_testing
 Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
 If Selective self-test is pending on power-up, resume after 0 minute delay.

 


Run A “long test”

There are a few options for testing, but we will use the “long test” because it’s the most useful. Run:

smartctl -t long /dev/sda

Note: the test time depends on the size of your disk. Progress is shown as the percentage of the test remaining, which counts down from 100% to 0%.

[root@jupiter]# smartctl -t long /dev/sda
 smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
 Home page is http://smartmontools.sourceforge.net
 === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
 Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
 Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
 Testing has begun.
 Please wait 56 minutes for test to complete.
 Test will complete after Wed Sep 16 00:58:04 2015
 Use smartctl -X to abort test.
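To follow a running test from a script, you can pull the "NN% of test remaining." line out of the smartctl -a output. A minimal sketch follows; the helper name selftest_remaining is hypothetical, and it assumes that line is worded as in the example output shown earlier.

```shell
# Print the percentage of a running self-test that is still remaining,
# based on `smartctl -a` output read from stdin. Prints nothing when no
# "NN% of test remaining." line is present (no test in progress).
# selftest_remaining is a hypothetical helper name.
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining\..*$/\1/p'
}

# Possible polling loop (needs root and a real drive):
#   while left=$(smartctl -a /dev/sda | selftest_remaining); [ -n "$left" ]; do
#       echo "$left% remaining"
#       sleep 300
#   done
```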

While a test is in progress, you can follow its status in the smartctl -a output, as in the example shown earlier.

After the test is finished, check the “Status” and the “Attributes” for the information you need.

Below is an example of a normal disk:

[root@jupiter]# smartctl -a /dev/sda
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   094   094   011    Pre-fail  Always       -       2730
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       65
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9587
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       15540
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       65
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2
184 Unknown_Attribute       0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   076   069   000    Old_age   Always       -       24 (Lifetime Min/Max 19/25)
194 Temperature_Celsius     0x0022   075   067   000    Old_age   Always       -       25 (Lifetime Min/Max 19/26)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       118939301
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       6
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     15408         -

Hints

The following command will display all the important values in a short list:

smartctl -a -d sat /dev/sda | egrep '^(Serial|Device M|  5|196|197|User|# [1-9])|[1-9]0%'

[root@jupiter]# smartctl -a -d sat /dev/sda | egrep '^(Serial|Device M|  5|196|197|User|# [1-9])|[1-9]0%'
 Device Model: SAMSUNG HD251HJ
 Serial Number: S13QJ90S205235
 User Capacity: 250,059,350,016 bytes
 80% of test remaining.
 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
 # 1 Extended offline Self-test routine in progress 80% 15404 -

 

For illustration, I’ve taken the following from a real server with failing disks:

...
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 15849 151732779 <<---- (this one)
# 2 Extended offline Completed without error 00% 13679 -
...

To confirm that the hard drive should be replaced, try reading the failing sector directly:

dd if=/dev/sdN of=/dev/null skip=SECTOR count=1

dd if=/dev/sda of=/dev/null skip=151732779 count=1
dd: /dev/sda: Input/output error
0+0 records in
0+0 records out
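The sector number for the dd check can be taken from the self-test log instead of being copied by hand. The sketch below assumes the log format shown above, where the LBA is the last numeric field of the entry. The helper name first_bad_lba is hypothetical.

```shell
# Read `smartctl -l selftest` (or `smartctl -a`) output on stdin and print
# the LBA of the first self-test log entry that reports a read failure.
# first_bad_lba is a hypothetical helper name.
first_bad_lba() {
    awk '/^[[:space:]]*# *[0-9]+ / && /read failure/ {
        # Walk the fields from the end and take the first purely numeric one.
        for (i = NF; i > 0; i--) {
            if ($i ~ /^[0-9]+$/) { print $i; exit }
        }
    }'
}

# Possible usage (needs root and a real drive; assumes the LBA counts
# 512-byte sectors, which matches the default block size of dd):
#   lba=$(smartctl -l selftest /dev/sda | first_bad_lba)
#   [ -n "$lba" ] && dd if=/dev/sda of=/dev/null bs=512 skip="$lba" count=1
```

If dd reports an input/output error for that sector, as in the example above, the read failure is confirmed.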

Note: Given the information above, any additional load on the disk may break it further! This information is enough for your hosting provider or support staff to replace the disk.

Sometimes the “long test” may hang for a long time (for example at 10%). In this case, you may want to check by other means whether the disk has any problems.

You can use a utility like dd, but remember that this test will take a long time. dd will read your drive block by block, looking for errors.

A helpful one-liner (set DISK to your drive’s name, for example sda):

DISK=sda; dd if=/dev/$DISK of=/dev/null bs=128K > /root/dd_$DISK.log 2>&1 &

When it has finished (check with ps aux | grep 'dd '), look at the results:

cat /root/dd_*.log


Conclusion

This is not everything there is to know about SMART, but after reading this tutorial you have seen the most common situations with real-world examples. You now have enough knowledge to check the health of your disks.

More detailed information can be found on the Wikipedia page about S.M.A.R.T. and on the smartmontools website.

To keep your data safe, use software RAID and always make backups!


