Hardware RAID and consumer HDD problems

A couple of years ago I was lucky enough to have the opportunity to purchase a second-hand hardware RAID 5 card (an Areca 1220) for a very low price. Since then I have used the card with several sets of drives (originally 300 GB, then 750 GB and currently 1 TB) in a dedicated server PC as a large network file store for the family’s music, photos, videos and back-ups.

Before I obtained the hardware card I tried software RAID, but found the results very disappointing. The server has a low-power, single-core CPU which isn’t really up to the task of acting as a RAID 5 engine. I’ve heard plenty of times that RAID isn’t a back-up, but this is a case where only a cheap solution will do. RAID 5 offers protection from a single drive failure, which is good enough for my purposes. The dedicated card offers an enormous performance advantage, but in practice this isn’t very important. The features it adds, however, are! The Areca card offers an OS-independent RAID solution, which counts for a lot. It also offers online capacity expansion and RAID-level migration (so, for example, I could upgrade to RAID 6). Both of these features are much harder to get with cheaper solutions.

So, you might think, what’s the problem? The answer: the lack of options from hard disk manufacturers…

Ever since I started using the Areca card I have suffered from occasional drive “failures”. After powering the server off and on again, the drive reappears as fully functional. I then have to spend many hours rebuilding the array from degraded back to normal. After much searching I have diagnosed the problem, but am unable to properly solve it.

Hard drive manufacturers provide a range of drives for different purposes. The typical drives most of us buy are consumer-level drives. The manufacturers also offer enterprise-class drives designed for servers, which have intensive use patterns and 24/7 uptime. These drives are often physically identical, but have undergone additional testing and are supplied with slightly different firmware, optimised for server workloads.

One of these firmware features is Error Recovery Control (ERC). This feature is also called CCTL (Command Completion Time Limit) by Samsung and Hitachi and TLER (Time-Limited Error Recovery) by Western Digital. All drives suffer the occasional error at a physical level, which could be caused by things like stray cosmic rays. These errors are handled by redundancy built into the way the drive stores data, but occasionally one can be severe enough to cause problems reading data. Normal consumer drives will spend a prolonged period attempting to read the damaged data to recover it. They then map it to a new part of the drive and everything continues as normal. However, this delay can cause severe problems in enterprise environments, so enterprise drives will time out their self-repair attempts after a short period (usually 7 seconds or so) and report the error to the RAID controller. The RAID controller then handles the error by recalculating the data using the other drives in the array. This prevents large delays in sending data, but requires the presence of other drives and a RAID controller.
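To make "recalculating the data using the other drives" a little more concrete, here is a minimal sketch of the XOR parity idea behind RAID 5. It is an illustration only: the block contents and the helper name xor_blocks are made up for the example, and a real controller such as the Areca does this in hardware across rotating parity stripes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks from one stripe, each on a different drive (made-up contents).
data = [b"musi", b"phot", b"vide"]

# The parity block for the stripe lives on a fourth drive.
parity = xor_blocks(data)

# The second drive reports a read error: rebuild its block from the
# surviving data blocks plus the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, any single missing block can be recovered this way, which is why the controller would rather be told about an error quickly than wait for the drive to sort it out.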

So, I have a proper hardware RAID card. It expects to hear back from drives within no more than 7–8 seconds, regardless of any error. I also have consumer hard drives, which attempt to repair their own errors for a long period. So when an error occurs the drive tries to fix it, doesn’t respond within 7–8 seconds, and the RAID controller then assumes the drive has failed and kicks it out of the array.
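The mismatch can be written down as a toy model. This is only a sketch of the reasoning above, not how any controller firmware actually works; the function name, the 8 second default and the example recovery times are assumptions chosen for illustration.

```python
def drive_dropped(recovery_seconds, controller_timeout=8.0, erc_limit=None):
    """Return True if the controller would give up on the drive.

    recovery_seconds:   how long the drive would spend recovering the sector
                        if left to itself.
    controller_timeout: how long the controller waits for any response
                        (assumed to be 8 seconds here).
    erc_limit:          the drive's error recovery cap in seconds,
                        or None for a consumer drive with no cap.
    """
    # With ERC the drive answers (with data or an error) no later than erc_limit.
    if erc_limit is None:
        response_time = recovery_seconds
    else:
        response_time = min(recovery_seconds, erc_limit)
    return response_time > controller_timeout


consumer = drive_dropped(30.0)                   # no cap: the drive stays silent too long
enterprise = drive_dropped(30.0, erc_limit=7.0)  # capped at 7 s: the drive answers in time
```

In this model the consumer drive is dropped and the enterprise drive is not, even though the underlying media error is identical in both cases.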

So, the obvious solutions would be either to tell the RAID controller to wait longer before kicking a drive out, or to tell the drive to give up after 7 seconds like an enterprise drive… Infuriatingly, neither is possible!

I have searched extensively, but I can’t find any proper RAID 5 cards which allow the user to change how long they will wait for a drive. In the past there were some WD drives which could have the TLER feature enabled with a utility released by WD called WD-TLER, but recently WD have disabled this option, presumably to “protect” the huge markup on their enterprise drives (which are double the price for the same hardware).

Some people have found ways to temporarily enable ERC on some drives using HDAT2, smartctl or hdparm. However, these do not support my RAID card under Windows, and the change is lost if the PC is power cycled.
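For drives that smartctl can talk to directly, the setting is made with its "-l scterc" option, which takes the read and write limits in tenths of a second. Since the change does not survive a power cycle, it would have to be reapplied at every boot. The following is a rough sketch of how that could be scripted; the device names are hypothetical, the helper functions are my own, and it assumes smartmontools is installed, the script runs with administrator rights, and the drive firmware accepts the command at all (many consumer drives do not).

```python
import subprocess


def scterc_command(device, read_seconds=7.0, write_seconds=7.0):
    """Build the smartctl argument list that sets SCT Error Recovery Control."""
    # smartctl expects the limits in tenths of a second, so 7 s becomes 70.
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def apply_erc(devices):
    """Run the command for each device, e.g. from a start-up script."""
    for device in devices:
        # check=True raises CalledProcessError if smartctl reports a failure.
        subprocess.run(scterc_command(device), check=True)


# Show the command that would be run for one (hypothetical) device.
print(" ".join(scterc_command("/dev/sda")))
```

Calling apply_erc(["/dev/sda", "/dev/sdb"]) from a boot-time script would reapply the limit after each power cycle. It does nothing for drives hidden behind a controller that smartctl cannot address, which is exactly my situation.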

For users like myself who need large-capacity storage and the features offered by a hardware RAID 5 solution, but who do not need 24/7 uptime, long warranties or drives designed for heavy-duty usage, there is currently NO appropriate solution. It’s about time either a drive manufacturer addressed this market (by releasing a consumer drive with ERC enabled for a small premium, e.g. 15%) or a RAID card manufacturer addressed it by offering a card with the option to increase the time before drives are timed out. Creating either of these solutions is trivial; a simple firmware tweak would do the job.

Until then, I advise others to avoid using hardware RAID cards with consumer drives, and given the price premium of enterprise drives I recommend avoiding hardware RAID altogether.
