Hardware RAID and consumer HDD problems

A couple of years ago I was lucky enough to have the opportunity to purchase a second-hand hardware RAID 5 card (an Areca 1220) for a very low price. Since then I have used the card with several sets of drives (originally 300 GB, then 750 GB and currently 1 TB) in a dedicated server PC as a large network file-store for the family’s music, photos, videos and back-ups.

Before I obtained the hardware card I tried using software RAID, but found the results very disappointing. The server has a low-power, single-core CPU which isn’t really up to the task of acting as a RAID 5 engine. Whilst I’ve heard plenty of times that RAID isn’t a back-up, this is a case where only a cheap solution will do. RAID 5 offers protection from a single drive failure, which is good enough for my purposes. The dedicated card offers an enormous performance advantage, but in practice this isn’t very important. The features it adds, however, are! The Areca card offers an OS-independent RAID solution, which counts for a lot. It also offers online capacity expansion and RAID-level migration (so, for example, I could upgrade to RAID 6). Both of these features are much harder to get with cheaper solutions.

So, you might think, what’s the problem? The answer: the lack of options from hard disk manufacturers…

Ever since I started using the Areca card I have suffered from occasional drive “failures”. After a power cycle the drive reappears as fully functional. I then have to spend many hours rebuilding the array from degraded back to normal. After much searching I have diagnosed the problem, but am unable to properly solve it.

Hard drive manufacturers provide a range of drives for different purposes. The typical drives most of us buy are consumer-level drives. The manufacturers also offer enterprise-class drives designed for servers which have intensive use patterns and 24/7 uptime. These drives are often physically identical, but have undergone additional testing and are supplied with slightly different firmware, optimised for server workloads.

One of these firmware differences is Error Recovery Control (ERC). This feature is also called CCTL (Command Completion Time Limit) by Samsung and Hitachi, and TLER (Time-Limited Error Recovery) by Western Digital. All drives suffer the occasional error at a physical level, which could be caused by things like stray cosmic rays. These errors are handled by redundancy built into the way the drive stores data, but occasionally one can be severe enough to cause problems reading data. Normal consumer drives will spend a prolonged period attempting to read the damaged data to recover it. They then map it to a new part of the drive and everything continues as normal. However, this delay can cause severe problems in enterprise environments, so enterprise drives will time out their self-repair attempts after a short period (usually 7 seconds or so) and report the error to the RAID controller. The RAID controller then handles the error by recalculating the data using the other drives in the array. This prevents large delays in sending data, but requires the presence of other drives and a RAID controller.
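To make the "recalculating the data using the other drives" step concrete, here is a minimal sketch of how RAID 5 style parity lets a controller rebuild one unreadable block. It is an illustration only: the block contents are made up, the helper name `xor_blocks` is mine, and a real controller works on whole stripes and rotates the parity between drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks from one stripe (made-up contents for illustration).
data = [b"musi", b"phot", b"vide"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Suppose the drive holding data[1] reports a read error.
# XOR-ing the surviving data blocks with the parity gives the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

This only works while a single block per stripe is missing, which is why RAID 5 protects against one drive failure and no more.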

So, I have a proper hardware RAID card. It expects to hear back from drives within no more than 7–8 seconds, regardless of an error. I also have consumer hard drives, which attempt to repair their own errors for a long period. So when an error occurs the drive tries to fix it, doesn’t respond within 7–8 seconds, and the RAID controller then assumes the drive has failed and kicks it out of the array.
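The mismatch can be shown with a toy model. This is a sketch under stated assumptions rather than a description of how the Areca firmware actually behaves: the function name, the 8 second controller timeout and the 30 second consumer recovery time are illustrative values, not measured ones.

```python
def controller_verdict(drive_recovery_s, erc_limit_s=None, controller_timeout_s=8.0):
    """Model what the controller concludes about one difficult read.

    drive_recovery_s     -- how long the drive would need to recover the sector itself
    erc_limit_s          -- ERC/TLER/CCTL limit, or None for a consumer drive without it
    controller_timeout_s -- how long the controller waits before giving up (assumed)
    """
    if erc_limit_s is not None and drive_recovery_s > erc_limit_s:
        # Enterprise behaviour: stop retrying at the limit and report the error.
        response_time_s = erc_limit_s
        outcome = "error reported; data rebuilt from the other drives"
    else:
        # Consumer behaviour: keep retrying until the read succeeds.
        response_time_s = drive_recovery_s
        outcome = "read completed by the drive"

    if response_time_s > controller_timeout_s:
        return "drive dropped from the array"
    return outcome


print(controller_verdict(30.0))                   # consumer drive, slow recovery
print(controller_verdict(30.0, erc_limit_s=7.0))  # same error with a 7 s ERC limit
print(controller_verdict(2.0))                    # quick recovery, no problem either way
```

The drive is healthy in all three cases. Only the first one ends with a degraded array, and the only thing that differs is who gives up first.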

So, the obvious solutions would be either to tell the RAID controller to wait longer before kicking a drive out, OR to tell the drive to give up after 7 seconds like an enterprise drive… Infuriatingly, neither is possible!

I have searched extensively, but I can’t find any proper RAID 5 cards which allow the user to change how long they will wait for a drive. In the past there were some WD drives which could have the TLER feature enabled with a utility released by WD called WD-TLER, but recently WD have disabled this option, presumably to “protect” the huge markup on their enterprise drives (which are double the price for the same hardware).

Some people have found ways to temporarily enable ERC on some drives using HDAT2, smartctl or hdparm. However, these do not support my RAID card under Windows, and the change is lost if the PC is power cycled.
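For drives that are directly visible to the operating system, the smartctl route looks roughly like the sketch below. It relies on smartctl's `-l scterc,READTIME,WRITETIME` option, which takes the limits in tenths of a second. The device path `/dev/sda` and both function names are placeholders of mine, the drive firmware has to accept the command for it to have any effect, and none of this gets around the limitation described above for drives sitting behind the RAID card.

```python
import subprocess


def build_scterc_command(device, seconds=7.0):
    """Build the smartctl command that requests a read/write ERC limit.

    smartctl expects the limits in tenths of a second, so 7 s becomes 70.
    """
    tenths = round(seconds * 10)
    return ["smartctl", "-l", f"scterc,{tenths},{tenths}", device]


def set_erc(device, seconds=7.0):
    """Run the command (needs smartmontools installed and admin rights)."""
    return subprocess.run(
        build_scterc_command(device, seconds),
        capture_output=True,
        text=True,
        check=False,
    )


# Only print the command here; call set_erc() yourself to actually apply it.
print(" ".join(build_scterc_command("/dev/sda")))
```

Because the setting does not survive a power cycle, it would have to be reapplied on every boot, for example from a start-up script.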

For users like myself who need large-capacity storage and the features offered by a hardware RAID 5 solution, but who do not need 24/7 uptime, long warranties or drives designed for heavy-duty usage, there is currently NO appropriate solution. It’s about time either a drive manufacturer addressed this market (by releasing a consumer drive with ERC enabled for a small, e.g. 15%, premium) or a RAID card manufacturer addressed it by offering a card with the option to increase the time before drives are timed out. Creating either of these solutions is trivial: a simple firmware tweak would do the job.

Until then, I advise others to avoid using hardware RAID cards with consumer drives, and given the price premium of enterprise drives I recommend avoiding hardware RAID altogether.

