
Tuning HDD

mdadm

sysfs

scheduler

root@debian:~# cat /sys/block/sda/queue/scheduler 
noop deadline [cfq]

It is also possible to specify it on the kernel command line with elevator=noop.
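
A quick sketch (assuming the disk is /dev/sda and GRUB is the bootloader): the scheduler can be changed on the fly through sysfs, or set permanently on the kernel command line:

root@debian:~# echo noop > /sys/block/sda/queue/scheduler
root@debian:~# cat /sys/block/sda/queue/scheduler 
[noop] deadline cfq
root@debian:~# grep elevator /etc/default/grub     # make it permanent, then run update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=noop"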

Warning: CFQ is the only scheduler that takes ionice into account!
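
As an illustration (a minimal sketch; the PID and the tar command are only examples), ionice sets the I/O class and priority of a process, which only has an effect under CFQ:

root@debian:~# ionice -c2 -n7 -p 1234                  # best-effort class, lowest priority, for an existing PID
root@debian:~# ionice -c3 tar czf /tmp/home.tgz /home  # start a command in the idle I/O class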

  • CFQ (cfq, Completely Fair Queuing) is an I/O scheduler for the Linux kernel and the default under many Linux distributions.
  • Noop scheduler (noop) is the simplest I/O scheduler for the Linux kernel, based on a FIFO queue concept.
  • Anticipatory scheduler (anticipatory) is an algorithm for scheduling hard disk input/output; it is an older scheduler that has been replaced by CFQ.
  • Deadline scheduler (deadline) attempts to guarantee a start service time for each request.

nr_requests

This represents the queue size, i.e. the number of I/Os that the kernel is allowed to process simultaneously.

Note: to see how full the queue is, look at the avgqu-sz column in the output of iostat -x:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00

Warning: if the queue is full, the total memory taken by the pending requests is 2 * nr_requests * max_sectors_kb, so be careful to keep this limit reasonable in order to avoid out-of-memory errors.
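
A minimal sketch (assuming /dev/sda; the values are only illustrative) showing how to read and raise nr_requests, and how to estimate the worst-case memory use with the formula above:

root@debian:~# cat /sys/block/sda/queue/nr_requests
128
root@debian:~# echo 256 > /sys/block/sda/queue/nr_requests
# worst case: 2 * nr_requests * max_sectors_kb = 2 * 256 * 512 kB = 256 MB of pending requests for this device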

max_sectors_kb

This is the maximum size of a single I/O, in kilobytes.

This is a very important parameter! If the server is connected to a storage array (a SAN for example), you need to know the maximum I/O size the array accepts. Too small and you penalize the server's performance; too large and you penalize the array, which has to split the I/Os into smaller pieces, loading its CPU and collapsing overall performance! It is therefore important to check the best practices for the array in question…
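
For example (a sketch assuming /dev/sda and a 512 kB maximum recommended by the array vendor), check the hardware limit first, then set the value:

root@debian:~# cat /sys/block/sda/queue/max_hw_sectors_kb
32767
root@debian:~# echo 512 > /sys/block/sda/queue/max_sectors_kb
root@debian:~# cat /sys/block/sda/queue/max_sectors_kb
512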

read_ahead_kb

http://wiki.deimos.fr/Optimiser_les_performances_des_disques_dur_sur_Linux

This is read-ahead: the kernel reads more blocks than it was asked for, because when a block is read, the next one often has to be read right after. This only makes sense for applications that read data sequentially! It is useless for random access.

This has the advantage of improving response times and the speed of sequential reads.

The algorithm is designed to stop on its own when it detects too much random access, so as not to degrade performance.
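
A quick sketch (assuming /dev/sda): read-ahead can be set either through sysfs, in kB, or with blockdev, in 512-byte sectors:

root@debian:~# cat /sys/block/sda/queue/read_ahead_kb
128
root@debian:~# echo 4096 > /sys/block/sda/queue/read_ahead_kb
root@debian:~# blockdev --setra 8192 /dev/sda          # same value expressed in 512-byte sectors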

rotational

Specifies whether the device is an SSD (0) or a classic rotational disk (1).
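
For example (assuming /dev/sda), the flag can be checked and, for a device wrongly detected (e.g. an SSD behind some RAID controllers), overridden by hand:

root@debian:~# cat /sys/block/sda/queue/rotational
1
root@debian:~# echo 0 > /sys/block/sda/queue/rotational    # tell the kernel to treat the device as an SSD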

tuning cfq

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Performance_Tuning_Guide/index.html

http://fr.wikipedia.org/wiki/Completely_Fair_Queuing

  • back_seek_max : Backward seeks are typically bad for performance, as they can incur greater delays in repositioning the heads than forward seeks do. However, CFQ will still perform them, if they are small enough. This tunable controls the maximum distance in KB the I/O scheduler will allow backward seeks. The default is 16 KB.
  • back_seek_penalty : Because of the inefficiency of backward seeks, a penalty is associated with each one. The penalty is a multiplier; for example, consider a disk head position at 1024KB. Assume there are two requests in the queue, one at 1008KB and another at 1040KB. The two requests are equidistant from the current head position. However, after applying the back seek penalty (default: 2), the request at the later position on disk is now twice as close as the earlier request. Thus, the head will move forward.
  • fifo_expire_async : This tunable controls how long an async (buffered write) request can go unserviced. After the expiration time (in milliseconds), a single starved async request will be moved to the dispatch list. The default is 250 ms.
  • fifo_expire_sync : This is the same as the fifo_expire_async tunable, for synchronous (read and O_DIRECT write) requests. The default is 125 ms.
  • group_idle : When set, CFQ will idle on the last process issuing I/O in a cgroup. This should be set to 1 when using proportional weight I/O cgroups and setting slice_idle to 0 (typically done on fast storage).
  • group_isolation : If group isolation is enabled (set to 1), it provides a stronger isolation between groups at the expense of throughput. Generally speaking, if group isolation is disabled, fairness is provided for sequential workloads only. Enabling group isolation provides fairness for both sequential and random workloads. The default value is 0 (disabled). Refer to Documentation/cgroups/blkio-controller.txt for further information.
  • low_latency : When low latency is enabled (set to 1), CFQ attempts to provide a maximum wait time of 300 ms for each process issuing I/O on a device. This favors fairness over throughput. Disabling low latency (setting it to 0) ignores target latency, allowing each process in the system to get a full time slice. Low latency is enabled by default.
  • quantum : The quantum controls the number of I/Os that CFQ will send to the storage at a time, essentially limiting the device queue depth. By default, this is set to 8. The storage may support much deeper queue depths, but increasing quantum will also have a negative impact on latency, especially in the presence of large sequential write workloads.
  • slice_async : This tunable controls the time slice allotted to each process issuing asynchronous (buffered write) I/O. By default it is set to 40 ms.
  • slice_idle : This specifies how long CFQ should idle while waiting for further requests. The default value in Red Hat Enterprise Linux 6.1 and earlier is 8 ms. In Red Hat Enterprise Linux 6.2 and later, the default value is 0. The zero value improves the throughput of external RAID storage by removing all idling at the queue and service tree level. However, a zero value can degrade throughput on internal non-RAID storage, because it increases the overall number of seeks. For non-RAID storage, we recommend a slice_idle value that is greater than 0.
  • slice_sync : This tunable dictates the time slice allotted to a process issuing synchronous (read or direct write) I/O. The default is 100 ms.
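
A minimal sketch of applying a few of these tunables (illustrative values, assuming /dev/sda is using the cfq scheduler, e.g. for fast external RAID storage):

root@debian:~# echo 0 > /sys/block/sda/queue/iosched/slice_idle      # no idling on fast external RAID
root@debian:~# echo 1 > /sys/block/sda/queue/iosched/group_idle      # keep fairness between cgroups
root@debian:~# echo 1 > /sys/block/sda/queue/iosched/low_latency     # favor fairness over raw throughput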

tuning deadline

http://fr.wikipedia.org/wiki/Deadline_scheduler

  • fifo_batch : This determines the number of reads or writes to issue in a single batch. The default is 16. Setting this to a higher value may result in better throughput, but will also increase latency.
  • front_merges : You can set this tunable to 0 if you know your workload will never generate front merges. Unless you have measured the overhead of this check, it is advisable to leave it at its default setting (1).
  • read_expire : This tunable allows you to set the number of milliseconds in which a read request should be serviced. By default, this is set to 500 ms (half a second).
  • write_expire : This tunable allows you to set the number of milliseconds in which a write request should be serviced. By default, this is set to 5000 ms (five seconds).
  • writes_starved : This tunable controls how many read batches can be processed before processing a single write batch. The higher this is set, the more preference is given to reads.
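
A minimal sketch for the deadline tunables (illustrative values, assuming /dev/sda is using the deadline scheduler):

root@debian:~# echo deadline > /sys/block/sda/queue/scheduler
root@debian:~# echo 32 > /sys/block/sda/queue/iosched/fifo_batch        # bigger batches: more throughput, more latency
root@debian:~# echo 250 > /sys/block/sda/queue/iosched/read_expire      # serve reads within 250 ms
root@debian:~# echo 4 > /sys/block/sda/queue/iosched/writes_starved     # up to 4 read batches per write batch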

Script

Source : http://pastebin.com/qajPwY6K#

tuning.sh
#!/bin/bash
###############################################################################
#  simple script to set some parameters to increase performance on a mdadm
# raid5 or raid6. Adjust the ## parameters ##-section to your system!
#
#  WARNING: depending on stripesize and the number of devices the array might
# use QUITE a lot of memory after optimization!
#
#  27may2010 by Alexander Peganz
#  09oct2010 by Rafael Fonseca -- added option to tune2fs different drive
###############################################################################
 
 
## parameters ##
MDDEV=md0               # e.g. md51 for /dev/md51
#FSDEV=$MDDEV     	    # same as above for filesystem on device
FSDEV=lvm-raid/lvm0     # used for LVM on RAID setups - e.g. /dev/lvm-raid/lvm
CHUNKSIZE=1024          # in kb
BLOCKSIZE=4             # of file system in kb
NCQ=enable              # disable, enable. Anything else keeps the current setting
NCQDEPTH=31             # 31 should work for almost anyone
FORCECHUNKSIZE=true     # force max_sectors_kb to the chunk size even when it is > 512
DOTUNEFS=true           # run tune2fs, ONLY SET TO true IF YOU USE EXT[34]
RAIDLEVEL=raid5         # raid5, raid6
 
 
## code ##
# test for privileges
if [ "$(whoami)" != 'root' ]
then
  echo $(date): Need to be root
  exit 1
fi
 
# set number of parity devices
NUMPARITY=1
if [[ $RAIDLEVEL == "raid6" ]]
then
  NUMPARITY=2
fi
 
# get all devices
DEVSTR="`grep \"^$MDDEV : \" /proc/mdstat` eol"
while [ -z "`expr match \"$DEVSTR\" '\(\<sd[a-z]\[[12]\?[0-9]\]\((S)\)\? \)'`" ]
do
  DEVSTR="`echo $DEVSTR|cut -f 2- -d \ `"
done
 
# get active devices list and spares list
DEVS=""
SPAREDEVS=""
while [ "$DEVSTR" != "eol" ]; do
  CURDEV="`echo $DEVSTR|cut -f -1 -d \ `"
  if [ -n "`expr match \"$CURDEV\" '\(\<sd[a-z]\[[12]\?[0-9]\]\((S)\)\)'`" ]
  then
    SPAREDEVS="$SPAREDEVS${CURDEV:2:1}"
  elif [ -n "`expr match \"$CURDEV\" '\(\<sd[a-z]\[[12]\?[0-9]\]\)'`" ]
  then
    DEVS="$DEVS${CURDEV:2:1}"
  fi
  DEVSTR="`echo $DEVSTR|cut -f 2- -d \ `"
done
NUMDEVS=${#DEVS}
NUMSPAREDEVS=${#SPAREDEVS}
 
# test if number of devices makes sense
if [ ${#DEVS} -lt $[1+$NUMPARITY] ]
then
  echo $(date): Need more devices
  exit 1
fi
 
# set read ahead
RASIZE=$[$NUMDEVS*($NUMDEVS-$NUMPARITY)*2*$CHUNKSIZE]   # in 512b blocks
echo read ahead size per device: $RASIZE blocks \($[$RASIZE/2]kb\)
MDRASIZE=$[$RASIZE*$NUMDEVS]
echo read ahead size of array: $MDRASIZE blocks \($[$MDRASIZE/2]kb\)
blockdev --setra $RASIZE /dev/sd[$DEVS]
if [ $NUMSPAREDEVS -gt 0 ]
then
  blockdev --setra $RASIZE /dev/sd[$SPAREDEVS]
fi
blockdev --setra $MDRASIZE /dev/$MDDEV
 
# set stripe cache size
STRCACHESIZE=$[$RASIZE/8]                               # in pages per device
echo stripe cache size of devices: $STRCACHESIZE pages \($[$STRCACHESIZE*4]kb\)
echo $STRCACHESIZE > /sys/block/$MDDEV/md/stripe_cache_size
 
# set max sectors kb
DEVINDEX=0
MINMAXHWSECKB=$(cat /sys/block/sd${DEVS:0:1}/queue/max_hw_sectors_kb)
until [ $DEVINDEX -ge $NUMDEVS ]
do
  DEVLETTER=${DEVS:$DEVINDEX:1}
  MAXHWSECKB=$(cat /sys/block/sd$DEVLETTER/queue/max_hw_sectors_kb)
  if [ $MAXHWSECKB -lt $MINMAXHWSECKB ]
  then
    MINMAXHWSECKB=$MAXHWSECKB
  fi
  DEVINDEX=$[$DEVINDEX+1]
done
if [ $CHUNKSIZE -le $MINMAXHWSECKB ] &&
  ( [ $CHUNKSIZE -le 512 ] || [[ $FORCECHUNKSIZE == "true" ]] )
then
  echo setting max sectors kb to match chunk size
  DEVINDEX=0
  until [ $DEVINDEX -ge $NUMDEVS ]
  do
    DEVLETTER=${DEVS:$DEVINDEX:1}
    echo $CHUNKSIZE > /sys/block/sd$DEVLETTER/queue/max_sectors_kb
    DEVINDEX=$[$DEVINDEX+1]
  done
  DEVINDEX=0
  until [ $DEVINDEX -ge $NUMSPAREDEVS ]
  do
    DEVLETTER=${SPAREDEVS:$DEVINDEX:1}
    echo $CHUNKSIZE > /sys/block/sd$DEVLETTER/queue/max_sectors_kb
    DEVINDEX=$[$DEVINDEX+1]
  done
fi
 
# enable/disable NCQ
DEVINDEX=0
if [[ $NCQ == "enable" ]] || [[ $NCQ == "disable" ]]
then
  if [[ $NCQ == "disable" ]]
  then
    NCQDEPTH=1
  fi
  echo setting NCQ queue depth to $NCQDEPTH
  until [ $DEVINDEX -ge $NUMDEVS ]
  do
    DEVLETTER=${DEVS:$DEVINDEX:1}
    echo $NCQDEPTH > /sys/block/sd$DEVLETTER/device/queue_depth
    DEVINDEX=$[$DEVINDEX+1]
  done
  DEVINDEX=0
  until [ $DEVINDEX -ge $NUMSPAREDEVS ]
  do
    DEVLETTER=${SPAREDEVS:$DEVINDEX:1}
    echo $NCQDEPTH > /sys/block/sd$DEVLETTER/device/queue_depth
    DEVINDEX=$[$DEVINDEX+1]
  done
fi
 
# tune2fs
if [[ $DOTUNEFS == "true" ]]
then
  STRIDE=$[$CHUNKSIZE/$BLOCKSIZE]
  STRWIDTH=$[$CHUNKSIZE/$BLOCKSIZE*($NUMDEVS-$NUMPARITY)]
  echo setting stride to $STRIDE blocks \(${CHUNKSIZE}kb\)
  echo setting stripe-width to $STRWIDTH blocks \($[$STRWIDTH*$BLOCKSIZE]kb\)
  tune2fs -E stride=$STRIDE,stripe-width=$STRWIDTH /dev/$FSDEV
fi
 
# exit
echo $(date): Success
exit 0
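
To use it (a sketch, assuming the script is saved as tuning.sh and the ## parameters ## section has been adjusted to the array), run it as root. Note that sysfs settings are not persistent across reboots, so the script has to be re-run after each boot (from /etc/rc.local for example):

root@debian:~# chmod +x tuning.sh
root@debian:~# ./tuning.sh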