Tuning HDD
mdadm
sysfs
https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt
Let's take the disk sda as an example.
scheduler
root@debian:~# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
It can also be passed to the kernel as a boot parameter, e.g. elevator=noop (see the example after the list below).
Warning: CFQ is the only scheduler that takes ionice into account!
- CFQ [cfq] (Completely Fair Queuing) is an I/O scheduler for the Linux kernel and the default under many Linux distributions.
- Noop scheduler (noop) is the simplest I/O scheduler for the Linux kernel, based on a FIFO queue.
- Anticipatory scheduler (anticipatory) is an older algorithm for scheduling hard disk I/O; it has since been replaced by CFQ.
- Deadline scheduler (deadline) attempts to guarantee a start service time for each request.
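For example, the active scheduler can be switched at runtime (a minimal sketch, still assuming the disk sda; the change is not persistent across reboots):
root@debian:~# echo deadline > /sys/block/sda/queue/scheduler
root@debian:~# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
root@debian:~# ionice -c3 -p 1234    # only honored while cfq is active (see warning above); 1234 is a hypothetical PID
To make the choice persistent on Debian, elevator=deadline can be added to GRUB_CMDLINE_LINUX in /etc/default/grub, followed by update-grub.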
nr_requests
This represents the queue size, i.e. the number of I/Os the kernel is allowed to handle simultaneously.
Note: to see how full this queue gets, look at the avgqu-sz (average queue length) column in the output of iostat -x:
Device:   rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda         0,00    0,00   0,00   0,00   0,00   0,00     0,00     0,00   0,00    0,00    0,00   0,00   0,00
Warning: if the queue is full, the total memory consumed by the pending requests is 2 * nr_requests * max_sectors_kb, so keep this limit reasonable to avoid out-of-memory errors.
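For example (still assuming sda; 128 is the usual default):
root@debian:~# cat /sys/block/sda/queue/nr_requests
128
root@debian:~# echo 256 > /sys/block/sda/queue/nr_requests    # allow a deeper request queue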
max_sectors_kb
This is the maximum size of a single I/O, in kilobytes.
This is a very important parameter! If the server is connected to a storage array (a SAN, for example), you need to know the maximum size the array accepts. Too small and you penalize the server's performance; too large and you penalize the array, which has to split the I/Os into smaller ones, loading its CPU and dragging down overall performance! It is therefore important to check the best practices for the array in question…
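For example (a sketch assuming sda and a 512 KB maximum taken from the array vendor's documentation; replace 512 with the value actually recommended for your array):
root@debian:~# cat /sys/block/sda/queue/max_hw_sectors_kb     # hardware/driver upper bound for the value below
root@debian:~# echo 512 > /sys/block/sda/queue/max_sectors_kb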
read_ahead_kb
http://wiki.deimos.fr/Optimiser_les_performances_des_disques_dur_sur_Linux
This is read-ahead: the kernel reads more blocks than it was asked for, because after reading one block it is common to need the next one. This only makes sense for applications that read data sequentially! It is of no use for random access.
It has the advantage of improving response times and the speed of sequential reads.
The algorithm is designed to stop by itself if it detects too much random access, so as not to degrade performance.
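For example (assuming sda; 1024 KB is only an illustrative value):
root@debian:~# cat /sys/block/sda/queue/read_ahead_kb         # often 128 KB by default
root@debian:~# echo 1024 > /sys/block/sda/queue/read_ahead_kb
root@debian:~# blockdev --setra 2048 /dev/sda                 # same setting expressed in 512-byte sectors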
rotational
Indicates whether the device is an SSD (0) or a classic rotational disk (1).
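To check the flag, or override it if the device is misdetected, still assuming sda:
root@debian:~# cat /sys/block/sda/queue/rotational    # 1 = rotational disk, 0 = SSD
root@debian:~# echo 0 > /sys/block/sda/queue/rotational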
tuning cfq
http://fr.wikipedia.org/wiki/Completely_Fair_Queuing
- back_seek_max : Backward seeks are typically bad for performance, as they can incur greater delays in repositioning the heads than forward seeks do. However, CFQ will still perform them, if they are small enough. This tunable controls the maximum distance in KB the I/O scheduler will allow backward seeks. The default is 16 KB.
- back_seek_penalty : Because of the inefficiency of backward seeks, a penalty is associated with each one. The penalty is a multiplier; for example, consider a disk head position at 1024KB. Assume there are two requests in the queue, one at 1008KB and another at 1040KB. The two requests are equidistant from the current head position. However, after applying the back seek penalty (default: 2), the request at the later position on disk is now twice as close as the earlier request. Thus, the head will move forward.
- fifo_expire_async : This tunable controls how long an async (buffered write) request can go unserviced. After the expiration time (in milliseconds), a single starved async request will be moved to the dispatch list. The default is 250 ms.
- fifo_expire_sync : This is the same as the fifo_expire_async tunable, but for synchronous (read and O_DIRECT write) requests. The default is 125 ms.
- group_idle : When set, CFQ will idle on the last process issuing I/O in a cgroup. This should be set to 1 when using proportional weight I/O cgroups and setting slice_idle to 0 (typically done on fast storage).
- group_isolation : If group isolation is enabled (set to 1), it provides a stronger isolation between groups at the expense of throughput. Generally speaking, if group isolation is disabled, fairness is provided for sequential workloads only. Enabling group isolation provides fairness for both sequential and random workloads. The default value is 0 (disabled). Refer to Documentation/cgroups/blkio-controller.txt for further information.
- low_latency : When low latency is enabled (set to 1), CFQ attempts to provide a maximum wait time of 300 ms for each process issuing I/O on a device. This favors fairness over throughput. Disabling low latency (setting it to 0) ignores target latency, allowing each process in the system to get a full time slice. Low latency is enabled by default.
- quantum : The quantum controls the number of I/Os that CFQ will send to the storage at a time, essentially limiting the device queue depth. By default, this is set to 8. The storage may support much deeper queue depths, but increasing quantum will also have a negative impact on latency, especially in the presence of large sequential write workloads.
- slice_async : This tunable controls the time slice allotted to each process issuing asynchronous (buffered write) I/O. By default it is set to 40 ms.
- slice_idle : This specifies how long CFQ should idle while waiting for further requests. The default value in Red Hat Enterprise Linux 6.1 and earlier is 8 ms. In Red Hat Enterprise Linux 6.2 and later, the default value is 0. The zero value improves the throughput of external RAID storage by removing all idling at the queue and service tree level. However, a zero value can degrade throughput on internal non-RAID storage, because it increases the overall number of seeks. For non-RAID storage, we recommend a slice_idle value that is greater than 0.
- slice_sync : This tunable dictates the time slice allotted to a process issuing synchronous (read or direct write) I/O. The default is 100 ms.
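These tunables are exposed under /sys/block/<device>/queue/iosched/ while cfq is the active scheduler. A sketch with illustrative values, assuming sda:
root@debian:~# cat /sys/block/sda/queue/iosched/slice_idle
root@debian:~# echo 0 > /sys/block/sda/queue/iosched/slice_idle    # e.g. for external RAID storage, as noted above
root@debian:~# echo 1 > /sys/block/sda/queue/iosched/low_latency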
tuning deadline
http://fr.wikipedia.org/wiki/Deadline_scheduler
- fifo_batch : This determines the number of reads or writes to issue in a single batch. The default is 16. Setting this to a higher value may result in better throughput, but will also increase latency.
- front_merges : You can set this tunable to 0 if you know your workload will never generate front merges. Unless you have measured the overhead of this check, it is advisable to leave it at its default setting (1).
- read_expire : This tunable allows you to set the number of milliseconds in which a read request should be serviced. By default, this is set to 500 ms (half a second).
- write_expire : This tunable allows you to set the number of milliseconds in which a write request should be serviced. By default, this is set to 5000 ms (five seconds).
- writes_starved : This tunable controls how many read batches can be processed before processing a single write batch. The higher this is set, the more preference is given to reads.
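The same path applies to deadline (a sketch with illustrative values, assuming sda currently uses the deadline scheduler):
root@debian:~# echo 32 > /sys/block/sda/queue/iosched/fifo_batch      # larger batches: more throughput, more latency
root@debian:~# echo 250 > /sys/block/sda/queue/iosched/read_expire    # serve reads within 250 ms instead of 500 ms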
Script
Source: http://pastebin.com/qajPwY6K#
- tuning.sh
#!/bin/bash
###############################################################################
# simple script to set some parameters to increase performance on an mdadm
# raid5 or raid6. Adjust the ## parameters ##-section to your system!
#
# WARNING: depending on stripesize and the number of devices the array might
# use QUITE a lot of memory after optimization!
#
# 27may2010 by Alexander Peganz
# 09oct2010 by Rafael Fonseca -- added option to tune2fs different drive
###############################################################################

## parameters ##
MDDEV=md0               # e.g. md51 for /dev/md51
#FSDEV=$MDDEV           # same as above for filesystem on device
FSDEV=lvm-raid/lvm0     # used for LVM on RAID setups - e.g. /dev/lvm-raid/lvm
CHUNKSIZE=1024          # in kb
BLOCKSIZE=4             # of file system in kb
NCQ=enable              # disable, enable. anything else keeps current setting
NCQDEPTH=31             # 31 should work for almost anyone
FORCECHUNKSIZE=true     # force max sectors kb to chunk size > 512
DOTUNEFS=true           # run tune2fs, ONLY SET TO true IF YOU USE EXT[34]
RAIDLEVEL=raid5         # raid5, raid6

## code ##
# test for privileges
if [ "$(whoami)" != 'root' ]
then
  echo $(date): Need to be root
  exit 1
fi

# set number of parity devices
NUMPARITY=1
if [[ $RAIDLEVEL == "raid6" ]]
then
  NUMPARITY=2
fi

# get all devices
DEVSTR="`grep \"^$MDDEV : \" /proc/mdstat` eol"
while [ -z "`expr match \"$DEVSTR\" '\(\<sd[a-z]\[[12]\?[0-9]\]\((S)\)\? \)'`" ]
do
  DEVSTR="`echo $DEVSTR|cut -f 2- -d \ `"
done

# get active devices list and spares list
DEVS=""
SPAREDEVS=""
while [ "$DEVSTR" != "eol" ]; do
  CURDEV="`echo $DEVSTR|cut -f -1 -d \ `"
  if [ -n "`expr match \"$CURDEV\" '\(\<sd[a-z]\[[12]\?[0-9]\]\((S)\)\)'`" ]
  then
    SPAREDEVS="$SPAREDEVS${CURDEV:2:1}"
  elif [ -n "`expr match \"$CURDEV\" '\(\<sd[a-z]\[[12]\?[0-9]\]\)'`" ]
  then
    DEVS="$DEVS${CURDEV:2:1}"
  fi
  DEVSTR="`echo $DEVSTR|cut -f 2- -d \ `"
done
NUMDEVS=${#DEVS}
NUMSPAREDEVS=${#SPAREDEVS}

# test if number of devices makes sense
if [ ${#DEVS} -lt $[1+$NUMPARITY] ]
then
  echo $(date): Need more devices
  exit 1
fi

# set read ahead
RASIZE=$[$NUMDEVS*($NUMDEVS-$NUMPARITY)*2*$CHUNKSIZE]   # in 512b blocks
echo read ahead size per device: $RASIZE blocks \($[$RASIZE/2]kb\)
MDRASIZE=$[$RASIZE*$NUMDEVS]
echo read ahead size of array: $MDRASIZE blocks \($[$MDRASIZE/2]kb\)
blockdev --setra $RASIZE /dev/sd[$DEVS]
if [ $NUMSPAREDEVS -gt 0 ]
then
  blockdev --setra $RASIZE /dev/sd[$SPAREDEVS]
fi
blockdev --setra $MDRASIZE /dev/$MDDEV

# set stripe cache size
STRCACHESIZE=$[$RASIZE/8]   # in pages per device
echo stripe cache size of devices: $STRCACHESIZE pages \($[$STRCACHESIZE*4]kb\)
echo $STRCACHESIZE > /sys/block/$MDDEV/md/stripe_cache_size

# set max sectors kb
DEVINDEX=0
MINMAXHWSECKB=$(cat /sys/block/sd${DEVS:0:1}/queue/max_hw_sectors_kb)
until [ $DEVINDEX -ge $NUMDEVS ]
do
  DEVLETTER=${DEVS:$DEVINDEX:1}
  MAXHWSECKB=$(cat /sys/block/sd$DEVLETTER/queue/max_hw_sectors_kb)
  if [ $MAXHWSECKB -lt $MINMAXHWSECKB ]
  then
    MINMAXHWSECKB=$MAXHWSECKB
  fi
  DEVINDEX=$[$DEVINDEX+1]
done
if [ $CHUNKSIZE -le $MINMAXHWSECKB ] && ( [ $CHUNKSIZE -le 512 ] || [[ $FORCECHUNKSIZE == "true" ]] )
then
  echo setting max sectors kb to match chunk size
  DEVINDEX=0
  until [ $DEVINDEX -ge $NUMDEVS ]
  do
    DEVLETTER=${DEVS:$DEVINDEX:1}
    echo $CHUNKSIZE > /sys/block/sd$DEVLETTER/queue/max_sectors_kb
    DEVINDEX=$[$DEVINDEX+1]
  done
  DEVINDEX=0
  until [ $DEVINDEX -ge $NUMSPAREDEVS ]
  do
    DEVLETTER=${SPAREDEVS:$DEVINDEX:1}
    echo $CHUNKSIZE > /sys/block/sd$DEVLETTER/queue/max_sectors_kb
    DEVINDEX=$[$DEVINDEX+1]
  done
fi

# enable/disable NCQ
DEVINDEX=0
if [[ $NCQ == "enable" ]] || [[ $NCQ == "disable" ]]
then
  if [[ $NCQ == "disable" ]]
  then
    NCQDEPTH=1
  fi
  echo setting NCQ queue depth to $NCQDEPTH
  until [ $DEVINDEX -ge $NUMDEVS ]
  do
    DEVLETTER=${DEVS:$DEVINDEX:1}
    echo $NCQDEPTH > /sys/block/sd$DEVLETTER/device/queue_depth
    DEVINDEX=$[$DEVINDEX+1]
  done
  DEVINDEX=0
  until [ $DEVINDEX -ge $NUMSPAREDEVS ]
  do
    DEVLETTER=${SPAREDEVS:$DEVINDEX:1}
    echo $NCQDEPTH > /sys/block/sd$DEVLETTER/device/queue_depth
    DEVINDEX=$[$DEVINDEX+1]
  done
fi

# tune2fs
if [[ $DOTUNEFS == "true" ]]
then
  STRIDE=$[$CHUNKSIZE/$BLOCKSIZE]
  STRWIDTH=$[$CHUNKSIZE/$BLOCKSIZE*($NUMDEVS-$NUMPARITY)]
  echo setting stride to $STRIDE blocks \(${CHUNKSIZE}kb\)
  echo setting stripe-width to $STRWIDTH blocks \($[$STRWIDTH*$BLOCKSIZE]kb\)
  tune2fs -E stride=$STRIDE,stripe-width=$STRWIDTH /dev/$FSDEV
fi

# exit
echo $(date): Success
exit 0
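Typical usage, assuming the script above was saved as tuning.sh and the ## parameters ## section was adjusted first:
chmod +x tuning.sh
sudo ./tuning.sh    # the script itself checks that it runs as root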