Nov 27

Cleaning up /tmp directory on busy cPanel web hosting servers

Posted on Thursday, November 27, 2008 in cpanel, error 22, linux, mysql, mysql.sock, php, tmp

The /tmp directory is one of the most frequently accessed directories on a server. MySQL, PHP and other applications place their temporary files there, and when processes die those files are often left behind. By default, PHP uploads are written to /tmp until they complete, so if you have processes dying regularly you will end up with a full /tmp directory, which is hell.

Why? Because MySQL keeps its temporary files in /tmp, and if there is no space left in there, queries will start failing. Uploads from PHP or Perl are stored there until the upload is over, so once the directory is full they cannot complete. So far we have failing MySQL queries & PHP uploads that cannot finish: system administrator hell.

Easy fix, you might say? Just a simple rm -rf /tmp/* should take care of it? Nope. Try that and have fun fixing the sockets you deleted, specifically for applications that depend on the mysql.sock placed in your /tmp directory. Things just got worse. In case you actually did delete everything, just restart the services and the sockets should reappear. If they don’t, they are probably somewhere else and you have to create a symbolic link using ln; MySQL’s socket is usually located at /var/lib/mysql/mysql.sock.
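If the socket really does live elsewhere, re-creating the link could look roughly like this. It is only a minimal sketch: the function name restore_socket_link is made up for illustration, and the default paths are the usual locations mentioned above, which may differ on your server.

```shell
# Hypothetical helper: point /tmp/mysql.sock at the real MySQL socket.
# Both paths are assumptions; pass your own as arguments if they differ.
restore_socket_link() {
    local real_socket=${1:-/var/lib/mysql/mysql.sock}
    local link_path=${2:-/tmp/mysql.sock}

    # Refuse to create a dangling link if MySQL has not created its socket yet.
    if [ ! -e "$real_socket" ]; then
        echo "socket not found at $real_socket; restart MySQL first" >&2
        return 1
    fi

    # -s: symbolic link, -f: replace a stale link, -n: do not follow an existing link
    ln -sfn "$real_socket" "$link_path"
}
```

Calling restore_socket_link with no arguments uses the default paths; it does nothing but complain if the real socket is missing.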

The best way is either to have a script that cleans it up hourly, if you know what usually fills it up, or to run ls -alhS /tmp | head manually, see what is causing the problem and work out how to avoid it in the future. I have developed a script that I run on multiple servers with no problems at the moment. It takes care of most of the trash left behind on a cPanel server:

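#!/bin/bash

# Change directory to /tmp; abort if that fails so nothing is deleted elsewhere
cd /tmp || exit 1

# Clean up trash left by Gallery2 (names containing a digit followed by ".inc")
ls | grep '[0-9]\.inc' | xargs -r rm -fv

# Clean up PHP temp. session files (names starting with "sess_")
ls | grep '^sess_' | xargs -r rm -fv

# Clean up dead vBulletin uploads
ls | grep '^vbupload' | xargs -r rm -fv

# Clean up failed php uploads (names starting with "php", not any name containing "ph")
ls | grep '^php' | xargs -r rm -fv

# Clean up failed ImageMagick conversions.
ls | grep '^magick' | xargs -r rm -fv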

That usually is enough. My suggestion is to run it as a cron job every hour, but I’m not going into detail on how to do that, because if you don’t know how to set up a cron job, perhaps you shouldn’t be messing around in /tmp directories and deleting stuff in the first place!

Update: This script is faulty and will cause you a lot of problems with PHP sessions, please read more information and read the new one here
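One way to avoid deleting files that are still in use, such as live PHP sessions, is to remove only files that have not been modified for a while. The following is a rough sketch of that idea and not the updated script referred to above; the function name clean_stale_tmp and the default age of 1440 minutes are assumptions you should adjust (for sessions, pick something longer than your PHP session lifetime).

```shell
# Hypothetical helper: delete matching temp files older than a given age.
# $1 = directory (default /tmp), $2 = minimum age in minutes (default 1440).
clean_stale_tmp() {
    local dir=${1:-/tmp} max_age_min=${2:-1440}

    # -maxdepth 1: do not descend into subdirectories
    # -type f:     regular files only, so sockets such as mysql.sock are never touched
    # -mmin +N:    last modified more than N minutes ago
    find "$dir" -maxdepth 1 -type f \
        \( -name 'sess_*' -o -name 'php*' -o -name 'vbupload*' -o -name 'magick*' \) \
        -mmin +"$max_age_min" -print -delete
}
```

Running clean_stale_tmp /tmp 1440 prints each file it removes, which makes it easy to review from cron mail.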

Nov 24

AACRAID based controllers timing out / aborting / SCSI hang

Posted on Monday, November 24, 2008 in aacraid, adaptec, cpanel, grsecurity, kernel, linux, network, raid, scsi

Lately we’ve been using more Adaptec RAID controllers in place of 3ware RAID controllers. 3ware has been nothing but trouble for us: dropped hard drives, and even RAID5 arrays running slower than a regular hard drive with no RAID. Our latest issue was a server simply kernel panicking under high IO. Our experience with 3ware RAID controllers & Linux has been terrible.

On the other side, Adaptec has been great. We’ve been using them for a while now and have seen no problems at all. There is just one small catch: Linux usually has a SCSI subsystem timeout of 30 seconds or less, which is shorter than the controller timeout (35 seconds). This mismatch usually brings a server to a halt for a couple of seconds (minutes in some cases) until the server recovers, and errors like these are thrown on the console:

aacraid: Host adapter abort request (0,1,3,0)
aacraid: Host adapter abort request (0,1,1,0)
aacraid: Host adapter abort request (0,1,2,0)
aacraid: Host adapter abort request (0,1,1,0)
aacraid: Host adapter abort request (0,1,2,0)
aacraid: Host adapter reset request. SCSI hang ?

The fix that usually works best is to raise the Linux timeout above the controller’s 35 seconds (45 is used here) so that the Linux timeout does not fire before the RAID controller timeout. This is done per device / array:

echo '45' > /sys/block/sda/device/timeout
echo '45' > /sys/block/sdb/device/timeout
echo '45' > /sys/block/sdc/device/timeout
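With more than a handful of arrays, the same change can be applied in a loop. This is a small sketch under a few assumptions: the function name set_scsi_timeouts is made up, it only looks at devices named sd*, and the sysfs base directory is a parameter purely so the loop can be tried against a test directory. Values written to sysfs are lost on reboot, so something like this would have to run again at boot time.

```shell
# Hypothetical helper: write the same timeout to every sd* device.
# $1 = timeout in seconds (default 45), $2 = sysfs block directory (default /sys/block).
set_scsi_timeouts() {
    local timeout=${1:-45} sys_block=${2:-/sys/block} f

    for f in "$sys_block"/sd*/device/timeout; do
        # Skip devices without a writable timeout file (and the unexpanded glob).
        [ -w "$f" ] || continue
        echo "$timeout" > "$f"
    done
}
```

Run as root, set_scsi_timeouts 45 has the same effect as the individual echo commands above.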

This should be done for every device. 45 is a good number, but you can use whatever you’d like as long as it’s over 35. If you’re seeing loads go sky-high for no apparent reason, this might very well be the cause. To check, you can run the following:

dmesg | grep aacraid

If you see errors like the ones shown above, then I suggest using that small workaround. If you’re still facing these problems after applying it, here is the checklist that Adaptec suggests:

  • Check for any updated firmware for the motherboard, controller, targets and enclosure on the respective manufacturer’s web sites.
  • Check per-device queue depth in SYSFS to make sure it is reasonable.
  • Engage disk drive manufacturer’s technical support department to check through compatibility or drive class issues.
  • Engage enclosure manufacturer’s technical support department to check through compatibility issues.
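For the queue depth item in that checklist, the current values can be read straight from sysfs. A minimal sketch, assuming devices named sd* that expose device/queue_depth; the function name show_queue_depths is made up, and what counts as a reasonable value depends on your controller and drives.

```shell
# Hypothetical helper: print "device queue_depth" for every sd* device.
# $1 = sysfs block directory (default /sys/block).
show_queue_depths() {
    local sys_block=${1:-/sys/block} f dev

    for f in "$sys_block"/sd*/device/queue_depth; do
        [ -r "$f" ] || continue
        # Strip the base directory and everything after the device name.
        dev=${f#"$sys_block"/}
        dev=${dev%%/*}
        echo "$dev $(cat "$f")"
    done
}
```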

Anyhow, just like with every Linux issue, your mileage may vary, so if you know of any other fixes or have figured out another way to fix this, feel free to post it as a comment to help others.

Mar 23

Migrating LVM volumes over network (using snapshots)

Posted on Sunday, March 23, 2008 in lvm, migration, xen

We run a large number of Xen virtual servers spread over multiple nodes, and if you want to get the most out of Xen I would suggest LVM (Logical Volume Manager). It makes life a lot easier, especially for those who do not run a RAID setup (we run RAID10 on all VM nodes), as you can span a volume over multiple hard drives. I’m not going to cover setting up LVM as there are loads of tutorials on how to do that; instead I will cover the best way I have found to migrate an LVM volume.

First, we need to create a snapshot of the logical volume, as we cannot take a consistent image of the live version. We run the following line:
lvcreate -L20G -s -n storageLV_s /dev/vGroup/storageLV
The 20G part is the size of the snapshot. I would suggest looking up the size of the original LV and making the snapshot the same size. You can find the size of the LV with this command: lvdisplay /dev/vGroup/storageLV. There will be an “LV Size” field; take the value from there and put it in the command. The -n switch sets the name (I usually name snapshots after the LV with a trailing _s), and the last argument is simply the LV that we want to make a snapshot of.
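If you are doing this for many volumes, picking the size out of the lvdisplay output by eye gets old quickly. Here is a tiny sketch that pulls out the field; the function name lv_size_field is made up, and it assumes lvdisplay prints a line of the form "LV Size", then a number, then a unit, which you should verify on your LVM version.

```shell
# Hypothetical helper: read lvdisplay output on stdin and print the "LV Size" value.
lv_size_field() {
    # On the "LV Size" line the fields are: LV, Size, <number>, <unit>
    awk '/LV Size/ { print $3, $4; exit }'
}
```

It would be used as lvdisplay /dev/vGroup/storageLV | lv_size_field, and the result still has to be translated into the form that lvcreate -L expects (for example 20G).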

Next, we will use dd in a slightly different way. A single dd process alternates between reading and writing, which makes it crawl. To get around this, we read the LV with one dd and pipe it to a second dd that writes, so reading and writing overlap and the overall speed is limited only by the slower of the two drives. To speed it up a bit more, we use a block size of 64K.
dd if=/dev/vGroup/storageLV_s conv=noerror,sync bs=64k | dd of=/migrate/storageLV_s.dd bs=64k

I won’t cover the file transfer itself as there are multiple methods. If you want to use SCP, I would suggest disabling encryption if you can, as it really slows the transfer down. Our nodes usually have httpd installed, so I simply changed the configuration to listen on a different port (for security) and changed the DocumentRoot to /migrate.

Once you have the file on the target server, you’ll need to re-create the LV there by running this:
lvcreate -L20G -n storageLV vGroup
Keep the same size, use the same name (this time without the trailing _s as it won’t be a snapshot) and put the volume group at the end.

The last step is to actually restore the image using dd, again using our block-size & pipe tweak for better performance.
dd if=/migrate/storageLV_s.dd conv=noerror,sync bs=64k | dd of=/dev/vGroup/storageLV bs=64k
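The piped dd trick used in both directions can be wrapped up and tried on ordinary files before you point it at a real volume. A minimal sketch, with a made-up function name piped_copy. One thing to be aware of: conv=sync pads a short final block with zeros, so a source whose size is not a multiple of 64K produces a slightly larger copy.

```shell
# Hypothetical helper: copy $1 to $2 using one dd to read and a second dd to write.
piped_copy() {
    local src=$1 dst=$2

    # The reader keeps going past read errors (noerror) and pads short blocks (sync);
    # the writer simply writes whatever arrives on the pipe.
    dd if="$src" bs=64k conv=noerror,sync 2>/dev/null | dd of="$dst" bs=64k 2>/dev/null
}
```

After a test copy, cmp source copy is a quick way to confirm that the two are identical.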

I have migrated around 16 LVs with this method without any problems: 13 of them were 20G each, 2 were 40G and 1 was 75G. Every part is fast, though I have to admit that the slowest part was the file transfer. I would suggest using a Gbit crossover cable, or even better a Gbit switch if you have one. If you don’t but you’re right next to the server, you might consider using a spare USB 2.0 HDD, as it is much faster than 100Mbps Ethernet (USB 2.0 is nominally 480Mbps).

Mar 23

Almost dead blog, but that’s not good.

Posted on Sunday, March 23, 2008 in linux, network

Oh man, I wish I got to post here more often, then I’d probably have a lot fewer problems. These days, the simplest of tasks can become a big pain in the neck. I was re-racking some servers to place them in the correct order and I noticed one of them was connected on Port 2 instead of Port 1 (Ethernet). I’m a big fan of consistency.

So I decided to change it. I got the monitor & keyboard and started messing around. It sounded pretty simple: we have a service called ifaliases that activates all the other IPs by creating aliases (duh!). I edited its file to make the main interface eth0 instead of eth1, unplugged from Port 2, plugged into Port 1, waited a couple of minutes, …, still unpingable, ugh.

I tried to ping other servers in the same subnet and was able to reach them. I checked the route table and it turned out there was no default route. I created one by hand and the network started to work, but the problem is that every time the interface is restarted the route would need to be recreated, blah blah, it’s a dumb idea.

I looked everywhere until I found the magical file: /etc/sysconfig/network — The contents of it look something like this:

NETWORKING=yes
GATEWAY=the.gw.ip.here
GATEWAYDEV=eth1
HOSTNAME=host.domain.com
DOMAINNAME=domain.com

It was pretty simple from there, change the GATEWAYDEV to the new port and voilà. So the next time you are trying to change ports or the default route won’t stick, I hope I helped!
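For the record, the edit itself can be scripted. This is a small sketch assuming a file laid out like the one above; the function name set_gatewaydev is made up, and it keeps a .bak copy because a typo in this file can take the server off the network.

```shell
# Hypothetical helper: set GATEWAYDEV in a sysconfig-style network file.
# $1 = new interface name, $2 = file to edit (default /etc/sysconfig/network).
set_gatewaydev() {
    local new_dev=$1 conf=${2:-/etc/sysconfig/network}

    # Keep a backup copy next to the original before touching it.
    cp -p "$conf" "$conf.bak" || return 1

    # Replace the whole GATEWAYDEV line and leave every other line alone.
    sed -i "s/^GATEWAYDEV=.*/GATEWAYDEV=$new_dev/" "$conf"
}
```

After running set_gatewaydev eth0, the network service still has to be restarted for the default route to be recreated on the new interface.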

Feb 23

Helpful cPanel included application

Posted on Saturday, February 23, 2008 in cpanel, linux

cPanel may have some very annoying bugs sometimes, but there are very useful bits of it that can help with general system administration. On a very busy server, terminating an account that has high disk usage will make the load averages go sky high. Thanks to a neat little application provided with cPanel, you can forget about freaking out over high server load. I have personally tried multiple solutions (including nice), but the loads would still go high and the server would become unusable.

With every cPanel installation there is a binary located at /usr/local/cpanel/bin/cpuwatch. cpuwatch executes a command and monitors the load: if the load goes past the set limit, it stops the application and resumes it once the load averages have been below the threshold for a few seconds. The usage is very simple:

cpuwatch maxload (-p PID | command [arguments])
maxload : system load at which throttling should commence
command : command to run under cpuwatch
-p PID : monitor and throttle the existing process PID

Another neat feature is that it can either fork a new process or attach itself to a running one. Here is an example of deleting an account over SSH with the load average threshold set to 4.0:

/usr/local/cpanel/bin/cpuwatch 4.0 /scripts/killacct username

The load average may still go past 4, but the process will be paused whenever it does. If you already have a process running and do not want to restart it, you can run the following command to attach cpuwatch to it. In this case, the process ID of my process is 18274.

/usr/local/cpanel/bin/cpuwatch 4.0 -p 18274

It’s a very simple but very neat utility that has saved me a few times when I had to do major file operations and did not want the server to have high load averages. The same binary is also used when cPanel processes its logs and when the cPanel backups are running.
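To give an idea of the stop-and-resume behaviour described above, here is a rough shell sketch of the same idea. It is not how cpuwatch itself is implemented; the function names load_exceeds and throttle_pid, the 5 second polling interval and the use of the 1-minute load average are all my own assumptions.

```shell
# Hypothetical helper: succeed when the 1-minute load average is above the threshold.
# $1 = threshold, $2 = loadavg file (default /proc/loadavg).
load_exceeds() {
    local threshold=$1 loadavg_file=${2:-/proc/loadavg}
    awk -v max="$threshold" 'NR == 1 { over = ($1 > max) } END { exit !over }' "$loadavg_file"
}

# Hypothetical helper: pause and resume an existing process based on load.
# $1 = PID, $2 = threshold, $3 = polling interval in seconds (default 5).
throttle_pid() {
    local pid=$1 threshold=$2 interval=${3:-5}

    # Loop for as long as the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        if load_exceeds "$threshold"; then
            kill -STOP "$pid"    # pause while the load is too high
        else
            kill -CONT "$pid"    # resume once the load has dropped
        fi
        sleep "$interval"
    done
}
```

If the loop is interrupted while the process is paused, the process stays paused until someone sends it SIGCONT, so the real utility is the safer choice where it is available.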

Feb 23

Fantastico, PHPList and blank pages

Posted on Saturday, February 23, 2008 in cpanel, fantastico, linux

I’ve lately had a few clients complaining about their PHPList installs not working properly when installed through Fantastico. I tried it out myself and it does indeed seem to be somewhat broken. Usually when you go to the administration page, it simply shows an empty page, and if you view the source for that page, it contains something along these lines:

The cause of this problem usually seems to be a default Fantastico setting that is not properly set. To fix this, you will need to edit your /config/config.php and replace the following:

define("PLUGIN_ROOTDIR","/tmp");

With the following, of course replacing username by your cPanel username and making sure that the path for your account is correct (sometimes, it’s home2 instead of home).

define("PLUGIN_ROOTDIR","/home/username/tmp");

After a few days of research on this and a few angry clients, it’s all figured out, so why not share it with other system administrators? It’ll save them some trouble!

Feb 23

cPanel backup clean-up

Posted on Saturday, February 23, 2008 in backup, cpanel, linux

One of the major pluses of cPanel is that it doesn’t delete an account’s backup when the account is terminated, nor does it remove it on the next backup run. While that can be a good thing if the customer comes back, in the long term the backup drive slowly runs out of space and eventually the big accounts that have been stored for a while need to be removed. I had to clean up the backups on one of the servers and I was not about to go through them account by account, so I used my basic bash skills to whip up a bit of code that helps with the process!

The code isn’t long at all and it isn’t anything genius. First I generate a list of the accounts that are in the backup directory. We only run weekly backups and our backup drive is mounted at /backup, so here is the command that we used first:

ls /backup/cpbackup/weekly/ > accts.backup

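Now, a small filter that strips out every user that still has a home directory, leaving only the users that are in the backup directory but no longer exist on the server (or do not have the same username). It matches whole names only, so removing “bob” does not mangle “bobby”.

# Keep only the backup entries that have no matching directory in /home
ls /home/ | grep -vxFf - accts.backup > accts.tmp
mv accts.tmp accts.backup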

There were a lot of accounts to be removed, as this server’s backups had been left hanging for a while, and it would be very hard to find the accounts with high disk usage by hand. So I made the following script that takes the size of each account and puts it in a file.

for i in `cat accts.backup`; do
du -s /backup/cpbackup/weekly/$i
done > accts.size

After that, we make our life a lot easier by sorting the results by size. One warning, though: the “dirs” directory might appear as the biggest one, but it usually contains MySQL files and configuration files which are very important, so you should simply ignore it.

sort -n accts.size > accts.sorted

Now you have a file listing all the accounts that no longer exist on the server but still have backups, sorted by size. You can just go and clean them up as you go. Until then, I go back to erasing these useless space hogs!
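The three steps can also be folded into one function that needs no intermediate files. This is only a sketch under the same assumptions as above: backups are stored as one entry per username and live accounts have a directory in /home. The function name orphaned_backups is made up, and it skips the “dirs” directory so that it never shows up as a deletion candidate.

```shell
# Hypothetical helper: list backups with no matching home directory, smallest first.
# $1 = backup directory, $2 = home directory.
orphaned_backups() {
    local backup_dir=$1 home_dir=$2 entry name

    for entry in "$backup_dir"/*; do
        [ -e "$entry" ] || continue                # empty backup directory
        name=$(basename "$entry")
        [ "$name" = "dirs" ] && continue           # system files, never a candidate
        [ -d "$home_dir/$name" ] && continue       # account still exists
        du -s "$entry"
    done | sort -n
}
```

Something like orphaned_backups /backup/cpbackup/weekly /home prints one line per leftover backup with its size as reported by du -s, with the biggest at the bottom.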

Feb 18

vmsplice, belated.

Posted on Monday, February 18, 2008 in 0-day, grsecurity, kernel, linux, recompiling, vmsplice

I’m sure everyone who has a server with any sort of remote user access has heard of this 0-day exploit (not so zero-day now, is it?). Sadly, I was one of those who heard of it after a few servers kernel panicked. I had no clue what the cause was, but while going through the logs I noticed that LFD (props to ConfigServer for that useful software) had reported a suspicious process. I decided to look at what exactly it was and found an x.c file. A simple cat showed that it was copied from milw0rm; I visited the related milw0rm page and discovered that I was screwed. Thankfully, all of our servers ran CentOS 5, so all that happened was a kernel panic instead of being rooted.

After reading a bit about it, I prepared for a kernel compile. I had read about grsecurity and how commonly it is used, so I decided to add it to my kernel along with the patch for the vmsplice exploit. Sorry 64-bit folks, I only did it for i386/i686 even though all our hardware is 64-bit, but I won’t start ranting about the software we use and how terrible its 64-bit support is. I have uploaded the kernel with grsecurity added and the patch for vmsplice here: linux-2.6.22.9-grsec-i686.tar.gz

I must admit, all of our servers are pretty fast and powerful. Our lowest spec server has 2x Intel Dual-Core 2.33GHz Woodcrest CPUs, 4GB of RAM and 2TB of total storage. Even with these specifications, a kernel compile took around 15-20 minutes. That may be fast compared to other servers, and thankfully so, as I had to compile twice: when Linux says 4GB memory limit it really means about 2.5GB, so I had to set the high memory limit to 64GB for the server to see all of its memory.

The instructions look complicated at the beginning but are very straightforward in the end. We’ll download the kernel first, then either use the .config file that I used or your current kernel’s .config file, then run a few make commands. I’ll also show how you can set up grub to fall back to your stable kernel if the new one crashes or something doesn’t work out.

mkdir /usr/src/linux
cd /usr/src/linux
wget http://files.momotonic.com/linux-2.6.22.9-grsec-i686.tar.gz
tar -xvzf linux-2.6.22.9-grsec-i686.tar.gz

The extraction should take a good few minutes as there are a lot of files to make out of all these 1’s and 0’s. Once the files are extracted, those who want to use my configuration file can just stick with it, but if you have more than 2-3GB of RAM, I’d suggest running make menuconfig (from inside the linux-2.6.22.9 directory). Use your arrow keys to go down to “Processor type and features --->” and press Enter, go to “High Memory Support (4GB) --->” and press Enter, then select “( ) 64GB” and press Enter. Use your right arrow to move to Exit in the bottom menu and do that twice until it asks you to save the new kernel configuration; pick Yes (duh!).

If you would rather keep using your current configuration, copy it with this command: cp /boot/config-`uname -r` .config (of course, run that command in the linux-2.6.22.9 directory). You can modify a few other options, and then we move on to the longest step, compiling!

make
make modules_install
make install
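Since a wrong high-memory setting cost me a second compile, a quick look at .config before running make may save you the same trouble. A minimal sketch, assuming a 32-bit x86 kernel configuration where the choice is stored as CONFIG_NOHIGHMEM, CONFIG_HIGHMEM4G or CONFIG_HIGHMEM64G; the function name highmem_setting is made up.

```shell
# Hypothetical helper: print which high-memory option is enabled in a kernel config.
# $1 = config file (default .config in the current directory).
highmem_setting() {
    local config=${1:-.config}

    # Only enabled options match; "# CONFIG_... is not set" lines are ignored.
    grep -E '^CONFIG_(NOHIGHMEM|HIGHMEM4G|HIGHMEM64G)=y' "$config"
}
```

On a server with 4GB of RAM or more you would want this to print CONFIG_HIGHMEM64G=y before you start compiling.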

Now, I will only cover how to do a fail-safe reboot, so that if the new kernel does not work out, the server automatically switches back to the old kernel and everything is like before. I have never worked with lilo and I’m not sure whether it can do this, so I will only cover grub. First, we’ll modify our /boot/grub/grub.conf file and change the existing options to the following:

default=saved
fallback=1

Then we’ll need to modify the entry for the new kernel and add savedefault fallback to it, so it looks something like the following:

title CentOS (2.6.22.9-grsec)
root (hd0,0)
kernel /boot/vmlinuz-2.6.22.9-grsec ro root=LABEL=/
initrd /boot/initrd-2.6.22.9-grsec.img
savedefault fallback

Now we’ll tell grub to boot the new kernel as the default for the next boot only. In this setup, if the system does not come up in a reasonable time, you can call your data center or use whatever method you have to reboot the server, and it will boot your previous kernel that works fine. Then you can take a look at your logs (mostly /var/log/messages) to find out why it didn’t boot. We need to run this command before rebooting:

echo "savedefault --default=0 --once" | grub --batch

Now you can safely reboot your server. I won’t even mention how to do that, because if you do not know how, I don’t think you should really be messing with kernels! Once you have rebooted, you can try re-running the exploit from milw0rm, but I won’t go into how to do that as evil minds would go and start using it.

Until then, I’m back to recompiling the kernels with 64GB memory limit because 4GB limit is actually 2.5GB!

Feb 17

Welcome to my blog!

Posted on Sunday, February 17, 2008 in blog, welcome

Thank you for visiting my site.

I am currently a network and system administrator, and everyone knows that you can’t live a day in that job without having fun with a server or a network: somewhere, sometime, and most of the time at the wrong time.

I’d always lose track of what I had done previously to fix a problem, so I decided to whip up this blog quickly to keep that information for my future self and for others!

Enjoy reading!