Building a home media server with ZFS and a gaming virtual machine

Work has kept me busy lately, so it's been a while since my last post... I have been doing lots of research and collecting lots of information over the holiday break, and I'm happy to say that in the coming days I will be posting a new server setup guide, this time for a server capable of running redundant storage (ZFS RAIDZ2), sharing home media (Plex Media Server, SMB, AFP), and powering a full Windows 7 gaming rig, all simultaneously!

Windows runs in a virtual machine and is assigned its own real graphics card from the host's hardware using the brand-new VFIO PCI passthrough technique with the VGA quirks enabled. This does require a motherboard and CPU with support for IOMMU, more commonly known as VT-d or AMD-Vi.
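
Before attempting any passthrough it's worth confirming that the kernel actually has a working IOMMU. Here's a minimal sketch of one way to check (assuming a VFIO-capable kernel booted with intel_iommu=on or amd_iommu=on; on older kernels /sys/kernel/iommu_groups does not exist at all):

#!/usr/bin/env python
# List IOMMU groups and the PCI devices in each; empty output means the
# IOMMU is disabled (BIOS setting or missing kernel boot parameter).
import os

GROUPS = "/sys/kernel/iommu_groups"

if not os.path.isdir(GROUPS) or not os.listdir(GROUPS):
    print("No IOMMU groups found -- check BIOS and kernel parameters")
else:
    for group in sorted(os.listdir(GROUPS), key=int):
        devices = os.listdir(os.path.join(GROUPS, group, "devices"))
        print("IOMMU group %s: %s" % (group, ", ".join(sorted(devices))))

The graphics card (along with anything else sharing its IOMMU group) has to be bound to vfio-pci instead of the host driver before it can be handed to the guest.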

Avoiding kernel crashes when under DDoS attacks with CentOS 5

iWeb Technologies has recently been the victim of several [1 2 3 4] distributed denial of service (DDoS) attacks over the past three or so weeks, and it's become a rather irritating issue for iWeb's customers. Not all of their servers or their customers' co-located servers are affected in each attack, but often during the attacks there is a general slowdown on their network and a minority of the servers (both hosted and co-located) experience some packet loss.

The server hosting this website was, unfortunately, one of the servers that went completely offline during the DDoS attack on October 14th. iWeb has been a great host so far, and their staff looked into the issue immediately after I submitted a support ticket; within 30 minutes I had a KVM/IP attached to my server.

After an hour or so iWeb had the DDoS attacks under control, but my server's kernel would panic within roughly 10 minutes of the network coming up. The time between kernel panics was inconsistent, but the crash would happen every time the network went up, given a bit of time. I found this message repeated in my server logs each time before a hang:

ipt_hook: happy cracking.

That led me to this post from November 2003 on the Linux Kernel Mailing List (LKML) about the same message. I jumped on irc.freenode.net and joined #netfilter to ask what they thought about the message. While I waited for a response from #netfilter, I looked into installing kexec in order to get a backtrace, since the kernel oops messages were not being logged (this tutorial proved especially handy).

The backtrace revealed that the problem was indeed in the netfilter/iptables kernel modules, but seemed to be triggered by QEMU-KVM:

Process qemu-kvm (pid: 3911, threadinfo ffff8101fc74c000, task ffff8101fe4ba100)
Stack:  0001d7000001d600 0001d9000001d800 0001db000001da00 0001dd000001dc00
0001df000001de00 0001e1000001e000 0001e3000001e200 0001e5000001e400
0001e7000001e600 0001e9000001e800 ffff81020001ea00 ffff8101fe5d8bc0
Call Trace:
<IRQ>  [<ffffffff80236459>] dev_hard_start_xmit+0x1b7/0x28a
[<ffffffff88665bab>] :ip_tables:ipt_do_table+0x295/0x2fa
[<ffffffff886c7b2c>] :bridge:br_nf_post_routing+0x17c/0x197
[<ffffffff80034077>] nf_iterate+0x41/0x7d
[<ffffffff886c7816>] :bridge:br_nf_local_out_finish+0x0/0x9b
[<ffffffff800565d5>] nf_hook_slow+0x58/0xbc
[<ffffffff886c7816>] :bridge:br_nf_local_out_finish+0x0/0x9b
[<ffffffff8000f470>] __alloc_pages+0x78/0x308
[<ffffffff886c85f9>] :bridge:br_nf_local_out+0x23f/0x25e
[<ffffffff80034077>] nf_iterate+0x41/0x7d
[<ffffffff886c3192>] :bridge:br_forward_finish+0x0/0x51
[<ffffffff800565d5>] nf_hook_slow+0x58/0xbc
[<ffffffff886c3192>] :bridge:br_forward_finish+0x0/0x51
[<ffffffff8025042d>] rt_intern_hash+0x474/0x4a0
[<ffffffff886c3367>] :bridge:__br_deliver+0xb4/0xfc
[<ffffffff886c2294>] :bridge:br_dev_xmit+0xc7/0xdb
[<ffffffff80236459>] dev_hard_start_xmit+0x1b7/0x28a
[<ffffffff8002f76f>] dev_queue_xmit+0x1f3/0x2a3
[<ffffffff80031e19>] ip_output+0x2ae/0x2dd
[<ffffffff8025359a>] ip_forward+0x24f/0x2bd
[<ffffffff8003587a>] ip_rcv+0x539/0x57c
[<ffffffff80020c21>] netif_receive_skb+0x470/0x49f
[<ffffffff886c3ef9>] :bridge:br_handle_frame_finish+0x1bc/0x1d3
[<ffffffff886c801b>] :bridge:br_nf_pre_routing_finish+0x2e9/0x2f8
[<ffffffff886c7d32>] :bridge:br_nf_pre_routing_finish+0x0/0x2f8
[<ffffffff800565d5>] nf_hook_slow+0x58/0xbc
[<ffffffff886c7d32>] :bridge:br_nf_pre_routing_finish+0x0/0x2f8
[<ffffffff886c8c18>] :bridge:br_nf_pre_routing+0x600/0x61c
[<ffffffff80034077>] nf_iterate+0x41/0x7d
[<ffffffff886c3d3d>] :bridge:br_handle_frame_finish+0x0/0x1d3
[<ffffffff800565d5>] nf_hook_slow+0x58/0xbc
[<ffffffff886c3d3d>] :bridge:br_handle_frame_finish+0x0/0x1d3
[<ffffffff886c407e>] :bridge:br_handle_frame+0x16e/0x1a4
[<ffffffff800a4e3f>] ktime_get_ts+0x1a/0x4e
[<ffffffff80020b34>] netif_receive_skb+0x383/0x49f
[<ffffffff8003055c>] process_backlog+0x89/0xe7
[<ffffffff8000ca51>] net_rx_action+0xac/0x1b1
[<ffffffff80012562>] __do_softirq+0x89/0x133
[<ffffffff8005e2fc>] call_softirq+0x1c/0x28
<EOI>  [<ffffffff8006d636>] do_softirq+0x2c/0x7d
[<ffffffff8004de57>] netif_rx_ni+0x19/0x1d
[<ffffffff887b951d>] :tun:tun_chr_writev+0x3b4/0x402
[<ffffffff887b956b>] :tun:tun_chr_write+0x0/0x1f
[<ffffffff800e3307>] do_readv_writev+0x172/0x291
[<ffffffff887b956b>] :tun:tun_chr_write+0x0/0x1f
[<ffffffff80041ef3>] do_ioctl+0x21/0x6b
[<ffffffff8002fff5>] vfs_ioctl+0x457/0x4b9
[<ffffffff800b9c60>] audit_syscall_entry+0x1a8/0x1d3
[<ffffffff800e34b0>] sys_writev+0x45/0x93
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: c3 41 56 41 55 41 54 55 48 89 fd 53 8b 87 88 00 00 00 89 c2
RIP  [<ffffffff80268427>] icmp_send+0x5bf/0x5c0
RSP <ffff810107bb7830>

This was interesting, as I do have several KVM virtual machines running on the server, some with bridged networking and others with shared networking. The #netfilter guys confirmed that the happy cracking message was due to the attempted creation of malformed packets by root. My guess was that some of the packets from the DDoS attacks were hitting my server, and the bridge was faithfully attempting to forward those invalid packets to one of the virtual machines' network interfaces, causing problems in icmp_send().

The LKML message hinted that the REJECT policy could be at fault, so I opened up /etc/sysconfig/iptables and switched to a DROP policy:

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -m physdev  --physdev-is-bridged -j ACCEPT
# ... a bunch of forwarding rules for the shared network VMs
-A FORWARD -o virbr0 -j DROP # <-- THIS ONE
-A FORWARD -i virbr0 -j DROP # <-- THIS ONE
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp -m icmp --icmp-type any -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
# ... a bunch of ACCEPT rules for the server
-A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A RH-Firewall-1-INPUT -j DROP # <-- THIS ONE
COMMIT

And that was all it took! I have not experienced a hang since. According to the guys in #netfilter, newer kernels reassemble packets from scratch when forwarding them, so if a machine is sent an invalid packet this problem is averted. CentOS 5, however, uses an older kernel, and I guess this fix hasn't been backported.

TL;DR: If you're using CentOS 5 (or any other distro with an older kernel), use a DROP policy in your iptables configuration instead of REJECT!

Downlink/uplink frequencies listed in the iPhone Field Test

I have been attempting to debug a poor reception problem with my iPhone 4 near my house. It has been confirmed that I reside in a dead zone (I am right in the middle of the overlap between two towers, so I get the weakest signal from both), and there's no incentive for my cell provider to fix the problem since the dead zone is very small.

In an attempt to help document the problem, I wanted to capture the exact cell IDs and frequencies in use when I experience reception problems. Activating the iPhone's Field Test Mode is easy enough (dial *3001#12345#*), but I quickly realized that something was off with the frequencies listed under the UMTS and GSM RR Info panes. It displayed downlink/uplink frequencies of 1037/812 respectively, which seemed reasonable, but at other times it would show frequencies like 437/37, which made no sense at all.

After a bit of research, it looks like the label for that data value is wrong; it should be channel, not frequency. Wikipedia has a list of UMTS (3G) frequency bands and the corresponding channel numbers, as well as the corresponding list for GSM (2G/EDGE). Channel numbers 1037/812 correspond to the 850MHz frequency band, which I know Rogers, Telus & Bell all use in their new 3G network deployments. The other popular band in use in Canada is 1900MHz PCS, and sure enough that's what 437/37 corresponds to. Problem solved!
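
For the curious, the channel-to-frequency math itself is simple: the carrier frequency is a per-band offset plus 0.2 MHz times the channel number. Here's a minimal sketch of the conversion for the two pairs above (the offsets are my reading of the UMTS band V and band II "additional channel" offsets in 3GPP TS 25.101, so treat them as an assumption rather than gospel):

#!/usr/bin/env python
# UARFCN -> carrier frequency: F (MHz) = band offset + 0.2 * channel.

def uarfcn_to_mhz(uarfcn, offset_mhz):
    """Carrier centre frequency in MHz for a given channel number."""
    return offset_mhz + 0.2 * uarfcn

# Band V (850MHz): both link directions use a 670.1 MHz offset.
print("850:  DL %.1f MHz, UL %.1f MHz" %
      (uarfcn_to_mhz(1037, 670.1), uarfcn_to_mhz(812, 670.1)))

# Band II (1900MHz PCS): both link directions use a 1850.1 MHz offset.
print("1900: DL %.1f MHz, UL %.1f MHz" %
      (uarfcn_to_mhz(437, 1850.1), uarfcn_to_mhz(37, 1850.1)))

That puts 1037/812 at 877.5/832.5 MHz (inside the 850 band's 869-894/824-849 MHz downlink/uplink ranges) and 437/37 at 1937.5/1857.5 MHz (inside PCS's 1930-1990/1850-1910 MHz ranges), which lines up with the bands above.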