Liuw's Thinkpad

To win, first learn to lose; to succeed, first learn to fail.

Archive for the ‘Tech’ Category

A Note on the VirtIO on Xen Project

without comments

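From time to time, interested people email me to ask about the state of VirtIO on Xen, so I am writing this post to explain what happens next with the project.

I did write a post describing the work needed to take VirtIO on Xen further. After thinking it over seriously, though, I doubt the project will see much more development: once the toolstack patches are merged, further work has little practical value. More specifically, VirtIO on PV still has some difficulties that cannot be overcome; the earlier posts describe them. They have been worked around for now, but the workarounds cannot be sent upstream because they already break many things.

So my final take on VirtIO on Xen is this: using it in HVM is fine, but it is best avoided in PV. The PV implementation is fairly poor, and those patches cannot be upstreamed anyway.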

Written by liuw

November 30th, 2011 at 10:00 pm

Posted in Tech


Xen netback Improvements

without comments

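The text below is compiled from an email by Ian Campbell, plus some of my own thoughts and material I looked up.

Xen netback still basically works on a copying model; that is, data exchanged between frontend and backend is not zero copy. IanC is currently upstreaming the skb paged fragment destructor patches. This series lets the backend use grant mapping for guest RX and thereby achieve zero copy. Further work on top of the copying model makes little sense.

Next comes a series of refactorings.

The current model gives each VCPU of the driver domain one netback worker, and these workers are shared among multiple VIFs. The plan is to switch to one netback worker per backend. Memory usage needs attention when making this change: in the old model memory usage is essentially fixed, while in the new model the number of workers depends on the number of backends, so it has to scale. A memory pool could prevent too many workers from consuming too much memory. For a first implementation, though, static memory allocation is acceptable.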
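To make the memory pool idea concrete, here is a minimal sketch of a fixed-size page pool that caps total memory no matter how many workers draw from it. It is not netback code: the names (page_pool, pool_get, pool_put) and the sizes are made up for illustration, and a real pool shared between workers would also need locking.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGES 4      /* hypothetical cap on pages shared by all workers */
#define PAGE_BYTES 4096

/* A fixed-size pool: total memory stays bounded however many workers exist. */
struct page_pool {
    unsigned char pages[POOL_PAGES][PAGE_BYTES];
    int free_list[POOL_PAGES];   /* stack of free page indices */
    int nr_free;
};

static void pool_init(struct page_pool *p)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_PAGES; i++)
        p->free_list[i] = i;
    p->nr_free = POOL_PAGES;
}

/* Returns a page, or NULL when the pool is exhausted (caller must back off). */
static void *pool_get(struct page_pool *p)
{
    if (p->nr_free == 0)
        return NULL;
    return p->pages[p->free_list[--p->nr_free]];
}

/* Hands a page obtained from pool_get() back to the pool. */
static void pool_put(struct page_pool *p, void *page)
{
    int idx = (int)(((unsigned char *)page - &p->pages[0][0]) / PAGE_BYTES);
    assert(idx >= 0 && idx < POOL_PAGES && p->nr_free < POOL_PAGES);
    p->free_list[p->nr_free++] = idx;
}
```

A worker that gets NULL from pool_get() would have to defer its work until another worker returns a page, which is the back-pressure that keeps memory bounded.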

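Switch the internal interface to NAPI. This improvement depends on the previous one. TX completion is cheap for most devices, so the biggest cost under NAPI is RX. After the switch, however (NAPI polling runs in softirq context rather than in a thread), we have to consider whether a VCPU could become overloaded, which is also why threads are still used today. This part needs careful testing.

The work after that depends on NAPI.

Receiver (guest) side copy: the description was too vague, and I am not sure what IanC meant.

Multiple queues for TX and RX: this is probably related to the VMDQ that Stefano mentioned, which offloads the dispatch work to hardware (it used to be done by the software bridge). Intel’s 10 Gigabit NICs already support it. An implementation and test on VMware reportedly showed, at 4x, a throughput improvement of about 1.3x (4 -> 9.2).

Besides the ideas above, there is one relatively independent improvement: SR-IOV support in netback.

There are many ideas and they are not yet organized; I am listing them here for later use.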

Written by liuw

October 17th, 2011 at 10:17 am

Posted in Tech


Follow-ups of Virtio on Xen

without comments

This post should have been finished a long time ago. Sorry folks, I don’t know if I will really have time to finish what I left over, but it is necessary to write it down so that someone (me?) can pick it up and move on…

I wrote down some of my thoughts on the Xen wiki’s VirtioOnXen page. However, those items are high-level and abstract. Here I will explain my TODOs for the second iteration.

* Enable Xen mapcache for Virtio for PV

The prototype has a very bad exec implementation. For every read or write to memory, it first maps DomU’s memory, then performs the access, then unmaps it. A single operation causes two page table updates. That’s not ideal.

However, it is not easy to enable the mapcache in the PV case. Xen’s PV memory model is quite different from HVM’s. The Xen mapcache was originally designed for HVM and is tightly coupled with QEMU: it is hooked into cpu_physical_memory_rw. It is not easy to unify the HVM and PV memory models (is it really possible?) and reuse cpu_physical_memory_rw.

The way I see to get this done is to first refactor the mapcache to decouple it from the existing code. Then we can use the mapcache for PV exec.
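To illustrate why a mapcache helps, here is a toy sketch of a direct-mapped cache of page mappings. It is not QEMU’s mapcache. map_guest_page(), unmap_guest_page() and guest_ram are stand-ins invented for this example, and the two counters only make the saving visible.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define CACHE_SLOTS 8
#define GUEST_PAGES 16

/* Toy guest memory; a real backend would map foreign pages through Xen. */
static unsigned char guest_ram[GUEST_PAGES][PAGE_SIZE];
static unsigned long nr_maps, nr_unmaps;   /* count the expensive operations */

static void *map_guest_page(uint64_t gfn)
{
    assert(gfn < GUEST_PAGES);
    nr_maps++;
    return guest_ram[gfn];
}

static void unmap_guest_page(void *va) { (void)va; nr_unmaps++; }

struct map_slot { uint64_t gfn; void *va; };
static struct map_slot cache[CACHE_SLOTS];

/* Return a mapping for gfn, reusing a cached one when possible. */
static void *mapcache_lookup(uint64_t gfn)
{
    struct map_slot *s = &cache[gfn % CACHE_SLOTS];
    if (s->va != NULL && s->gfn == gfn)
        return s->va;                /* hit: no page table update */
    if (s->va != NULL)
        unmap_guest_page(s->va);     /* evict the previous occupant */
    s->va = map_guest_page(gfn);
    s->gfn = gfn;
    return s->va;
}

/* Copy len bytes to guest address gaddr; must not cross a page boundary. */
static void guest_write(uint64_t gaddr, const void *buf, size_t len)
{
    size_t off = (size_t)(gaddr & (PAGE_SIZE - 1));
    assert(off + len <= PAGE_SIZE);
    unsigned char *page = mapcache_lookup(gaddr >> PAGE_SHIFT);
    memcpy(page + off, buf, len);
}

/* Copy len bytes from guest address gaddr; must not cross a page boundary. */
static void guest_read(uint64_t gaddr, void *buf, size_t len)
{
    size_t off = (size_t)(gaddr & (PAGE_SIZE - 1));
    assert(off + len <= PAGE_SIZE);
    unsigned char *page = mapcache_lookup(gaddr >> PAGE_SHIFT);
    memcpy(buf, page + off, len);
}
```

With the map-access-unmap scheme, three accesses to the same page would cost three maps and three unmaps; here they cost a single map.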

* Squash two evtchns into one, eliminate locking

Currently two evtchns are used in the transport layer. One (evtchn1) carries the transport layer notification “BE, I need to control the device” (FE->BE); the other (evtchn2) carries the backend notification telling drivers “hey, you have data waiting” (BE->FE).

Some may wonder why we need two in the first place. All I can say is that I wanted to emulate the behavior of Virtio in the HVM case, that is, synchronous “trap-process-return”. PV cannot really “trap”; it can only spin-wait on a shared variable (or a bit). With these two evtchns, locking is necessary, because “process” may trigger evtchn2, to which an IRQ handler is bound. That re-activates DomU on a different code path, and the transport layer never gets a chance to “trap” again.

So, how does squashing the evtchns help produce lock-free code? We abandon evtchn2 and introduce a bit to indicate notification. We can defer data processing to a work queue. This is how xen-pcifront and xen-pciback work.

To achieve this goal, we first need to introduce bitops into QEMU, because we now have two bits: one for BE acking FE’s control request and one for BE telling FE there is something waiting in the ring. These two bits must be handled in strict order. I will not go deep here; see xen-pcifront and xen-pciback for details. After introducing bitops, we can start refactoring the transport layer, which should be easy.

* Enable Virtio device DMA capability

I don’t know if this is upstreamable, but I think it is worth trying. Virtio is designed for hardware-assisted virtualization. It assumes that the backend, as seen from the device’s point of view (say, qemu-kvm), can access the same address space as the CPU. In the real world, however, devices don’t always have the same view as the CPU.

To create a memory space that is consistent across the CPU and the backend, use of the DMA API is necessary. Unfortunately, Virtio devices don’t respond correctly to the DMA API (because they don’t have to, given the assumption mentioned earlier). So in the current code the ring is allocated through the DMA API with a NULL device, which is bad practice.

This task may not be trivial. The coding itself doesn’t require much effort. What concerns me most is keeping Virtio intact. If we are to replace plain kmalloc with the DMA API, we have to change a lot of its design and internals too.

I swear that I’ve tried my best to explain my thinking. To fully understand my decisions, you may have to read the fxxking code. :-)

Written by liuw

September 16th, 2011 at 8:08 pm

Posted in Tech


Status update of Virtio on Xen project

without comments

Hi everybody, it’s the midterm of Google Summer of Code now, so let me tell you what I’ve done and learned during this period.

I started working on the project during the community bonding period. I took Virtio on Xen HVM as my warm-up phase, which would help me understand the QEMU and Virtio implementations better. Luckily, it did not require much work to get Virtio working on Xen HVM. At the end of the community bonding period, I wrote a patch to enable MSI injection for HVM guests, which has been applied to the tree.

Then I started to work on Virtio for pure PV. That’s not trivial. I spent lots of time implementing a Virtio transport layer with Xenbus, event channels and grant tables, called virtio_xenbus (corresponding to the current Virtio transport layer virtio_pci, which uses a virtual PCI bus). The new transport layer must retain the same behavior as the old one. However, one fundamental difference between evtchn and vpci is that vpci works synchronously while evtchn is inherently asynchronous. I got inspired by xen-pcifront and xen-pciback and finally solved this problem. Ah, a working transport layer at last.

But porting Virtio to pure PV needs more than a working transport layer. The vring, which is responsible for storing data, also needs some care. The original implementation uses kmalloc() to allocate the ring. It is OK to use kmalloc() to get physically contiguous memory in HVM. However, a Xen PV backend needs to access machine-contiguous memory. So we have to enable Xen’s software IOTLB and replace kmalloc() with the DMA API. Also, the physical addresses in the scatter-gather list have to be replaced with machine addresses. With that, we get a vring implementation for pure PV guests.
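The difference between physically contiguous and machine-contiguous memory can be shown with a toy physical-to-machine (p2m) table. This is only a sketch with invented helper names; it does not use the kernel’s real p2m interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* True if pseudo-physical frames [pfn, pfn + n) are backed by consecutive
 * machine frames according to the toy p2m table. */
static bool machine_contiguous(const uint64_t *p2m, size_t p2m_len,
                               size_t pfn, size_t n)
{
    assert(p2m != NULL && n > 0 && pfn + n <= p2m_len);
    for (size_t i = 1; i < n; i++)
        if (p2m[pfn + i] != p2m[pfn] + i)
            return false;
    return true;
}

/* Translate a pseudo-physical address into a machine address. */
static uint64_t phys_to_machine(const uint64_t *p2m, size_t p2m_len,
                                uint64_t paddr)
{
    uint64_t pfn = paddr >> PAGE_SHIFT;
    assert(pfn < p2m_len);
    return (p2m[pfn] << PAGE_SHIFT) | (paddr & ((1u << PAGE_SHIFT) - 1));
}
```

With a table such as { 100, 101, 102, 7, 8 }, frames 0 to 2 are contiguous in both views, but frames 2 and 3 are adjacent only from the guest’s point of view, so a ring spanning them would look broken to the backend.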

Is that all? No. One feature we need to disable is indirect buffer support, because it causes specific drivers to allocate buffers with kmalloc() at a much higher level. I wrestled with this problem for some time and found that I would rather leave those drivers alone than break them. So I chose to disable this feature for the moment. But it is critical to good performance, so I may try to enable it someday.

Good, we finally have our foundation ready! Let’s start to wrestle with specific drivers. I chose the Virtio net driver as a start. Every driver has its own features. As mentioned above, we should avoid allocating buffers with kmalloc() at the driver level, so the CTRL_VQ feature needs to be disabled. In fact, I have no driver features enabled at the moment. What makes me really happy is that Virtio net almost works out of the box. I just want to make sure things work; premature optimization is evil.

What to do next? Virtio blk is my next goal. Hopefully it will not take too long, because I’m a bit behind schedule. Then I will start to port SPICE to Xen, and after that try to enable more features of Virtio net/blk and get better performance. That’s the plan. Time is very limited, but I feel excited.

I’ve learned a lot during this period. I work together with the community, and the interaction works out quite well. I discussed a lot with Xen developers and got a better understanding of Xen and QEMU, as well as of Virtio itself.

Last but not least, I want to thank Stefano Stabellini, Ian Campbell, Konrad Wilk and those who helped me through my project in this hot summer. I would not have come so far without your help.

Written by liuw

July 16th, 2011 at 10:34 pm

Posted in Programming,Tech


How to Port VirtIO to Xen HVM

with 4 comments

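This is only an analysis post with no concrete code. The analysis is not comprehensive and may contain mistakes. Treat it as my own notes.

I read through Anthony’s QEMU-dm. When I wrote the proposal I mainly looked at KVM’s handling code (kvm-all.c). Now that I am about to start work, I first need to understand Xen’s handling code (xen-all.c). The basic principles are the same; only some naming and code logic differ. At first glance this phase of the work is relatively easy, and I already have a fairly clear idea of how to proceed.

As I mentioned in Porting VirtIO to Xen, KVM’s dispatch function is kvm_cpu_exec(), which dispatches requests according to the VMEXIT reason. Xen must have its own dispatcher as well. At the end of xen_init(), xen_vm_change_state_handler() is registered as QEMU’s change state handler. Following xen_vm_change_state_handler(), cpu_handle_ioreq() is in turn registered as the event channel handler.

cpu_handle_ioreq() calls two functions, handle_buffered_iopage() and handle_ioreq(), to process IO requests. handle_ioreq() distinguishes several IO types. Xen names them differently from KVM, but a quick comparison reveals the correspondence.

1. KVM_EXIT_IO corresponds to IOREQ_TYPE_PIO
2. KVM_EXIT_MMIO corresponds to IOREQ_TYPE_COPY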
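The correspondence above can be pictured as a dispatcher that switches on the request type. The sketch below is not the code in xen-all.c: the struct layouts, the handler table and the demo handlers are invented for illustration, and the two constants are local stand-ins named after Xen’s request types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins named after Xen's request types; a real device model
 * takes these constants from Xen's public ioreq header. */
enum { IOREQ_TYPE_PIO = 0, IOREQ_TYPE_COPY = 1 };

struct io_request {
    int      type;      /* one of the IOREQ_TYPE_* values */
    uint64_t addr;      /* port number or guest physical address */
    uint64_t data;      /* value written, or slot for the value read */
    int      is_write;  /* non-zero when the guest writes to the device */
};

/* Hypothetical per-type handlers supplied by the device model. */
struct io_handlers {
    void (*pio)(struct io_request *req);
    void (*mmio)(struct io_request *req);
};

/* Demo handlers: answer reads with a recognizable value. */
static void demo_pio(struct io_request *req)
{
    if (!req->is_write)
        req->data = 0x11;
}

static void demo_mmio(struct io_request *req)
{
    if (!req->is_write)
        req->data = 0x22;
}

/* Route one request by type; returns 0 if handled, -1 for an unknown type. */
static int dispatch_ioreq(const struct io_handlers *h, struct io_request *req)
{
    assert(h != NULL && req != NULL);
    switch (req->type) {
    case IOREQ_TYPE_PIO:    /* counterpart of KVM_EXIT_IO */
        h->pio(req);
        return 0;
    case IOREQ_TYPE_COPY:   /* counterpart of KVM_EXIT_MMIO */
        h->mmio(req);
        return 0;
    default:
        return -1;
    }
}
```

The point of the table of function pointers is that the device-facing handlers do not need to know whether the request came from KVM or from Xen.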

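When VirtIO registers a device with KVM, it writes to the Virtual PCI device directly, which triggers KVM_EXIT_MMIO. cpu_physical_memory_rw() is then called, and inside it accesses to IO mem are handled by functions registered by the user.

Registering VirtIO’s handlers for an IO mem region is a fairly involved process. The main files are pci.c and exec.c. pci.c has pci_update_mappings(), which calls cpu_register_physical_memory(); that in turn calls cpu_register_physical_memory_offset() to update the IO mem mapping. During initialization, these handlers are all registered through cpu_register_io_memory into the corresponding function arrays, io_mem_write[] and io_mem_read[]. Another related array is io_mem_opaque[].

io_mem_write[] has type CPUWriteMemoryFunc and io_mem_read[] has type CPUReadMemoryFunc; these two types match those of virtio_ioport_write() and virtio_ioport_read(). (The corresponding typedefs can be looked up.)

Although I do not yet understand QEMU’s device registration mechanism in full detail, my initial guess is that all this handling eventually ends up in virtio_ioport_write(). I will verify this when I actually write the code.

So the conclusion is that, when registering with Xen, the control flow should enter IOREQ_TYPE_COPY. Likewise, when VirtIO triggers an event (kick) under KVM, it also writes to the Virtual PCI device directly, so the logic ends up in the same place. I now have an idea of how to register and how to trigger events. In fact, this control logic will not need many changes, because in the end everything is dispatched to VirtIO’s handlers. What we should really focus on is VirtIO’s low-level implementation functions (for example, vp_notify()). Currently these handlers use KVM functionality directly. According to Stefano, it would be best to replace these functions with QEMU-generic ones, in effect adding one more layer of glue, and then implement the low-level handling separately for KVM and Xen.

Does Virtual PCI need to change? No. Even at this point in the analysis I do not fully understand how QEMU registers devices, but as things stand, this level of understanding is basically enough. What I need to focus on is which KVM interfaces VirtIO uses; those are the places that actually need modification. Xen tools needs few changes.

My plan was to start writing code as soon as possible, but unexpectedly I could not get the development environment fully working. First the VGA console did not work, which took several days. Upstream QEMU and Xen QEMU are not quite the same, and sorting that out took another two days; it is usable now, but still has problems. I solved quite a few problems these days, but none of them were core problems, so it is somewhat frustrating. I will not be too optimistic about the time estimate. Low-level development seems to involve a lot of unpredictable situations, which is very frustrating. All I can do is start as soon as possible and buy myself more time.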

Written by liuw

May 6th, 2011 at 4:26 pm

Posted in Tech


An Analysis of cpu_relax on x86

without comments

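In many cases the kernel does its work without taking a lock, relying only on polling a shared variable for synchronization. Going one step further, even taking a lock is essentially polling a shared variable. This polling requires the CPU to loop and wait.

If a newbie like me were to write it, the loop body would probably do nothing at all, XD. The x86 kernel, however, generally calls cpu_relax(). So what is this function?

In fact, the function is very simple.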

#define cpu_relax() rep_nop()
static always_inline void rep_nop(void)
{
        asm volatile ( "rep;nop" : : : "memory" );
}

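Spinlocks contain the rep;nop statement too. Having too much time on my hands, I wondered: why rep;nop rather than nop;nop, or nop;nop;nop…;nop? They all do nothing anyway, so why pick this one? As everyone knows, at the kernel level practically every line of code is optimal, so there must be a reason for the choice.

The machine code for rep;nop is f3 90, which is exactly the machine code of the pause instruction; it is effectively an “alias” of pause. Is this a coincidence? And what does the pause instruction do?

Here is a passage from the Intel manual: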

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.

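An additional function of the PAUSE instruction is to reduce the power consumed by a Pentium 4 processor while executing a spin loop.

Put simply, pause hints to the CPU that the code is a spin-wait loop, so the CPU can avoid the memory order violation on leaving the loop and the costly pipeline flush that comes with it. The instruction also comes with a bonus: lower power consumption. The most fundamental requirement for kernel code is to be fast, fast, and faster; since this instruction throws in a bonus as well, why not use it?

So why not write pause directly instead of rep;nop? In theory they are equivalent, but as to why it is not done that way in practice, sorry, I don’t know. What is certain is that pause was only introduced with the Pentium 4, so perhaps people are just nostalgic and keep using rep;nop. (The Intel manual does note that the encoding is backward compatible: earlier IA-32 processors execute it as a plain NOP.)

So, if you ever write an application and feel the urge to write a busy-wait loop, you might as well use pause too. Then again, I suppose a programmer who writes something as silly as a busy-wait loop in an application would hardly think of using pause to save energy and gain speed. What a headache.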

Written by liuw

March 28th, 2011 at 4:46 pm

Posted in Tech


Ruijie Authentication Works on the ASUS WL-520GU

without comments

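Back home in the evening, I tried the newstar and ruijieclient binaries that I had statically compiled earlier today. Notes follow.

newstar works well. Per the router’s configuration, vlan1 is the WAN port; after a little configuration newstar authenticated successfully. It needs to write to the etc directory, but the original image is read-only, so I mounted a writable directory over etc, and that did it.

ruijieclient, on which I had pinned high hopes, did not work. Communication is now normal, but authentication fails with the message “not using a Ruijie client”. Some authentication packets probably need more hacking. That is for later.

For now newstar is good enough to use, so I will leave it at that. I will capture and analyze ruijieclient’s packets when I have time.

My laptop is now online over this wireless connection. I am using DD-WRT build 13064. The problem I have found so far is that with the user MAC filter enabled, wireless drops out from time to time; I do not know whether it is a software bug or a hardware bug.

More tinkering later.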

Written by liuw

March 17th, 2011 at 11:07 pm

Posted in Tech


Cross-Compiling ruijieclient for the ASUS WL-520GU

without comments

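Sadly, the toolchain mentioned in the previous post is actually for compiling the kernel. What I really should use is the OpenWRT SDK, so first get the OpenWRT SDK. This post was written after the fact, so some errors and omissions are inevitable.

$ svn co svn://svn.openwrt.org/openwrt/branches/8.09
$ cd 8.09
$ make menuconfig
$ make

You do not need to select much in menuconfig. Remember to select libpcap (either M or *), since ruijieclient needs its headers and library.

Add the directory containing the cross compiler to the PATH variable.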

$ export PATH=$PATH:/PATH/TO/YOUR/8.09/staging_dir/toolchain-mipsel_gcc3.4.6/bin

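If you are building newstar, you can simply run make at this point.

Building ruijieclient takes more, though, because it uses autoconf to generate its Makefile; the configuration is more thorough, and also more trouble.

First get ruijieclient.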

$ git clone git@github.com:microcai/ruijieclient.git
$ cd ruijieclient
$ ./autogen.sh

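This generates the configure script. If you run make CC=mipsel-linux-gcc directly now, it reports that pcap.h cannot be found. pcap.h does in fact exist in the toolchain, under /PATH/TO/YOUR/8.09/staging_dir/mipsel/usr/include. But the search path of mipsel-linux-gcc is gcc’s include directory inside the toolchain, so first create some symlinks.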

$ cd /PATH/TO/YOUR/8.09/staging_dir/toolchain-mipsel_gcc3.4.6/include
$ mkdir pcap
$ cd pcap
$ ln -s /PATH/TO/YOUR/8.09/staging_dir/mipsel/usr/include/pcap.h pcap.h
$ cd ..
$ ln -s /PATH/TO/YOUR/8.09/staging_dir/mipsel/usr/include/pcap-bpf.h pcap-bpf.h
$ ln -s /PATH/TO/YOUR/8.09/staging_dir/mipsel/usr/include/pcap-namedb.h pcap-namedb.h

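configure still cannot be run at this point, because headers alone are not enough; the pcap archive or shared object is needed too. After the SDK is built, the directory /PATH/TO/YOUR/8.09/build_dir/mipsel/libpcap-0.9.8 contains the generated .a and .so files; copy them into the toolchain’s lib directory.

Finally, generate the Makefile with the following configure invocation.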

$ ./configure --host=mipsel-openwrt-linux-uclibc CC=mipsel-linux-gcc LD=mipsel-linux-ld AR=mipsel-linux-ar RANLIB=mipsel-linux-ranlib LDFLAGS=-lpcap

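There, the Makefile is finally generated. That was exhausting.

Now run make… and no, it fails again. ruijieclient is written for 2.6 kernels, while the kernel in the SDK is still 2.4, so the kernel headers differ. packetsender.c passes the SOCK_CLOEXEC flag when calling socket, and 2.4 does not have it. I think removing it is fine, because the program is not multi-threaded and does not spawn child processes after connecting. For the exact meaning of the flag, read the manpage.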
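An alternative to deleting the flag outright is a small wrapper that uses SOCK_CLOEXEC where the headers define it and falls back to fcntl() with FD_CLOEXEC otherwise. This is a sketch, not a patch to packetsender.c; socket_cloexec() and fd_is_cloexec() are names made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a socket with close-on-exec set, even where SOCK_CLOEXEC is missing. */
static int socket_cloexec(int domain, int type, int protocol)
{
#ifdef SOCK_CLOEXEC
    return socket(domain, type | SOCK_CLOEXEC, protocol);
#else
    /* Older headers: set the flag in a second step. */
    int fd = socket(domain, type, protocol);
    if (fd >= 0) {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
            close(fd);
            return -1;
        }
    }
    return fd;
#endif
}

/* Report whether close-on-exec is set on an open descriptor. */
static int fd_is_cloexec(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFD);
    return flags >= 0 && (flags & FD_CLOEXEC) != 0;
}
```

The fallback is not atomic: another thread could fork and exec between socket() and fcntl(). For a single-threaded program such as the one described here, that gap does not matter.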

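Now make passes as well. The resulting ruijieclient is dynamically linked and not stripped. Getting a statically linked, stripped ruijieclient is easy.

Next comes testing on the router whether the program works. I spent this much time on it because ruijieclient seems more polished, and I hope it will be a bit easier to use than the original newstar.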

Written by liuw

March 17th, 2011 at 4:20 pm

Posted in Tech


Changes in How Xen Handles Writable Page Tables

without comments

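The papers I used to read always described Xen’s writable page table mechanism as working like this:

1. DomU writes to an L1 page table; because Xen has marked the L1 page table read-only, this triggers a page fault;
2. Xen catches the page fault, unhooks the L1 page table from the process’s page tables, and makes it writable;
3. DomU updates the page table on its own;
4. Xen checks the updates and, if they are valid, hooks the L1 page table back into the process’s page tables;
5. DomU resumes normal execution.

But now that we are writing a paper ourselves, we found that this is no longer how it works. The version we use is Xen-3.3.0, which operates as follows:

1. DomU writes to an L1 page table; because Xen has marked the L1 page table read-only, this triggers a page fault;
2. Xen catches the page fault and emulates the write on DomU’s behalf;
3. DomU resumes normal execution.

The former approach reduces the number of traps, so speed and efficiency are relatively assured, and the kernel code needs fewer changes.

The latter approach is somewhat slower but acceptable for a small number of page table updates; large batches of updates should issue an explicit hypercall.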

I do not know in which version this changed. Luckily I paid attention while writing, or I would have made a fool of myself. It pays to be meticulous.

Written by liuw

March 13th, 2011 at 12:28 pm

Posted in Tech


Getting Ready to Hack the ASUS WL-520GU

with 3 comments

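The ASUS WL-520GU is a classic router, but for lack of money I never bought one. Now I have taken home one that the lab retired, to play with. To be honest, it has been a long time since I tinkered with this kind of thing, so it is time to stretch my muscles. Of course, there are now more and better routers for people to hack. My needs are modest: I mainly want to play around and build Ruijie into it, so that I can use the campus network over wireless, which is much more convenient.

This 520GU had already been flashed with Tomato, but I rather dislike the name Tomato (Tomato pleads innocent), so I plan to switch to DD-WRT. Flashing is simple: find the matching firmware on dd-wrt.com and flash it directly through the web interface. People online say you generally should not flash over WiFi, but I do not care, because I think the signal is good enough; after all, the router is right next to me, so the signal could hardly be bad. Of course, if you are unlucky and the WiFi driver is unstable, that is another story.

Getting Ruijie working is the real challenge. Domestic companies like to keep everything closed, because that is the best way for them to keep their advantage. Ruijie’s official Linux client has never been updated. Fortunately, people at both WHU and HUST have written supplicants for Linux; with luck I can simply cross-compile one. Without luck, more hacking will be needed.

I am currently downloading the DD-WRT toolchain, which turns out to be over 700 MB. Ouch. At the current speed, it will take at least a day or two to finish.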

Written by liuw

March 10th, 2011 at 8:33 pm

Posted in Tech
