Archive for the ‘Programming’ Category
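A few tips on debugging oopses
If you walk by the river often, you are bound to get your feet wet, and writing programs without ever hitting a bug is impossible. For a kernel newbie, triggering a kernel oops is simply routine.
Judging from the oopses I have run into so far, the first step in debugging is to check whether the bug can be reproduced reliably.
If it cannot be reproduced reliably (the trigger conditions vary, and so do the oops messages), then congratulations: you are almost certainly looking at a race condition. Such a problem may be big or small, but in the end the fault is bound to be in your own code (assuming the other modules are stable). At this point, first walk through every code path to check for logic errors, then audit all the locking.
If it can be reproduced reliably, things are somewhat easier. Narrow it down with printk until you find the offending statement. You can also disassemble with objdump and match the output against the stack trace in the oops to see roughly which statement is at fault, though personally I do not find that very helpful. I barely know how to use the more advanced tools, sigh.
The most awesome debugging tool is still printk, because it is robust in any context. What a great property! (sigh)
What? No stack trace? You did not enable the kernel's debug options? Then forget I said anything.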
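A simple memory pool for the kernel
After several days of coding, netback has now been moved to a thread-per-vif model. Very limited testing suggests that in some cases the new netback performs slightly better than the old one (at least it is no worse).
To bound the memory usage of netback under the new model, a unified memory pool is needed. Keeping things as simple as possible, a free list manages the available entries and provides simple get-one and put-one operations. I may need this again later, so I am writing it down here. I won't claim it is well implemented, but it is at least good enough.
All pages are allocated with alloc_page, so in the initial state their reference count should be 1.
Locks are used as little as possible. put takes no lock because the write is atomic, and the real page reclaim is deferred until the pool has no free pages left. However, if get-n and put-n operations were to be implemented, put would have to take the lock as well.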
#define PAGE_POOL_ORDER 10
#define PAGE_POOL_SIZE (1 << PAGE_POOL_ORDER)
#define PAGE_POOL_NR_RESERVED 0

typedef uint32_t idx_t;

#define FREE_LIST_END 0xffffffff
#define INVAL_IDX FREE_LIST_END
#define RECLAIMABLE (INVAL_IDX - 1)

struct page_pool {
	struct page **pool;
	idx_t *free_list;
	idx_t free_head;
	int free_count;
	spinlock_t lock;
};

static struct page_pool *page_pool;

int get_free_entry(void)
{
	unsigned long flag;
	idx_t head;
	int idx;

	spin_lock_irqsave(&page_pool->lock, flag);

	if (page_pool->free_count == 0) {
		/* XXX liuw: reclaim pages */
		/* free_list[idx] == RECLAIMABLE AND */
		/* page->_count == 1 */
		for (idx = 0; idx < PAGE_POOL_SIZE; idx++) {
			if (page_pool->free_list[idx] == RECLAIMABLE &&
			    page_count(page_pool->pool[idx]) == 1) {
				/* __reclaim_free_entry(idx); */
				page_pool->free_list[idx] =
					page_pool->free_head;
				page_pool->free_head = idx;
				page_pool->free_count++;
			}
		}
		if (likely(page_pool->free_count > 0))
			goto found_free_page;
		spin_unlock_irqrestore(&page_pool->lock, flag);
		return -ENOSPC;
	}

found_free_page:
	idx = head = page_pool->free_head;
	page_pool->free_count--;
	page_pool->free_head = page_pool->free_list[head];
	page_pool->free_list[head] = FREE_LIST_END;
	spin_unlock_irqrestore(&page_pool->lock, flag);

	return idx;
}

static inline void put_free_entry(idx_t idx)
{
	page_pool->free_list[idx] = RECLAIMABLE;
}
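To make the lazy-reclaim idea easier to try outside the kernel, here is a minimal user-space sketch of the same free-list logic. It is a model, not the netback code: the names demo_pool, demo_pool_init, demo_get and demo_put are made up for illustration, a plain refcount array stands in for page_count(), the pool is shrunk to four entries, and all locking is left out, so it says nothing about the concurrency behaviour of the real thing.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE      4u            /* tiny on purpose; the post uses 1 << 10 */
#define FREE_LIST_END  0xffffffffu
#define RECLAIMABLE    (FREE_LIST_END - 1)

typedef uint32_t idx_t;

/* Hypothetical user-space stand-in for struct page_pool (no lock, no pages). */
struct demo_pool {
	int   refcount[POOL_SIZE];   /* stands in for page_count(pool[idx]) */
	idx_t free_list[POOL_SIZE];  /* next free index, or a marker value */
	idx_t free_head;
	int   free_count;
};

/* Chain every entry into the free list; each "page" starts with refcount 1. */
static void demo_pool_init(struct demo_pool *p)
{
	idx_t i;

	for (i = 0; i < POOL_SIZE; i++) {
		p->refcount[i] = 1;
		p->free_list[i] = (i + 1 < POOL_SIZE) ? i + 1 : FREE_LIST_END;
	}
	p->free_head = 0;
	p->free_count = (int)POOL_SIZE;
}

/* Pop one entry. Returns its index, or -1 if nothing is free or reclaimable. */
static int demo_get(struct demo_pool *p)
{
	idx_t i, head;

	if (p->free_count == 0) {
		/* Deferred reclaim: only entries that were put back AND whose
		 * refcount has dropped to 1 again rejoin the free list. */
		for (i = 0; i < POOL_SIZE; i++) {
			if (p->free_list[i] == RECLAIMABLE && p->refcount[i] == 1) {
				p->free_list[i] = p->free_head;
				p->free_head = i;
				p->free_count++;
			}
		}
		if (p->free_count == 0)
			return -1;
	}

	head = p->free_head;
	assert(head < POOL_SIZE);
	p->free_count--;
	p->free_head = p->free_list[head];
	p->free_list[head] = FREE_LIST_END;  /* mark as in use */
	return (int)head;
}

/* Put is a single store: the entry is only marked, not re-linked. */
static void demo_put(struct demo_pool *p, idx_t idx)
{
	p->free_list[idx] = RECLAIMABLE;
}
```

In this model, draining the pool with four demo_get() calls and then calling demo_put() on one index makes the next demo_get() return that index, provided its refcount is back to 1. An entry that was put back while something else still holds a reference stays out of circulation until the reference is dropped.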
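A few tricks for debugging Xen virtual machines
1. Analyzing context with xenctx
Add settings such as the following to the VM configuration file:
on_reboot="preserve"
on_crash="preserve"
After something goes wrong, xenctx can then be used to obtain the VM's context.
2. Debugging QEMU
In the VM configuration file, use device_model_override to point to a script that acts as the DM. This script contains statements such as:
echo $@ > /tmp/qemu-dm
sleep 1h
Then start the QEMU backend manually under gdb. Note that xl has a timeout, so you need to be quick, or change xl's timeout yourself.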
Status update of Virtio on Xen project
Hi everybody, it’s the midterm of Google Summer of Code now, so let me tell you what I’ve done and learned during this period.
I started working on the project in the community bonding period. I took Virtio on Xen HVM as my warm-up phase, which would help me understand QEMU and the Virtio implementation better. Luckily, it did not require much work to get Virtio working on Xen HVM. At the end of the community bonding period, I wrote a patch to enable MSI injection for HVM guests, which has been applied to the tree.
Then I started to work on Virtio for pure PV. That’s not trivial. I spent lots of time trying to implement a Virtio transport layer with Xenbus, event channels and grant tables, which is called virtio_xenbus (corresponding to the current Virtio transport layer virtio_pci, which utilizes a virtual PCI bus). The new transport layer must retain the same behavior as the old one. However, one fundamental difference between evtchn and vpci is that vpci works in a synchronous way while evtchn is asynchronous by nature. I got inspired by xen-pcifront and xen-pciback and finally solved this problem. Ah, a working transport layer at last.
But porting Virtio for pure PV needs more than a working transport layer. Vring, which is responsible for storing data, also needs some care. The original implementation uses kmalloc() to allocate the ring. It is OK to use kmalloc() to get physically contiguous memory in HVM. However, a Xen PV backend needs to access machine-contiguous memory. So we have to enable Xen’s software IOTLB and replace kmalloc() with the DMA API. Also, the physical addresses in the scatter-gather list should be replaced with machine addresses. With that, we get a Vring implementation for pure PV guests.
Is that all? No. One feature we need to disable is indirect buffer support, because this feature causes specific drivers to allocate buffers with kmalloc() at a much higher level. I wrestled with this problem for some time and found that I would rather leave those drivers alone than break them. So I chose to disable this feature for the moment. But this feature is critical to good performance, so I may try to enable it someday.
Good, we finally have our foundation ready! Let’s start to tackle specific drivers. I chose the Virtio net driver as a start. Every driver has its own features. As mentioned above, we should avoid allocating buffers with kmalloc() at the driver level, so the CTRL_VQ feature needs to be disabled. In fact, I have no driver features enabled at the moment. What makes me really happy is that Virtio net almost works out of the box. For now I just want to make sure things work; premature optimization is evil.
What to do next? Virtio blk is my next goal. Hopefully it will not take too long because I’m a bit behind schedule. Then I will start to port SPICE to Xen, and after that try to enable more features of Virtio net/blk and gain better performance. That’s the plan. Time is very limited, but I feel excited.
I’ve learned a lot during this period. I work together with the community. The interaction works out quite well. I discussed a lot with Xen developers and got a better understanding of Xen and QEMU, as well as Virtio itself.
Last but not least, I want to thank Stefano Stabellini, Ian Campbell, Konrad Wilk and those who helped me through my project in this hot summer. I would not have come so far without your help.
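Starting VirtIO for pure PV
Last week I wrote a few patches for Xen.
1. A simple typo fix, nothing much to say.
2. An HVMOP for injecting MSIX for emulated devices, which has gone into staging. The QEMU part has also been posted to qemu-devel, but the QEMU developers said that MSIX injection for emulated devices needs more discussion, so I will wait.
3. VirtIO disk support for libxl, which still needs more polish.
VirtIO for HVM is in fact nearly done now. Next comes the hardest part: VirtIO for pure PV.
Stefano said "we don't exactly know how long it is going to take", and there is quite a lot to consider:
1. What to write into Xenstore.
2. How VirtIO should be initialized.
3. Replacing the underlying implementation (evtchn and so on).
All of this also has to coexist peacefully with the existing functionality, so I need to understand the current design better before I can propose a reasonable one.
For the next couple of days I will read the existing code: first the VirtIO code in the Linux kernel, second the Xen PV initialization code, and third the Xenstore parameter layout used by PV net and the like. Then I will discuss things in more depth with Stefano and Konrad.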
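Progress on VirtIO for HVM
Although I said I would get started as soon as possible, progress has actually been slow, because much of the time went into chasing other bugs. When Xen unstable is used with SeaBIOS, the two disagree on the IRQ configuration, so IRQ injection ended up broken.
Since I am not familiar with the IO-APIC and LAPIC, I was stuck here for a long time. In the end it was Stefano who fixed the bug. I am embarrassed.
With this bug fixed, the main problem of VirtIO for HVM turned out to be solved as well. As long as "pci=nomsi" is added to the guest command line, the VirtIO NIC is in fact already usable.
Stefano told me the next step is to get VirtIO disk for HVM working. This has to wait until Ian Jackson has refactored libxenlight, and is mainly a matter of config parsing and driver issues.
The step after that is to get MSI for emulated devices done. Xen currently only supports injecting MSI into passthrough devices and has no interface for emulated devices, which is why nomsi is needed as mentioned above.
There does not seem to be any clever way to debug; it all comes down to adding printk and the like. But why others debug so fast and I am so slow is a real question. I still do not know the low-level details very well, and I need to keep working hard.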
Porting VirtIO to Xen
by Wei Liu <liuw #SPAMFREE# liuw #DOT# name>
1 Overview
VirtIO is a unified paravirtualized IO framework created by Rusty Russell. It’s not hypervisor-specific, but it is mainly used in KVM. It is possible to port VirtIO to Xen without much effort.
This article is organized into several sections. Section 2 explores how VirtIO is used in KVM, with most of the attention on code analysis. Section 3 discusses how we can port VirtIO to Xen, both for normal PV and PV-on-HVM. Section 4 describes what performance tests will be done. Section 5 introduces the porting plan for Spice (spice-space.org).
2 How VirtIO is used in KVM
This topic can be divided into three subtopics.
2.1 The role of KVM
KVM acts as hypervisor. It is responsible for capturing events then passing events to QEMU. I’m not going to illustrate how it works because this is out of our scope.
2.2 How VirtIO is used in Linux kernel
There are two key perspectives in creating a cross-domain communication channel:
- how to deliver events, e.g. Xen’s event channel, KVM’s handler to trap VM_EXIT and event dispatcher.
- how to share data, e.g. Xen’s ring buffer, VirtIO’s virtqueue.
It is obvious that Xen provides both of these out of the box. However, VirtIO only provides a mechanism for data sharing, and notification is left to the user. VirtIO also registers as a bus inside the kernel, which resembles XenBus.
The core structure in VirtIO is virtqueue. It contains a vring (just like a ring buffer) and some other information. One thing worth mentioning is that virtqueue also wraps up two important function pointers: one for notification, which is exactly what we need to replace, and the other for the callback, which is largely irrelevant to porting.
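The split between the two function pointers can be sketched in plain user-space C. This is a simplified model, not the kernel's struct virtqueue: every demo_ name is hypothetical, and the counters merely mark the places where a real transport would write to an I/O port or send on an event channel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's virtqueue. */
struct demo_virtqueue {
	unsigned int index;
	/* guest -> host "kick"; supplied by the transport (the part to replace) */
	void (*notify)(struct demo_virtqueue *vq);
	/* host -> guest completion; supplied by the driver (left untouched) */
	void (*callback)(struct demo_virtqueue *vq);
	int kicks_pio;      /* where a PCI transport would write an I/O port */
	int kicks_evtchn;   /* where a Xen transport would send on an event channel */
	int completions;
};

/* Stand-in for a PCI-style notify. */
static void demo_notify_pio(struct demo_virtqueue *vq)
{
	vq->kicks_pio++;
}

/* Stand-in for an event-channel-style notify. */
static void demo_notify_evtchn(struct demo_virtqueue *vq)
{
	vq->kicks_evtchn++;
}

/* Stand-in for a driver callback such as a receive-done handler. */
static void demo_rx_done(struct demo_virtqueue *vq)
{
	vq->completions++;
}

static void demo_vq_init(struct demo_virtqueue *vq, unsigned int index,
			 void (*notify)(struct demo_virtqueue *),
			 void (*callback)(struct demo_virtqueue *))
{
	vq->index = index;
	vq->notify = notify;
	vq->callback = callback;
	vq->kicks_pio = 0;
	vq->kicks_evtchn = 0;
	vq->completions = 0;
}

/* Driver side: tell the host that new buffers are available. */
static void demo_kick(struct demo_virtqueue *vq)
{
	assert(vq->notify != NULL);
	vq->notify(vq);
}

/* Interrupt side: the host signalled completion, so run the driver callback. */
static void demo_interrupt(struct demo_virtqueue *vq)
{
	if (vq->callback != NULL)
		vq->callback(vq);
}
```

Swapping vq->notify from demo_notify_pio to demo_notify_evtchn changes how the kick is delivered without touching demo_rx_done, which is the shape of change the port is after.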
Let’s take virtio network device as an example.
First things first: the virtio network device in the Linux kernel is implemented as a PCI device, so a virtio PCI bus has to be implemented. See drivers/virtio/virtio_pci.c for details. When porting to Xen, it might be necessary to replace this PCI bus with XenBus. (However, it might be left unchanged in PV-on-HVM, since we need to do a VM_EXIT and trap into the hypervisor anyway.)
The source of virtio network lies in drivers/net/virtio_net.c . virtnet_probe() is responsible for probing. In this function, virtnet is set up with at least two vrings, “input” and “output”, and an optional vring for “control”. The callback for “input” is skb_recv_done() and the callback for “output” is skb_xmit_done() .
After calling vdev->config->find_vqs(), these 2 or 3 vrings are set up. If we trace down this function (it lies in virtio_pci.c as vp_find_vqs()), we find that it in turn calls vp_try_to_find_vqs(), setup_vq() and request_irq() .
In setup_vq(), the framework actually allocates the vring for data sharing. It is worth noting that the notify function is vp_notify(), which directly writes queue_index to (vp_dev->ioaddr+VIRTIO_PCI_QUEUE_NOTIFY) to generate a VM_EXIT, so that the hypervisor can catch the event and dispatch it.
And the requested irqs are used to invoke the callback functions, i.e. skb_xmit_done() and skb_recv_done() .
2.3 How VirtIO is used in QEMU
QEMU runs on top of KVM and it interacts with KVM via /dev/kvm . KVM has to cooperate with QEMU. Actually, a virtual machine in KVM is merely a process.
VM instructions execute natively on CPU. However, when a VM executes some sensitive instruction, it will be trapped by KVM. Then KVM passes this instruction to QEMU to emulate / handle it.
KVM hands over the instruction to QEMU in kvm_cpu_exec(), which is in kvm-all.c . There is a `switch` on the exit_reason. Exit reasons include (port) IO, interrupt and MMIO, etc.
QEMU has full access to the guest’s memory. But it has to grab the virtqueue inside the VM first to communicate with virtio_net. When the VM calls setup_vq(), it voluntarily writes the virtqueue’s address to (vp_dev->ioaddr+VIRTIO_PCI_QUEUE_PFN), which is trapped by KVM. KVM passes it to QEMU. QEMU then calls virtio_ioport_write() -> virtio_queue_set_addr() to set VRing, which is the control structure used by VirtIO in QEMU. This is how QEMU and the VM create their data-sharing channel.
As for notification channel, it seems much easier. As mentioned above, VM writes index to VIRTIO_PCI_QUEUE_NOTIFY. It is trapped by virtio_ioport_write(), then passed to virtio_queue_notify(). In virtio net’s case, this request is finally handled by virtio_net_handle_rx() or virtio_net_handle_tx_{timer,bh}() .
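The two register writes described above can be modelled as a small dispatch function on the device-model side. This is only a sketch of the idea, not QEMU's virtio_ioport_write(): the demo_ names are hypothetical, the register offsets and the address shift are taken from the legacy virtio PCI header (linux/virtio_pci.h), and the queue-select register is included as an assumption about how the device knows which queue a PFN write refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets and shift from the legacy virtio PCI layout. */
#define VIRTIO_PCI_QUEUE_PFN        8
#define VIRTIO_PCI_QUEUE_SEL        14
#define VIRTIO_PCI_QUEUE_NOTIFY     16
#define VIRTIO_PCI_QUEUE_ADDR_SHIFT 12

#define DEMO_NR_QUEUES 2u

/* Hypothetical device-side view of one queue. */
struct demo_queue {
	uint64_t ring_addr;  /* guest address of the vring, once published */
	int      notified;   /* how many kicks arrived for this queue */
};

/* Hypothetical device state kept by the device model. */
struct demo_device {
	uint16_t          queue_sel;
	struct demo_queue vq[DEMO_NR_QUEUES];
};

/* Called for every trapped guest write to the device's I/O region. */
static void demo_ioport_write(struct demo_device *dev, uint32_t offset,
			      uint32_t val)
{
	assert(dev != NULL);

	switch (offset) {
	case VIRTIO_PCI_QUEUE_SEL:
		/* Guest selects which queue the next PFN write refers to. */
		if (val < DEMO_NR_QUEUES)
			dev->queue_sel = (uint16_t)val;
		break;
	case VIRTIO_PCI_QUEUE_PFN:
		/* Data-sharing channel: guest publishes the ring's page frame. */
		dev->vq[dev->queue_sel].ring_addr =
			(uint64_t)val << VIRTIO_PCI_QUEUE_ADDR_SHIFT;
		break;
	case VIRTIO_PCI_QUEUE_NOTIFY:
		/* Notification channel: val is the index of the kicked queue. */
		if (val < DEMO_NR_QUEUES)
			dev->vq[val].notified++;
		break;
	default:
		/* Other registers are ignored in this sketch. */
		break;
	}
}
```

In a real device model the notify case would call the queue's handler (for virtio net, the rx or tx handler) instead of bumping a counter, and a port to Xen would have to feed the same information through a different path.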
Many VirtIO-related files reside in QEMU’s hw/ subdirectory, and many others in the Linux kernel tree.
3 How to port VirtIO to Xen
Now that we’ve got our first impression of VirtIO, it is time to discuss how we can port it to Xen.
3.1 PV-on-HVM
It is obvious that in the PV-on-HVM case, things are more or less the same as they are in KVM. Xen traps VM_EXIT and passes exception to QEMU. QEMU emulates. Then Xen sends result, VM resumes running.
3.1.1 How to grab virtqueue address
Xen utilizes QEMU as KVM does. So Linux kernel can stay untouched. Xen captures guest’s write to PCI configuration space, then QEMU will handle changes in VirtIO configuration.
3.1.2 How to deliver event
VirtIO’s current virtual PCI implementation ($QEMU/hw/virtio-pci.c) uses KVM’s event notification functions like kvm_set_ioeventfd_pio_word() and kvm_has_many_ioeventfds(). It is necessary to replace them with corresponding implementations for Xen. Anthony’s QEMU-dm ships with xen-all.c , which might give me some hints on the implementation.
3.2 Normal PV
When it comes to the normal PV case, we will have to do things in a different way. I will try my best to detail what should be done and how. However, this is just a rough design, and things may change during implementation.
3.2.1 How to grab virtqueue address
In Xen’s convention, it is common for Dom0 / DomU to use Xenstore to expose their public information, as netfront/netback, blkfront/blkback, etc. do. So it is a good idea to use Xenstore to expose VirtIO information. Xenstore’s well-defined API will greatly reduce the work needed.
Down at the implementation level, it is necessary to replace the virtual PCI bus with XenBus. VirtIO utilizes virtual PCI to configure the network device. However, in the normal PV case, it is not necessary to expose a virtual PCI device to the VM. We can follow the pattern by which netfront and netback establish their channel.
QEMU-dm from Anthony has functions to manipulate Xenstore, that should help a lot. It also has xen_nic.c, which can greatly inspire how I can implement a VirtIO network for Xen.
3.2.2 How to deliver event
No doubt that event channel is the best choice. Anthony’s QEMU-dm contains a file named xen_backend.c, which is used for event handling. Linux kernel has event channel handling functions, too. (drivers/xen/{events,evtchn}.c)
So, just replace any notification-related function with Xen’s implementation. That’s the plan.
3.3 Other stuff
In the PV-on-HVM case, QEMU needs to emulate. In the normal PV case, no emulation is needed. In either case, QEMU works as the backend dispatcher for VirtIO. Once the channel between the two VMs is established, QEMU is supposed to work out of the box. However, I can’t be too optimistic here: it might require some work, such as rewriting and debugging some functions.
3.4 Knowledge required
To be honest, QEMU’s concepts (like proxy / virtual device management) are somewhat strange to me. There are some high-level design documents, but they are just too high-level. A thorough understanding of QEMU’s internals is required.
Knowledge of hardware virtualization is also required. I need to understand how Xen implements the HVM interface and choose the right function for each piece of functionality.
Knowledge of XenBus configuration is a must in normal PV porting. I’ve read about it before, so this is the easier part.
4 Performance tests
Performance tests will be run with industry-standard software like kernbench, ioperf and netperf. Test suites will be run on several different configurations:
- Native Linux, CPU, disk and network.
- Xen with normal PV VirtIO support, CPU, disk and network.
- Xen with PV-on-HVM VirtIO support, CPU, disk and network.
- Xen with original PV support, CPU, disk and network.
- KVM with VirtIO support, CPU, disk and network.
A short report will be written based on the results, comparing the data collected and analyzing the advantages and disadvantages of each configuration.
5 Porting of Spice
Spice will be ported to Xen’s HVM environment as a real-world test suite. According to its design, Spice communicates with QEMU via the Virtual Device Interface (VDI). The Spice client and server run entirely in userland (correct me if I’m wrong, I’m not a Spice expert). If we are able to run QEMU with QXL or any other VirtIO devices on Xen, it should not be so hard to get Spice running on Xen.
AFAIK, QXL in QEMU (hw/qxl.c) uses its own paravirtualized ring implementation. It also uses qemu_set_irq() to deliver events. So the main idea is to replace this implementation with Xen’s equivalents, which is the same work already done when porting VirtIO.
The plan is to run Spice with our modified QEMU and eliminate any bugs encountered.
6 Reference
- Linux kernel 2.6.38.2
- QEMU-dm, git://xenbits.xen.org/people/aperard/qemu-dm.git
- Xen-unstable, git://xenbits.xen.org/xen-unstable.git
- Spice project, spice-space.org