Liuw's Thinkpad

To win, first learn to lose; to succeed, first learn to fail

Archive for the ‘Programming’ Category

A Few Tips on Debugging Oopses

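Walk by the river often enough and your feet will get wet; writing programs without ever hitting a bug is impossible. For a kernel newbie, a kernel oops is all the more routine.

Judging from the oopses I have run into so far, the first step in debugging is to check whether the bug can be reproduced reliably.

If it cannot be reproduced reliably (the trigger conditions vary, and so does the oops output), then congratulations: you almost certainly have a race condition. Such problems can be big or small, and in the end the fault is surely in your own code (assuming the other modules are stable). At this point, first walk through every code path to check for logic errors; after that, check all the locks.

If it can be reproduced reliably, things are somewhat easier. Narrow down the offending statement step by step with printk. You can also disassemble with objdump and use the stack trace in the oops to see roughly which statement is at fault, though personally I have not found that very helpful. I basically do not know how to use the more advanced tools, I am embarrassed to say.

The best debugging tool is still printk, because it is robust in any context. What a great property! (Sigh.)

What? No stack trace? You did not enable the kernel's debug options? Forget I said anything.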

Written by liuw

January 6th, 2012 at 3:37 pm

Posted in Programming

A Simple Memory Pool for the Kernel

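After a few days of coding, netback has now moved to a thread-per-vif model. Very limited testing suggests that in some cases the new netback even performs a bit better than the old one (at least it is no worse).

To limit the memory footprint of netback under the new model, a unified memory pool is needed. Keeping things as simple as possible, I use a free list to manage the available entries and implement simple get-one and put-one operations. I may need this again later, so I am noting it down. I will not claim the implementation is great, but it is good enough.

All pages are allocated with alloc_page, so their reference count should be 1 in the initial state.

Locks are used as sparingly as possible. put takes no lock because the write is atomic; the actual page reclamation is deferred until the pool has no free pages left. However, if get-n and put-n operations were to be implemented, put would have to take the lock as well.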

#define PAGE_POOL_ORDER         10
#define PAGE_POOL_SIZE          (1 << PAGE_POOL_ORDER)
#define PAGE_POOL_NR_RESERVED   0

typedef uint32_t idx_t;
#define FREE_LIST_END           0xffffffff
#define INVAL_IDX               FREE_LIST_END
#define RECLAIMABLE             (INVAL_IDX - 1)

struct page_pool {
        struct page      **pool;
        idx_t             *free_list;
        idx_t              free_head;
        int                free_count;
        spinlock_t         lock;
};

static struct page_pool *page_pool;

int get_free_entry(void)
{
        unsigned long flag;
        idx_t head;
        int idx;

        spin_lock_irqsave(&page_pool->lock, flag);

        if (page_pool->free_count == 0) {
                /* XXX liuw: reclaim pages */
                /* free_list[idx] == RECLAIMABLE AND */
                /* page->_count == 1 */
                for (idx = 0; idx < PAGE_POOL_SIZE; idx++)
                        if (page_pool->free_list[idx] == RECLAIMABLE &&
                            page_count(page_pool->pool[idx]) == 1) {
                                /* __reclaim_free_entry(idx); */
                                page_pool->free_list[idx] =
                                        page_pool->free_head;
                                page_pool->free_head = idx;
                                page_pool->free_count++;
                        }
                if (likely(page_pool->free_count > 0))
                        goto found_free_page;
                spin_unlock_irqrestore(&page_pool->lock, flag);
                return -ENOSPC;
        }

found_free_page:
        idx = head = page_pool->free_head;
        page_pool->free_count--;
        page_pool->free_head = page_pool->free_list[head];
        page_pool->free_list[head] = FREE_LIST_END;

        spin_unlock_irqrestore(&page_pool->lock, flag);

        return idx;
}

static inline void put_free_entry(idx_t idx)
{
        page_pool->free_list[idx] = RECLAIMABLE;
}
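
The listing above does not show how the pool is initialized. The following is a minimal user-space sketch of the same free-list idea, not the netback code itself: the names model_pool, model_pool_init, model_get and model_put are hypothetical, the spinlock is omitted, and a plain refcount array stands in for page_count().

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE       4               /* tiny pool, for illustration only */
#define FREE_LIST_END   0xffffffffu
#define RECLAIMABLE     (FREE_LIST_END - 1)

typedef uint32_t idx_t;

/* User-space stand-in for struct page_pool; refcount[] models page_count(). */
struct model_pool {
        idx_t free_list[POOL_SIZE];
        int   refcount[POOL_SIZE];
        idx_t free_head;
        int   free_count;
};

/* Hypothetical init: chain every entry into the free list, refcount 1. */
void model_pool_init(struct model_pool *p)
{
        idx_t i;

        for (i = 0; i < POOL_SIZE; i++) {
                p->free_list[i] = (i + 1 < POOL_SIZE) ? i + 1 : FREE_LIST_END;
                p->refcount[i] = 1;
        }
        p->free_head = 0;
        p->free_count = POOL_SIZE;
}

/* Same logic as get_free_entry(), minus the lock. Returns -1 when empty. */
int model_get(struct model_pool *p)
{
        idx_t i, head;

        if (p->free_count == 0) {
                /* Lazy reclaim: relink entries that were put back and are idle. */
                for (i = 0; i < POOL_SIZE; i++)
                        if (p->free_list[i] == RECLAIMABLE &&
                            p->refcount[i] == 1) {
                                p->free_list[i] = p->free_head;
                                p->free_head = i;
                                p->free_count++;
                        }
                if (p->free_count == 0)
                        return -1;
        }

        head = p->free_head;
        p->free_count--;
        p->free_head = p->free_list[head];
        p->free_list[head] = FREE_LIST_END;
        return (int)head;
}

/* put only marks the entry; relinking happens later inside model_get(). */
void model_put(struct model_pool *p, idx_t idx)
{
        assert(idx < POOL_SIZE);
        p->free_list[idx] = RECLAIMABLE;
}
```

In this model, an entry handed back with model_put() becomes available again only once the pool runs dry and the next model_get() scans for RECLAIMABLE entries, which mirrors the deferred reclamation described above.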

Written by liuw

December 12th, 2011 at 11:29 pm

Posted in Programming

A Few Tricks for Debugging Xen Virtual Machines

1. Analyzing the context with xenctx

Add settings such as the following to the VM configuration file:

on_reboot="preserve"
on_crash="preserve"

Once something goes wrong, xenctx can then be used to retrieve the VM's context.

2. Debugging QEMU

In the VM configuration file, use device_model_override to point to a script that acts as the DM. The script contains the following statements:

echo $@ > /tmp/qemu-dm
sleep 1h

Then start the QEMU backend manually under gdb. Note that xl has a timeout, so you need to be quick, or change xl's timeout yourself.

Written by liuw

July 27th, 2011 at 2:04 pm

Posted in Programming

Status update of Virtio on Xen project

Hi everybody, it's the midterm of Google Summer of Code now, so let me tell you what I've done and learned during this period.

I started working on the project during the community bonding period. I took Virtio on Xen HVM as my warm-up phase, which helped me understand the QEMU and Virtio implementations better. Luckily, it did not require much work to get Virtio working on Xen HVM. At the end of the community bonding period, I wrote a patch to enable MSI injection for HVM guests, which has been applied to the tree.

Then I started to work on Virtio for pure PV. That is not trivial. I spent a lot of time implementing a Virtio transport layer on top of Xenbus, event channels and grant tables, called virtio_xenbus (the counterpart of the current Virtio transport layer, virtio_pci, which uses a virtual PCI bus). The new transport layer must retain the same behavior as the old one. However, one fundamental difference between evtchn and vpci is that vpci works synchronously while evtchn is inherently asynchronous. I took inspiration from xen-pcifront and xen-pciback and finally solved this problem. Ah, a working transport layer at last.

But porting Virtio to pure PV needs more than a working transport layer. The vring, which is responsible for storing data, also needs some care. The original implementation uses kmalloc() to allocate the ring. It is fine to use kmalloc() to get physically contiguous memory in HVM. However, a Xen PV backend needs to access machine-contiguous memory. So we have to enable Xen's software IOTLB and replace kmalloc() with the DMA API. Also, the physical addresses in the scatter-gather list should be replaced with machine addresses. With that, we get a vring implementation for pure PV guests.
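
To illustrate why physically contiguous memory is not enough in a PV guest, here is a toy model. It is only a sketch: the table toy_p2m and the helpers toy_pfn_to_mfn and toy_machine_contiguous are made up for this example, and a real guest obtains its physical-to-machine mapping from Xen rather than from a static array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy p2m table: index = guest pseudo-physical frame number,
 * value = machine frame number. The values are arbitrary. */
static const unsigned long toy_p2m[] = { 100, 101, 102, 340, 341, 17 };
#define TOY_NR_PFNS (sizeof(toy_p2m) / sizeof(toy_p2m[0]))

unsigned long toy_pfn_to_mfn(unsigned long pfn)
{
        assert(pfn < TOY_NR_PFNS);
        return toy_p2m[pfn];
}

/* True if the nr guest-contiguous frames starting at pfn are also
 * contiguous in machine memory. */
bool toy_machine_contiguous(unsigned long pfn, size_t nr)
{
        size_t i;

        for (i = 1; i < nr; i++)
                if (toy_pfn_to_mfn(pfn + i) != toy_pfn_to_mfn(pfn) + i)
                        return false;
        return true;
}
```

In this table, frames 0 to 2 are contiguous in both views, while frames 1 to 3 are contiguous for the guest but scattered in machine memory. A ring allocated across such a range would look fine to the guest and broken to the backend.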

Is that all? No. One feature we need to disable is indirect buffer support, because it causes specific drivers to allocate buffers with kmalloc() at a much higher level. I wrestled with this problem for some time and found that I would rather leave those drivers alone than break them, so I chose to disable the feature for the moment. But it is critical to good performance, so I may try to enable it someday.

Good, we finally have our foundation ready! Let's start to tackle specific drivers. I chose the Virtio net driver as a start. Every driver has its own features. As mentioned above, we should avoid allocating buffers with kmalloc() at the driver level, so the CTRL_VQ feature needs to be disabled. In fact, I have no driver features enabled at the moment. What makes me really happy is that Virtio net almost works out of the box. For now I just want to make sure things work; premature optimization is evil.

What to do next? Virtio blk is my next goal. Hopefully it will not take too long, because I'm a bit behind schedule. Then I will start to port SPICE to Xen, and after that try to enable more features of Virtio net/blk and get better performance. That's the plan. Time is very limited, but I feel excited.

I’ve learned a lot during this period. I work together with the community. The interaction works out quite well. I discussed a lot with Xen developers and got a better understanding of Xen and QEMU, as well as Virtio itself.

Last but not least, I want to thank Stefano Stabellini, Ian Campbell, Konrad Wilk and those who helped me through my project in this hot summer. I would not have come so far without your help.

Written by liuw

July 16th, 2011 at 10:34 pm

Posted in Programming,Tech

Starting VirtIO for Pure PV

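Last week I wrote a few patches for Xen.

1. A simple typo fix; nothing much to say.
2. An HVMOP for injecting MSI-X for emulated devices, which has gone into staging. The QEMU part has also been sent to qemu-devel, but the QEMU developers said the injection of emulated MSI-X needs more discussion, so I will wait.
3. VirtIO disk support for libxl; this still needs more work.

VirtIO for HVM is actually more or less done now. Next comes the hardest part: VirtIO for pure PV.

Stefano said "we don't exactly know how long it is going to take", and there is quite a lot to consider:

1. What to write into Xenstore.
2. How VirtIO should be initialized.
3. Replacing the underlying implementation (evtchn and so on).

All this work also has to coexist peacefully with existing functionality, so I need to understand the existing design better before I can propose a reasonable one.

For the next couple of days I will read the existing code: first, the VirtIO code in the Linux kernel; second, the Xen PV initialization code; third, the Xenstore parameter layout of PV net and the like, for reference. Then I will discuss things in more depth with Stefano and Konrad.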

Written by liuw

May 30th, 2011 at 8:32 am

Posted in Programming

Progress on VirtIO for HVM

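Although I said I would get started as soon as possible, progress has actually been slow, because much of the time went into chasing other bugs. When Xen unstable is used together with SeaBIOS, the two disagree on the IRQ configuration, so IRQ injection ended up broken.

Since I am not familiar with the IO-APIC and LAPIC, I was stuck here for a long time. In the end it was Stefano who fixed the bug. I am ashamed.

With this bug fixed, the main problem of VirtIO for HVM turned out to be solved as well. As long as "pci=nomsi" is added to the guest command line, the VirtIO NIC is already usable.

Stefano told me the next step is to get VirtIO disk for HVM working. This has to wait until Ian Jackson has refactored libxenlight; it is mainly a matter of config parsing and drivers.

The step after that is MSI for emulated devices. Xen currently only supports injecting MSI into passthrough devices and has no interface for emulated devices, which is why nomsi has to be added as mentioned above.

There does not seem to be any great way to debug; it all comes down to adding printk and the like. But why others debug so fast while I am so slow is a real question. I still do not understand the low-level details well enough, and I need to keep working at it.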

Written by liuw

May 19th, 2011 at 8:14 pm

Posted in Programming

Porting VirtIO to Xen

by Wei Liu <liuw #SPAMFREE# liuw #DOT# name>

1 Overview

VirtIO is a unified paravirtualized I/O framework created by Rusty Russell. It is not hypervisor-specific, but it is mainly used with KVM. It should be possible to port VirtIO to Xen without much effort.

This article is organized into several sections. Section 2 explores how VirtIO is used in KVM, with much attention paid to code analysis. Section 3 discusses how we can port VirtIO to Xen, both for normal PV and PV-on-HVM. Section 4 describes what performance tests will be done. Section 5 introduces the porting plan for Spice (spice-space.org).

2 How VirtIO is used in KVM

This topic can be divided into three subtopics.

2.1 The role of KVM

KVM acts as hypervisor. It is responsible for capturing events then passing events to QEMU. I’m not going to illustrate how it works because this is out of our scope.

2.2 How VirtIO is used in Linux kernel

There are two key perspectives in creating a cross-domain communication channel:

  1. how to deliver events, e.g. Xen’s event channel, KVM’s handler to trap VM_EXIT and event dispatcher.
  2. how to share data, e.g. Xen’s ring buffer, VirtIO’s virtqueue.

Xen provides both of these out of the box. VirtIO, however, only provides the data-sharing mechanism, and notification is left to the user. VirtIO also registers itself as a bus inside the kernel, which resembles XenBus.

The core structure in VirtIO is the virtqueue. It contains a vring (much like a ring buffer) and some other information. One thing worth mentioning is that a virtqueue also wraps two important function pointers: one for notification, which is exactly what we need to replace, and one for the callback, which is largely irrelevant to porting.
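
The role of the two function pointers can be shown with a toy model. This is a sketch under stated assumptions, not kernel code: struct toy_virtqueue and every toy_ function are invented for illustration, and the two notify variants only count calls where a real transport would write to an I/O port or signal an event channel.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a virtqueue: only the two hooks matter here. */
struct toy_virtqueue {
        unsigned int index;
        void (*notify)(struct toy_virtqueue *vq);    /* guest -> host kick */
        void (*callback)(struct toy_virtqueue *vq);  /* completion handler */
        int pio_kicks;      /* kicks sent via the PCI-style path */
        int evtchn_kicks;   /* kicks sent via the event-channel path */
        int completions;
};

/* Models a PCI-style notify, which would write the queue index to a port. */
void toy_pci_notify(struct toy_virtqueue *vq) { vq->pio_kicks++; }

/* Hypothetical Xen replacement, which would signal an event channel. */
void toy_evtchn_notify(struct toy_virtqueue *vq) { vq->evtchn_kicks++; }

void toy_done(struct toy_virtqueue *vq) { vq->completions++; }

void toy_vq_init(struct toy_virtqueue *vq, unsigned int index,
                 void (*notify)(struct toy_virtqueue *),
                 void (*callback)(struct toy_virtqueue *))
{
        *vq = (struct toy_virtqueue){
                .index = index, .notify = notify, .callback = callback,
        };
}

/* Driver side: does not care which transport sits behind notify. */
void toy_kick(struct toy_virtqueue *vq)
{
        assert(vq->notify != NULL);
        vq->notify(vq);
}

/* Transport side: called from the interrupt or event handler. */
void toy_interrupt(struct toy_virtqueue *vq)
{
        if (vq->callback)
                vq->callback(vq);
}
```

Swapping toy_pci_notify for toy_evtchn_notify at init time changes the transport without touching toy_kick(), which is the property the porting plan relies on.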

Let’s take virtio network device as an example.

First things first: virtio network in the Linux kernel is implemented as a PCI device, so it is necessary to implement a virtio PCI bus. See drivers/virtio/virtio_pci.c for details. When porting to Xen, it might be necessary to replace this PCI bus with XenBus. (However, it might be left unchanged in PV-on-HVM, since we need to do a VM_EXIT and trap into the hypervisor anyway.)

The source of virtio network lies in drivers/net/virtio_net.c. virtnet_probe() is responsible for probing. In this function, virtnet is set up with at least two vrings, "input" and "output", and an optional vring for "control". The callback for "input" is skb_recv_done() and the callback for "output" is skb_xmit_done().

After the call to vdev->config->find_vqs(), these two or three vrings are set up. If we trace this function down (it lives in virtio_pci.c as vp_find_vqs()), we find that it in turn calls vp_try_to_find_vqs(), setup_vq() and request_irq().

In setup_vq(), the framework actually allocates the vring for data sharing. It is worth noting that the notify function is vp_notify(), which directly writes queue_index to (vp_dev->ioaddr+VIRTIO_PCI_QUEUE_NOTIFY) to generate a VM_EXIT, so that the hypervisor can catch the event and dispatch it.

And the requested irqs are used to invoke the callback functions, i.e. skb_xmit_done() and skb_recv_done() .

2.3 How VirtIO is used in QEMU

QEMU runs on top of KVM and it interacts with KVM via /dev/kvm . KVM has to cooperate with QEMU. Actually, a virtual machine in KVM is merely a process.

VM instructions execute natively on CPU. However, when a VM executes some sensitive instruction, it will be trapped by KVM. Then KVM passes this instruction to QEMU to emulate / handle it.

KVM hands over the instruction to QEMU in kvm_cpu_exec(), which is in kvm-all.c . There is a `switch` on the exit_reason. Exit reasons include (port) IO, interrupt and MMIO, etc.

QEMU has full access to the guest's memory, but it first has to locate the virtqueue inside the VM in order to communicate with virtio_net. When the VM calls setup_vq(), it voluntarily writes the virtqueue's address to (vp_dev->ioaddr+VIRTIO_PCI_QUEUE_PFN), which is trapped by KVM. KVM passes the write to QEMU. QEMU then calls virtio_ioport_write() -> virtio_queue_set_addr() to set up the VRing, which is the control structure used for VirtIO in QEMU. This is how QEMU and the VM create their data-sharing channel.

The notification channel is much simpler. As mentioned above, the VM writes the queue index to VIRTIO_PCI_QUEUE_NOTIFY. The write is handled by virtio_ioport_write() and then passed to virtio_queue_notify(). In virtio net's case, the request is finally handled by virtio_net_handle_rx() or virtio_net_handle_tx_{timer,bh}().
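
The two trapped writes described above can be modelled in a few lines. This is a toy sketch, not QEMU's virtio_ioport_write(): struct toy_device and toy_ioport_write are hypothetical, the offsets are assumed to match the legacy virtio PCI register layout, and the queue-select register is an addition the text above does not mention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets assumed to follow the legacy virtio PCI register layout. */
#define TOY_QUEUE_PFN           8
#define TOY_QUEUE_SEL           14
#define TOY_QUEUE_NOTIFY        16
#define TOY_QUEUE_ADDR_SHIFT    12
#define TOY_MAX_QUEUES          3

struct toy_device {
        uint16_t     queue_sel;                  /* queue the next PFN write applies to */
        uint64_t     ring_addr[TOY_MAX_QUEUES];  /* guest address of each vring */
        unsigned int notified[TOY_MAX_QUEUES];   /* kicks received per queue */
};

/* Models what a device model does with a trapped port write. */
void toy_ioport_write(struct toy_device *dev, uint32_t offset, uint32_t val)
{
        assert(dev != NULL);

        switch (offset) {
        case TOY_QUEUE_SEL:
                if (val < TOY_MAX_QUEUES)
                        dev->queue_sel = (uint16_t)val;
                break;
        case TOY_QUEUE_PFN:
                /* The guest writes a page frame number; store a byte address. */
                dev->ring_addr[dev->queue_sel] =
                        (uint64_t)val << TOY_QUEUE_ADDR_SHIFT;
                break;
        case TOY_QUEUE_NOTIFY:
                /* The guest writes the index of the queue it just filled. */
                if (val < TOY_MAX_QUEUES)
                        dev->notified[val]++;
                break;
        default:
                break;
        }
}
```

The PFN write establishes the data-sharing channel once per queue, while the notify write is the per-request kick. In a Xen PV port, the intent is for both to be replaced by Xenstore entries and event channels respectively.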

Many VirtIO-related files reside in QEMU's hw/ subdirectory, and many more in the Linux kernel tree.

3 How to port VirtIO to Xen

Now that we have a first impression of VirtIO, it is time to discuss how we can port it to Xen.

3.1 PV-on-HVM

It is obvious that in the PV-on-HVM case, things are more or less the same as they are in KVM. Xen traps VM_EXIT and passes exception to QEMU. QEMU emulates. Then Xen sends result, VM resumes running.

3.1.1 How to grab virtqueue address

Xen utilizes QEMU as KVM does. So Linux kernel can stay untouched. Xen captures guest’s write to PCI configuration space, then QEMU will handle changes in VirtIO configuration.

3.1.2 How to deliver event

VirtIO's current virtual PCI implementation ($QEMU/hw/virtio-pci.c) uses KVM's event notification functions such as kvm_set_ioeventfd_pio_word() and kvm_has_many_ioeventfds(). It is necessary to replace them with corresponding implementations for Xen. Anthony's QEMU-dm ships with xen-all.c, which might give me some hints on the implementation.

3.2 Normal PV

In the normal PV case, we will have to do things in a different way. I will try my best to detail what should be done and how. However, this is just a rough design, and things may change during implementation.

3.2.1 How to grab virtqueue address

By Xen convention, Dom0 and DomU commonly use Xenstore to expose their public information, as netfront/netback, blkfront/blkback, etc. do. So it is a good idea to use Xenstore to expose VirtIO information. Xenstore's well-defined API will greatly reduce the work needed.

Down to implementation level, it is necessary to replace virtual PCI bus with XenBus. VirtIO utilizes virtual PCI to configure network device. However, in normal PV case, it is not necessary to expose a virtual PCI device to VM. We can follow the pattern how netfront and netback establish their channel.

QEMU-dm from Anthony has functions to manipulate Xenstore, that should help a lot. It also has xen_nic.c, which can greatly inspire how I can implement a VirtIO network for Xen.

3.2.2 How to deliver event

No doubt that event channel is the best choice. Anthony’s QEMU-dm contains a file named xen_backend.c, which is used for event handling. Linux kernel has event channel handling functions, too. (drivers/xen/{events,evtchn}.c)

So, just replace any notification-related function with Xen’s implementation. That’s the plan.

3.3 Other stuff

In the PV-on-HVM case, QEMU needs to emulate. In the normal PV case, no emulation is needed. In either case, QEMU works as the backend dispatcher for VirtIO. Once the channel between the two VMs is established, QEMU is supposed to work out of the box. However, I can't be too optimistic here; it might require some work, such as rewriting and debugging some functions.

3.4 Knowledge required

To be honest, QEMU's concepts (like proxy / virtual device management) are somewhat unfamiliar to me. There are some high-level design documents, but they are just too high-level. A thorough understanding of QEMU's internals is required.

Knowledge of hardware virtualization is also required. I need to understand how Xen implements the HVM interface and choose the right function for each piece of functionality.

Knowledge of XenBus configuration is a must in normal PV porting. I’ve read about it before, so this is the easier part.

4 Performance tests

Performance tests will be run with industry-standard software like kernbench, ioperf and netperf. Test suites will be run on several different configurations:

  • Native Linux, CPU, disk and network.
  • Xen with normal PV VirtIO support, CPU, disk and network.
  • Xen with PV-on-HVM VirtIO support, CPU, disk and network.
  • Xen with original PV support, CPU, disk and network.
  • KVM with VirtIO support, CPU, disk and network.

A short report will be written based on the results, comparing the data collected and analyzing the advantages and disadvantages of each configuration.

5 Porting of Spice

Spice will be ported to Xen's HVM environment as a real-world test suite. According to its design, Spice communicates with QEMU via the Virtual Device Interface (VDI). The Spice client and server run entirely in userland (correct me if I'm wrong; I'm not a Spice expert). If we are able to run QEMU with QXL or any other VirtIO devices on Xen, it should not be hard to get Spice running on Xen.

AFAIK, QXL in QEMU (hw/qxl.c) uses its own paravirtualized ring implementation. It also uses qemu_set_irq() to deliver events. So the main idea is to replace this implementation with Xen's equivalents, which will already have been done in porting VirtIO.

The plan is to run Spice with our modified QEMU and eliminate any bugs encountered.

6 Reference

  • Linux kernel 2.6.38.2
  • QEMU-dm, git://xenbits.xen.org/people/aperard/qemu-dm.git
  • Xen-unstable, git://xenbits.xen.org/xen-unstable.git
  • Spice project, spice-space.org

Written by liuw

April 26th, 2011 at 10:38 pm

Posted in Programming

Variadic Macros in C

The GCC preprocessor supports variadic macros, i.e. macros that take a variable number of arguments. They are defined much like variadic functions.

#define SCHED_OP(fn, ...) \
    (( ops.fn != NULL ) ? ops.fn(__VA_ARGS__) \
    : ( typeof(ops.fn(__VA_ARGS__)))0 )

In the macro definition, three dots stand for the variable arguments, and __VA_ARGS__ refers to them in the body.
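
The SCHED_OP pattern can be tried in user space. The sketch below is only an illustration: struct toy_ops, TOY_OP and the toy_ and call_ functions are hypothetical names, and __typeof__ is used instead of typeof because GCC also accepts that spelling in strict ISO modes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table; both hooks are optional and may be NULL. */
struct toy_ops {
        int  (*pick)(int cpu);
        void (*tick)(void);
};

struct toy_ops ops;
int tick_count;

/* Same shape as SCHED_OP. __typeof__ does not evaluate its operand, so
 * the call inside it only serves to name the hook's return type. */
#define TOY_OP(fn, ...) \
        ((ops.fn != NULL) ? ops.fn(__VA_ARGS__) \
         : (__typeof__(ops.fn(__VA_ARGS__)))0)

int toy_pick(int cpu)
{
        assert(cpu >= 0);
        return cpu + 1;
}

void toy_tick(void)
{
        tick_count++;
}

/* Returns 0 when ops.pick is NULL, otherwise whatever the hook returns. */
int call_pick(int cpu)
{
        return TOY_OP(pick, cpu);
}

/* Does nothing when ops.tick is NULL; the fallback branch is (void)0. */
void call_tick(void)
{
        TOY_OP(tick);
}
```

The cast to the hook's return type is what lets one macro serve hooks returning int, pointers or void alike.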

Although the two are "similar", variadic macros and variadic functions do differ. For example:

#define eprintf(format, ...) fprintf (stderr, format, __VA_ARGS__)

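This macro requires __VA_ARGS__ to contain at least one argument; otherwise an extra comma is left after format, which is a syntax error.

There is a way around this, though. With the form below, preprocessing causes no problem even when __VA_ARGS__ is empty: the preprocessor removes the preceding comma along with it.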

#define eprintf(format, ...) fprintf (stderr, format, ##__VA_ARGS__)
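
To see the comma deletion at work, here is a small sketch. The macro bprintf and the two wrapper functions are hypothetical; the macro formats into a buffer instead of stderr so the result can be inspected, and it relies on the same GNU "##" extension.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: like eprintf, but writes into a caller's buffer.
 * "##" drops the preceding comma when no variable arguments are given. */
#define bprintf(buf, size, format, ...) \
        snprintf((buf), (size), format, ##__VA_ARGS__)

/* No variable arguments: expands to snprintf(buf, size, "no arguments"). */
int format_without_args(char *buf, size_t size)
{
        assert(buf != NULL && size > 0);
        return bprintf(buf, size, "no arguments");
}

/* Two variable arguments: the comma is kept as usual. */
int format_with_args(char *buf, size_t size, const char *name, int code)
{
        assert(buf != NULL && size > 0);
        return bprintf(buf, size, "%s failed: %d", name, code);
}
```

Without the "##", format_without_args() would expand to a call with a trailing comma and fail to compile.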

Reference: http://gcc.gnu.org/onlinedocs/cpp/Variadic-Macros.html

Written by liuw

March 29th, 2011 at 1:27 pm

Posted in Programming

A First Try at Meta-Programming

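This afternoon I did some searching to get a feel for the meta-programming abilities of Perl, Python and Ruby, and found that Ruby has the best support of the three, both in design and in syntax. Rails makes heavy use of meta-programming techniques, for example has_many and find_by_XXX.

Ruby's advantages for meta-programming are:

  • All objects are open
  • All definitions are "live" and can execute code directly
  • Many hooks are available for meta-programming

I did not know much about meta-programming before, but after skimming a tutorial I could write working code right away.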

class Dog
  def initialize(name)
    @name = name
  end

  def can(*skills)
    skills.each do |skill|
      case skill
      when :dance then
        def self.dance
          @name + " is dancing"
        end
      when :poo then
        def self.poo
          @name + " is a smelly dog"
        end
      when :laugh then
        def self.laugh
          @name + " finds this hilarious!"
        end
      end

    end
  end

  def method_missing(methodname)
    @name + " doesn't understand " + methodname.to_s
  end

end

lassie, fido, stimpy = %w[Lassie Fido Stimpy].collect{|name| Dog.new(name)}

lassie.can :dance, :poo, :laugh
fido.can :poo
stimpy.can :dance

p lassie.dance
p lassie.poo
p lassie.laugh
puts
p fido.dance
p fido.poo
p fido.laugh
puts
p stimpy.dance
p stimpy.poo
p stimpy.laugh

Python version (not sure whether it is entirely correct):

class Dog:
    def __init__(self, name):
        self.name = name

    def can(self, *skills):
        for skill in skills:
            # Bind skill as a default argument so each lambda keeps its own value.
            setattr(self, str(skill), lambda skill=skill: skill)

    def __getattr__(self, methodname):
        return lambda: self.name + " doesn't understand " + methodname

lassie = Dog("Lassie")
lassie.can('dance', 'poo')

stimpy = Dog("Stimpy")
stimpy.can('laugh')

print(lassie.dance())
print(lassie.poo())

print(stimpy.dance())

Written by liuw

March 2nd, 2011 at 3:02 pm

Posted in Programming

map and filter in Continuation Passing Style

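I have been relearning Haskell lately, and I feel I am getting a lot out of it. I used to have many misconceptions about the language (I had not read carefully enough or thought hard enough).

Here are map and filter in Continuation Passing Style, my own solution to exercise 4-12 of Yet Another Haskell Tutorial.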

cmap' f' z [] = z
cmap' f' z (x:xs) = f' x z (\y -> cmap' f' y xs)

cmap f l = cmap' (\x t g -> (f x):(g t)) [] l

cfilter' f' z [] = z
cfilter' f' z (x:xs) = f' x z (\y -> cfilter' f' y xs)

cfilter f l = cfilter' (\x t g -> (if f x then [x] else []) ++ (g t)) [] l

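cmap' and cfilter' are identical in form; what matters is the different f' passed in.

I do not yet have any deep appreciation of what makes CPS so neat. There is more to learn.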

Written by liuw

October 20th, 2010 at 10:51 am