Liuw's Thinkpad

To win, first learn to lose; to succeed, first learn to fail

Archive for the ‘kernel’ tag

A few tips on debugging Oopses

without comments

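Walk along the river often enough and you will get your feet wet; writing programs without ever hitting a bug is impossible. For a kernel newbie, triggering a kernel oops is all the more routine.

Judging from the oopses I have run into so far, the first step in debugging is to check whether the bug can be reproduced reliably.

If it cannot be reproduced reliably (the trigger conditions vary and the Oops messages vary), then congratulations: you almost certainly have a race condition. Such problems can be big or small, and in the end the fault will be in your own code (assuming the other modules are stable). At that point, first walk through every code path to check the logic, then check every lock.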

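If it can be reproduced reliably, things are somewhat easier. Narrow down the faulting statement step by step with printk. You can also disassemble the object with objdump and, together with the stack trace in the oops, work out roughly which statement is at fault, though personally I find that of limited help. I basically do not know how to use the more advanced tools, embarrassingly.

The most powerful debug tool is still printk, because it is robust in any context. What a great feature! (sigh)

What? No stack trace? You did not enable the kernel's debug options? Then forget I said anything.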

Written by liuw

January 6th, 2012 at 3:37 pm

Posted in Programming


A simple memory pool for kernel use

without comments

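After a few days of coding, netback has now moved to a thread-per-vif model. Very limited testing suggests that in some cases the new netback performs slightly better than the old one (at least it is no worse).

To bound the memory footprint of netback under the new model, a unified memory pool is needed. In the spirit of keeping things as simple as possible, it uses a free list to manage the available entries and implements simple get-one and put-one operations. I may need this again later, so I am noting it down here. I will not claim it is well implemented, but it is good enough.

All pages are allocated with alloc_page, so in the initial state their reference count should be 1.

Locks are used as sparingly as possible. put takes no lock, because the write is atomic; the actual page reclaim is deferred until the pool has no free pages left. If get-n and put-n operations were to be implemented, however, put would have to take the lock as well.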

#define PAGE_POOL_ORDER         10
#define PAGE_POOL_SIZE          (1 << PAGE_POOL_ORDER)
#define PAGE_POOL_NR_RESERVED   0

typedef uint32_t idx_t;
#define FREE_LIST_END           0xffffffff
#define INVAL_IDX               FREE_LIST_END
#define RECLAIMABLE             (INVAL_IDX - 1)

struct page_pool {
        struct page      **pool;
        idx_t             *free_list;
        idx_t              free_head;
        int                free_count;
        spinlock_t         lock;
};

static struct page_pool *page_pool;

int get_free_entry(void)
{
        unsigned long flag;
        idx_t head;
        int idx;

        spin_lock_irqsave(&page_pool->lock, flag);

        if (page_pool->free_count == 0) {
                /*
                 * XXX liuw: reclaim pages lazily. An entry can be reused
                 * when it was put back (free_list[idx] == RECLAIMABLE) and
                 * only the pool still references the page
                 * (page_count() == 1).
                 */
                for (idx = 0; idx < PAGE_POOL_SIZE; idx++) {
                        if (page_pool->free_list[idx] == RECLAIMABLE &&
                            page_count(page_pool->pool[idx]) == 1) {
                                /* Push the entry back onto the free list. */
                                page_pool->free_list[idx] =
                                        page_pool->free_head;
                                page_pool->free_head = idx;
                                page_pool->free_count++;
                        }
                }
                if (likely(page_pool->free_count > 0))
                        goto found_free_page;
                spin_unlock_irqrestore(&page_pool->lock, flag);
                return -ENOSPC;
        }

found_free_page:
        idx = head = page_pool->free_head;
        page_pool->free_count--;
        page_pool->free_head = page_pool->free_list[head];
        page_pool->free_list[head] = FREE_LIST_END;

        spin_unlock_irqrestore(&page_pool->lock, flag);

        return idx;
}

static inline void put_free_entry(idx_t idx)
{
        page_pool->free_list[idx] = RECLAIMABLE;
}
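
The listing above does not show how the pool is initialized. The following is a minimal userspace sketch of the same free-list scheme, written only to illustrate the idea; it is not the netback code. The names `struct pool`, `pool_init`, `pool_reclaim`, `pool_get`, `pool_put` and `POOL_SIZE` are hypothetical, the `busy` array is an assumed stand-in for `page_count(page) > 1`, and locking is left out because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE      8            /* hypothetical; the post uses 1 << 10 */
#define FREE_LIST_END  0xffffffffu
#define RECLAIMABLE    (FREE_LIST_END - 1)

typedef uint32_t idx_t;

struct pool {
        idx_t free_list[POOL_SIZE]; /* next-free link, or a marker value */
        int   busy[POOL_SIZE];      /* assumed stand-in for page_count() > 1 */
        idx_t free_head;
        int   free_count;
};

/* Chain every entry into the free list: 0 -> 1 -> ... -> FREE_LIST_END. */
static void pool_init(struct pool *p)
{
        idx_t i;

        for (i = 0; i < POOL_SIZE; i++) {
                p->free_list[i] = (i + 1 < POOL_SIZE) ? i + 1 : FREE_LIST_END;
                p->busy[i] = 0;
        }
        p->free_head = 0;
        p->free_count = POOL_SIZE;
}

/* Lazy reclaim: relink entries that were put back and are no longer busy. */
static void pool_reclaim(struct pool *p)
{
        idx_t i;

        for (i = 0; i < POOL_SIZE; i++) {
                if (p->free_list[i] == RECLAIMABLE && !p->busy[i]) {
                        p->free_list[i] = p->free_head;
                        p->free_head = i;
                        p->free_count++;
                }
        }
}

/* Pop one entry; returns its index, or -1 when the pool is exhausted. */
static int pool_get(struct pool *p)
{
        idx_t head;

        if (p->free_count == 0)
                pool_reclaim(p);
        if (p->free_count == 0)
                return -1;

        head = p->free_head;
        assert(head < POOL_SIZE);
        p->free_head = p->free_list[head];
        p->free_list[head] = FREE_LIST_END;
        p->free_count--;
        return (int)head;
}

/* Mark an entry as returned; relinking happens later in pool_reclaim(). */
static void pool_put(struct pool *p, idx_t idx)
{
        p->free_list[idx] = RECLAIMABLE;
}
```

In this model `pool_put` is a single store, which is why the original can skip the lock there; all list manipulation happens on the get side.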

Written by liuw

December 12th, 2011 at 11:29 pm

Posted in Programming


How hard is it to write a patch for the Linux kernel?

with one comment

Good question!

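The answer: not hard at all. Open any file, find the spelling and grammar mistakes in the comments, and send the fix out following the method described in "Sending patches with Git". Such patches usually get acked and merged into the repository.

This is a serious matter, mind you: it is a contribution to the community. The barrier is low and it is worth a try. :-)

Of course, the prerequisite is knowing how to use Git and owning a good dictionary, heh.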

Written by liuw

December 8th, 2011 at 6:39 pm

Posted in 戏言 (Banter)


Some reflections on kernel-level development

without comments

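When something goes wrong, the first step should be to check for typos, not the logic or the design.

The code I wrote yesterday deadlocked, and I went straight to checking the logic, which was the wrong move. It turned out that a lock had been written as an unlock.

Kernel code has to be near perfect. Have confidence in yourself: when the code is split into small enough pieces, it is hard to make a logic error.

Debugging is hard, and printk is your best friend. Ultimately, though, you still need a deep understanding of the design and the implementation. My own habit is to place printk calls along the execution path I expect and then compare them against what actually runs; it works every time.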
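
The habit of comparing the expected path against the actual one can be sketched in userspace as follows. This is only an illustration under assumptions: `TRACE()` is a hypothetical stand-in for a printk breadcrumb (it prints with `fprintf` and records the function name), and `process_request`, `take_lock`, `handle`, `release_lock` and `first_divergence` are invented names, not kernel APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TRACE_MAX 16

static const char *trace_log[TRACE_MAX];
static int trace_len;

/* Userspace stand-in for a printk() breadcrumb: print and remember the caller. */
#define TRACE() do {                                            \
        fprintf(stderr, "trace: %s:%d\n", __func__, __LINE__);  \
        if (trace_len < TRACE_MAX)                              \
                trace_log[trace_len++] = __func__;              \
} while (0)

/* Hypothetical code under test: lock, handle, unlock. */
static void take_lock(void)    { TRACE(); }
static void handle(void)       { TRACE(); }
static void release_lock(void) { TRACE(); }

static void process_request(int buggy)
{
        TRACE();
        take_lock();
        if (!buggy)             /* the "bug" silently skips a step */
                handle();
        release_lock();
}

/*
 * Return the index of the first step that deviates from the expected
 * path, or -1 if the recorded path matches it exactly. Extra recorded
 * steps beyond the expected ones are reported at index n.
 */
static int first_divergence(const char *const *expected, int n)
{
        int i;

        assert(n >= 0);
        for (i = 0; i < n; i++) {
                if (i >= trace_len || strcmp(trace_log[i], expected[i]) != 0)
                        return i;
        }
        return (trace_len == n) ? -1 : n;
}
```

Writing the expected path down first is the point: the first mismatch tells you where your mental model and the code part ways.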

Pay special attention to reentrant and non-reentrant code, and to the conditions under which locks are taken.

Written by liuw

December 15th, 2010 at 11:41 am

Posted in Tech


Generic data structures in the Linux kernel

without comments

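The Linux kernel implements a number of generic data structures. The ones I know of so far are:

  1. Doubly linked list: include/linux/list.h
  2. Red-black tree: include/linux/rbtree.h lib/rbtree.c
  3. Radix tree: include/linux/radix-tree.h lib/radix-tree.c
  4. Circular buffer: include/linux/circ_buf.h

Most of the generic data structures in the Linux kernel provide a "joint", a connection point (the term is my own invention). The benefit is that the programmer keeps the focus on the target data structure and uses generic functions for the basic operations: the joint is embedded in the target structure, rather than the target structure being embedded in the generic one.

For the past two days I have been modifying Xen code and needed a tree to store and look up data, so I ported the Linux red-black tree to Xen and wrote the related code along the way. To stay "generic", kernel data structures usually provide only a minimal feature set. Linked lists are fine: the operations are simple, so list.h implements all of them. The red-black tree is more complex, and operations such as insertion and search are left to the user to implement.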
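
The "joint" idea can be sketched in userspace with an intrusive list. This is a simplified illustration, not the kernel's list.h: `struct list_node`, `list_init`, `list_add_tail`, `struct vif` and `vif_find` are hypothetical names, and `container_of` is reimplemented here with `offsetof`. The lookup function is written by the user, in the same spirit as the red-black tree leaving insertion and search to its caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The generic "joint": it knows nothing about the structure it lives in. */
struct list_node {
        struct list_node *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_node *head)
{
        head->next = head->prev = head;
}

/* Generic operation: link a node in just before the head, i.e. at the tail. */
static void list_add_tail(struct list_node *node, struct list_node *head)
{
        assert(node && head);
        node->prev = head->prev;
        node->next = head;
        head->prev->next = node;
        head->prev = node;
}

/* The target structure (hypothetical) embeds the joint as an ordinary member. */
struct vif {
        int id;
        const char *name;
        struct list_node link;
};

/* Lookup is written by the user, because only the user knows the key. */
static struct vif *vif_find(struct list_node *head, int id)
{
        struct list_node *pos;

        for (pos = head->next; pos != head; pos = pos->next) {
                struct vif *v = container_of(pos, struct vif, link);

                if (v->id == id)
                        return v;
        }
        return NULL;
}
```

The generic code only ever touches `struct list_node`; everything specific to the payload stays in the target structure and its own helpers.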

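(I originally wanted to publish the part I changed in Xen, but that would not be appropriate. This article has no substance at all; no need to read it.)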

Author: Wei LIU
<liuw at liuw dot name>

Date: 2010-12-07 19:30:50 CST


Written by liuw

December 7th, 2010 at 7:33 pm

How the Linux kernel frees page tables

without comments

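Code version: 2.6.18-xen.

When a process exits, Linux calls mmput, which in turn calls exit_mmap.

exit_mmap first calls unmap_vmas to reclaim the physical page frames. unmap_vmas calls unmap_page_range.

unmap_page_range walks down and releases the pud, pmd and pte levels in turn.

The relevant functions are zap_pud_range, zap_pmd_range and zap_pte_range.

In zap_pte_range, vm_normal_page retrieves the page structure, then ptep_get_and_clear_full clears the PTE to 0. Finally, the page's reference count is decremented by 1.

After unmap_vmas completes, free_pgtables is called. It goes through free_pgd_range, free_pud_range, free_pmd_range and free_pte_range in turn. free_pte_range does not actually free the mapped physical pages, however; those were already released in the previous step.

Under Xen, zap_pte_range calls xen_l1_entry_update to have Xen update the actual PTE. A non-XenoLinux kernel presumably updates the PTE directly.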

Written by liuw

December 2nd, 2010 at 4:33 pm

Posted in UNIX-like


Symbol Type Notation in System.map

without comments

System.map is created by nm(1) with the following command:

$ nm /boot/vmlinux-xxxx > System.map-xxxx

Note that ‘vmlinux’ is not ‘vmlinuz’. We use vmlinuz to boot the system, while vmlinux is only an intermediate object when producing vmlinuz.

Several symbol types appear in a kernel's System.map: A, b, B, d, D, r, R, t, T and W.

The symbol type notations are as follows (taken from the nm man page):

If lowercase, the symbol is local; if uppercase, the symbol is global (external).

  • “A” The symbol’s value is absolute, and will not be changed by further linking.
  • “B” “b” The symbol is in the uninitialized data section (known as BSS).
  • “D” “d” The symbol is in the initialized data section.
  • “R” “r” The symbol is in a read only data section.
  • “T” “t” The symbol is in a text (code) section.
  • “W” “w” The symbol is a weak symbol that has not been specifically tagged as a weak object symbol. When a weak defined symbol is linked with a normal defined symbol, the normal defined symbol is used with no error. When a weak undefined symbol is linked and the symbol is not defined, the value of the symbol is determined in a system-specific manner without error. On some systems, uppercase indicates that a default value has been specified.

There are other symbol types; see the nm man page for details.
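
As a rough illustration of how these letters might be consumed, here is a small C sketch that parses one "address type name" line and classifies the type letter. The helper names `symbol_section`, `symbol_is_global` and `parse_symbol_line` are hypothetical, the 127-character name limit is an assumption, and only the letters listed above are covered.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Map an nm(1) type letter to a short label; covers only the letters above. */
static const char *symbol_section(char type)
{
        switch (toupper((unsigned char)type)) {
        case 'A': return "absolute";
        case 'B': return "bss";
        case 'D': return "data";
        case 'R': return "rodata";
        case 'T': return "text";
        case 'W': return "weak";
        default:  return "other";
        }
}

/*
 * 1 if global, 0 if local, -1 for weak symbols, where the case of the
 * letter carries a different meaning (see the "W" entry above).
 */
static int symbol_is_global(char type)
{
        if (type == 'W' || type == 'w')
                return -1;
        return isupper((unsigned char)type) ? 1 : 0;
}

/* Parse one "address type name" line. Returns 0 on success, -1 otherwise. */
static int parse_symbol_line(const char *line, unsigned long long *addr,
                             char *type, char name[128])
{
        assert(line && addr && type && name);
        return sscanf(line, "%llx %c %127s", addr, type, name) == 3 ? 0 : -1;
}
```

Feeding it a line such as `ffffffff81000000 T _text` (a made-up sample in the System.map layout) yields the address, the letter `T`, and the name `_text`.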

Written by liuw

November 11th, 2010 at 4:53 pm

Posted in UNIX-like


Building modules against installed kernel

without comments

I’ve been reading Linux Device Drivers, 3rd edition, for quite a long time, and once built a sacrificial test system in VMware. After I upgraded VMware, the sacrificial kernel hung at boot, unable to find the root filesystem. It seems to be a driver issue: no matter how I compiled the kernel, it hung at the same place.

I finally gave up; I don’t want to waste any more time. Maybe I should just dump that 2.6.10 and try newer kernels. No need to

I have a Debian system as my development machine, so I just need to install the kernel headers to get a module-building environment.

# apt-get install linux-headers-`uname -r`

When it’s done, the build directory should be found at:

# ls -d /lib/modules/`uname -r`/build

Here is a handy Makefile for modules.

obj-m += hello.o

# Build tree of the running kernel; override with KDIR=... if needed.
KDIR ?= /lib/modules/$(shell uname -r)/build

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

Written by liuw

July 16th, 2010 at 11:49 pm

Posted in UNIX-like


Cheat sheet for locking

without comments

Pete Zaitcev gives the following summary:

  • If you are in a process context (any syscall) and want to lock other processes out, use a semaphore. You can take a semaphore and sleep (copy_from_user* or kmalloc(x, GFP_KERNEL)).
  • Otherwise (== data can be touched in an interrupt), use spin_lock_irqsave() and spin_unlock_irqrestore().
  • Avoid holding a spinlock for more than 5 lines of code and across any function call (except accessors like readb).


Written by liuw

July 14th, 2010 at 2:04 pm

RCU revisited

without comments

I once read some papers on the Read-Copy Update mechanism, but I never quite got one point: how can the updater be sure that no reader still holds a reference to the old data once every CPU has been scheduled at least once, so that it can safely reclaim the old data?

I found the answer in Linux Device Drivers, 3rd edition, quoted below.

On the read side, code using an RCU-protected data structure should bracket its references with calls to rcu_read_lock and rcu_read_unlock. As a result, RCU code tends to look like:

struct my_stuff *stuff;

rcu_read_lock();
stuff = find_the_stuff(args…);
do_something_with(stuff);
rcu_read_unlock();

The rcu_read_lock call is fast; it disables kernel preemption but does not wait for anything. The code that executes while the read “lock” is held must be atomic. No reference to the protected resource may be used after the call to rcu_read_unlock.

[...snip...]

All that remains is to free the old version. The problem, of course, is that code running on other processors may still have a reference to the older data, so it cannot be freed immediately. Instead, the write code must wait until it knows that no such reference can exist. Since all code holding references to this data structure must (by the rules) be atomic, we know that once every processor on the system has been scheduled at least once, all references must be gone. So that is what RCU does; it sets aside a callback that waits until all processors have scheduled; that callback is then run to perform the cleanup work.

As the passage quoted above puts it, RCU works mostly "by the rules": the guarantee holds only because every reader obeys them.
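
The "every processor has scheduled at least once" argument can be modeled with a toy, single-threaded sketch. It is not how the kernel implements RCU; it only mirrors the reasoning in the quote. `NR_CPUS`, `ctxsw`, `struct grace_period`, `cpu_schedule`, `grace_period_start` and `grace_period_done` are all hypothetical names for this model.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4       /* hypothetical CPU count for the model */

/* Per-CPU count of context switches, i.e. points outside any read section. */
static unsigned long ctxsw[NR_CPUS];

struct grace_period {
        unsigned long snapshot[NR_CPUS];
};

/* A CPU schedules: any atomic read-side section it was running has ended. */
static void cpu_schedule(int cpu)
{
        assert(cpu >= 0 && cpu < NR_CPUS);
        ctxsw[cpu]++;
}

/* Updater: after unpublishing the old data, record where every CPU stands. */
static void grace_period_start(struct grace_period *gp)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                gp->snapshot[cpu] = ctxsw[cpu];
}

/* True once every CPU has scheduled at least once since the snapshot. */
static bool grace_period_done(const struct grace_period *gp)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (ctxsw[cpu] == gp->snapshot[cpu])
                        return false;
        }
        return true;
}
```

Because a reader may not sleep or be scheduled away inside its read-side section, a context switch on a CPU proves that any section started there before the snapshot has finished; once that holds for all CPUs, the old data can be freed.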

Written by liuw

July 12th, 2010 at 8:56 pm