What is going on with UTF-8 in String#unpack?

RednaxelaFX

Blog category: Ruby

Tags: Ruby, C, C++, C#

Ruby Weekly Quiz - Truncating strings that mix Chinese and English

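This quiz posted by Quake Wang is quite interesting and worth a look. Ruby's character encoding issues have been bothering me for a good while, and sure enough, I got caught again this time.
Lao Zhuang's solution:

庄表伟 wrote: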

def truncate_u(text, length = 30, truncate_string = "...")
  l=0
  char_array=text.unpack("U*")
  char_array.each_with_index do |c,i|
    l = l+ (c<127 ? 0.5 : 1)
    if l>=length
      return char_array[0..i].pack("U*")+(i<char_array.length-1 ? truncate_string : "")
    end
  end
  return text
end

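On seeing Lao Zhuang's solution, my first instinct was: CJK characters take three bytes in UTF-8, so wouldn't the values computed after unpack come out wrong? So I checked what the RDoc says: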

-------+---------+-----------------------------------------
  U    | Integer | UTF-8 characters as unsigned integers
-------+---------+-----------------------------------------

Reading this, I assumed it splits a UTF-8 string into bytes, but it turns out that each character maps to a single integer.
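To make the difference concrete, here is a small sketch comparing the "C*" directive (raw bytes) with "U*" (one integer per character). The sample string is my own choice, and the snippet assumes a Ruby version that understands "\u" escapes (1.9 or later).

```ruby
s = "a\u4E2D"                   # "a中": one ASCII character, one CJK character

bytes       = s.unpack("C*")    # raw bytes of the UTF-8 encoding
code_points = s.unpack("U*")    # one integer per decoded character

p bytes        # [97, 228, 184, 173]
p code_points  # [97, 20013]
```

The CJK character occupies three bytes but yields a single integer, 20013 (0x4E2D), which is why the `c<127` test in the solution above sees whole characters rather than bytes.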

But what exactly is that mapping? String#unpack is implemented in pack.c:

case 'U':
  if (len > send - s) len = send - s;
  while (len > 0 && s < send) {
    long alen = send - s;
    unsigned long l;
    
    l = utf8_to_uv(s, &alen);
    s += alen; len--;
    rb_ary_push(ary, ULONG2NUM(l));
  }
  break;
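As far as I can tell from the loop above, the count after "U" is consumed once per decoded character (`len--`), not once per byte, while `s` advances by however many bytes the character used. A quick sketch to check that reading, with a sample string of my own:

```ruby
s = "\u65E5\u672C\u8A9Eabc"     # "日本語abc": 3 CJK characters plus 3 ASCII characters

first_two = s.unpack("U2")      # count of 2 means two characters, not two bytes
all_chars = s.unpack("U*")      # "*" consumes the rest of the string

p first_two          # [26085, 26412]
p all_chars.length   # 6
p s.bytesize         # 12
```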

And the function that converts a single character is:

static unsigned long
utf8_to_uv(p, lenp)
    char *p;
    long *lenp;
{
    int c = *p++ & 0xff;
    unsigned long uv = c;
    long n;

    if (!(uv & 0x80)) {
        *lenp = 1;
        return uv;
    }
    if (!(uv & 0x40)) {
        *lenp = 1;
        rb_raise(rb_eArgError, "malformed UTF-8 character");
    }

    if      (!(uv & 0x20)) { n = 2; uv &= 0x1f; }
    else if (!(uv & 0x10)) { n = 3; uv &= 0x0f; }
    else if (!(uv & 0x08)) { n = 4; uv &= 0x07; }
    else if (!(uv & 0x04)) { n = 5; uv &= 0x03; }
    else if (!(uv & 0x02)) { n = 6; uv &= 0x01; }
    else {
        *lenp = 1;
        rb_raise(rb_eArgError, "malformed UTF-8 character");
    }
    if (n > *lenp) {
        rb_raise(rb_eArgError, "malformed UTF-8 character (expected %d bytes, given %d bytes)",
             n, *lenp);
    }
    *lenp = n--;
    if (n != 0) {
        while (n--) {
            c = *p++ & 0xff;
            if ((c & 0xc0) != 0x80) {
                *lenp -= n + 1;
                rb_raise(rb_eArgError, "malformed UTF-8 character");
            }
            else {
                c &= 0x3f;
                uv = uv << 6 | c;
            }
        }
    }
    n = *lenp - 1;
    if (uv < utf8_limits[n]) {
        rb_raise(rb_eArgError, "redundant UTF-8 sequence");
    }
    return uv;
}

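It first determines the length of the UTF-8 character in bytes, and then the while loop folds the remaining bytes into the result. But I still haven't figured out what those magic numbers mean, mainly the 0x3f and the 6. I'll go back and reread the UTF-8 description...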
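For what it's worth, in the UTF-8 layout every byte after the first has the form 10xxxxxx, so it carries 6 payload bits. That would explain both numbers: 0x3f (binary 00111111) masks off the 10 marker, and the shift by 6 makes room for those bits. Below is a rough Ruby sketch of the same idea. `decode_utf8_char` is a hypothetical helper of mine, not part of Ruby; it only handles sequences of up to 4 bytes and skips the overlong ("redundant") check that the C code does with utf8_limits.

```ruby
# Hypothetical helper (not part of Ruby): decode one UTF-8 character from
# an array of byte values, loosely following the structure of utf8_to_uv.
# Returns [code_point, bytes_consumed].
def decode_utf8_char(bytes)
  lead = bytes[0]
  return [lead, 1] if (lead & 0x80) == 0      # 0xxxxxxx: ASCII, one byte

  # The lead byte encodes the sequence length in its high bits;
  # the low bits left over are the first part of the payload.
  if    (lead & 0xe0) == 0xc0 then n, uv = 2, lead & 0x1f   # 110xxxxx
  elsif (lead & 0xf0) == 0xe0 then n, uv = 3, lead & 0x0f   # 1110xxxx
  elsif (lead & 0xf8) == 0xf0 then n, uv = 4, lead & 0x07   # 11110xxx
  else raise ArgumentError, "malformed UTF-8 character"
  end
  raise ArgumentError, "malformed UTF-8 character" if bytes.length < n

  bytes[1, n - 1].each do |c|
    # Continuation bytes must look like 10xxxxxx.
    raise ArgumentError, "malformed UTF-8 character" unless (c & 0xc0) == 0x80
    uv = (uv << 6) | (c & 0x3f)   # shift in room for 6 bits, then append them
  end
  [uv, n]
end

bytes = "\u4E2D".unpack("C*")     # the three bytes 0xe4, 0xb8, 0xad
p decode_utf8_char(bytes)         # [20013, 3]
```

Working through the example by hand: 0xe4 leaves 0x04 after masking, 0xb8 contributes 0x38 and 0xad contributes 0x2d, and 0x04 << 12 | 0x38 << 6 | 0x2d is 0x4E2D, the same integer that unpack("U*") returns.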

2008-06-12 17:42