Garbled Chinese Text from HTMLParser on iOS 9

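After upgrading Xcode to 7.2 and building against the iOS 9 SDK, I retested an earlier project, CoCode, and found that Chinese text in web pages parsed with HTMLParser was displayed as garbled characters. My first idea was to change the text encoding, for example to UTF-8, but that made no difference. After some searching, I found the cause.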

Tagged Pointer

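Starting with iOS 9 (and OS X 10.10 on the desktop), NSString uses a technique called tagged pointers on 64-bit architectures. The technique improves performance and saves memory, and the problem described in this post is caused by NSString adopting it. For background, see the translated version of Mike Ash's article on tagged pointer strings (【译】采用Tagged Pointer的字符串). Put simply, the content is stored directly in the pointer value itself, which removes the step of following the pointer to a separate memory location. Implementing this for NSString is relatively difficult.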
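To make the idea concrete, here is a minimal sketch in C of storing a short string inside a single 64-bit word. The bit layout and the names toy_pack, toy_unpack, and toy_is_tagged are assumptions made up for illustration; this is not Apple's actual NSString encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy layout (an assumption, not Apple's): bit 0 is the tag, bits 1-3 hold
   the length, and bytes 1..7 of the word hold up to 7 characters. */
#define TOY_TAG_BIT 1u
#define TOY_MAX_LEN 7u

/* Returns 1 if the word carries inline data instead of an address. */
static int toy_is_tagged(uint64_t word) {
    return (int)(word & TOY_TAG_BIT);
}

/* Packs a short string into one word; returns 0 if it does not fit. */
static uint64_t toy_pack(const char *s) {
    size_t len = strlen(s);
    if (len > TOY_MAX_LEN) return 0;
    uint64_t word = TOY_TAG_BIT | ((uint64_t)len << 1);
    for (size_t i = 0; i < len; i++) {
        word |= (uint64_t)(unsigned char)s[i] << (8 * (i + 1));
    }
    return word;
}

/* Copies the inline characters into a caller-supplied buffer.
   Returns 1 on success, 0 if the buffer is too small. */
static int toy_unpack(uint64_t word, char *out, size_t out_size) {
    assert(toy_is_tagged(word));
    size_t len = (size_t)((word >> 1) & 0x7u);
    if (out_size < len + 1) return 0;
    for (size_t i = 0; i < len; i++) {
        out[i] = (char)((word >> (8 * (i + 1))) & 0xFFu);
    }
    out[len] = '\0';
    return 1;
}
```

Calling toy_pack("UTF-8") yields a word with the tag bit set, and toy_unpack recovers the text into a buffer. There is no separate character array that a caller could be handed a pointer to; the contents can only be copied out.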

Tagged Pointer NSString

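Back to the problem at hand. My code uses the third-party HTMLParser library, whose

-(id)initWithString:(NSString*)string error:(NSError**)error

method calls CFStringGetCStringPtr, one of the functions affected by tagged pointer NSStrings. When that call returns NULL, the encoding name handed to the HTML parser is NULL as well, which would explain the garbled output.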

See the following explanation:

Starting with iOS 9 certain strings (ones with a suitable length and encoding) on 64 bit architectures will now use a “tagged pointer” format where the string contents are stored directly in the pointer. This matches behavior introduced on OS X in 10.10 Yosemite.

Similarly to the OS X targets, passing a tagged NSString to functions such as CFStringGetCStringPtr or CFStringGetCharactersPtr will return NULL in cases where it may not have before. As before it is important to check for the NULL return value and use the corresponding buffer fetching function:

char buffer[BUFSIZE];  
const char *ptr = CFStringGetCStringPtr(str, encoding);  
if (ptr == NULL) {  
    if (CFStringGetCString(str, buffer, BUFSIZE, encoding)) ptr = buffer;
}

In addition, this change enables index and range checking to be performed more often when working with strings. So you may see runtime exceptions due to this change. In almost all cases these exceptions point to bugs in code, so please take them seriously.

This change will also break any code which treats NS/CFStrings objects as pointers and attempts to dereference them.  
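The fallback pattern in the note above can be tried out without CoreFoundation. The following is only a sketch: ToyString and the toy_* functions are hypothetical stand-ins for CFStringRef, CFStringGetCStringPtr, and CFStringGetCString, and the 7-byte inline limit is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string value: short contents are stored inline in the value,
   longer contents live in external storage. */
typedef struct {
    const char *external;  /* non-NULL when the bytes live elsewhere */
    char inline_bytes[8];  /* NUL-terminated copy used when external is NULL */
} ToyString;

/* Builds a ToyString; strings of up to 7 bytes are stored inline. */
static ToyString toy_make(const char *s) {
    ToyString t = { NULL, {0} };
    if (strlen(s) < sizeof t.inline_bytes) {
        strcpy(t.inline_bytes, s);
    } else {
        t.external = s;  /* the caller must keep s alive */
    }
    return t;
}

/* Fast path: a borrowed pointer, or NULL when there is no external storage.
   The argument is passed by value, so a pointer into inline_bytes would
   dangle after this function returns. */
static const char *toy_get_ptr(ToyString t) {
    return t.external;
}

/* Slow path: copies the contents into buf. Returns 1 on success. */
static int toy_get_cstring(ToyString t, char *buf, size_t size) {
    const char *src = t.external ? t.external : t.inline_bytes;
    size_t len = strlen(src);
    if (size < len + 1) return 0;
    memcpy(buf, src, len + 1);
    return 1;
}

/* The pattern from the note: try the fast path, then fall back to a copy.
   Returns NULL only if both paths fail. */
static const char *toy_c_string(ToyString t, char *buf, size_t size) {
    assert(buf != NULL);
    const char *ptr = toy_get_ptr(t);
    if (ptr == NULL && toy_get_cstring(t, buf, size)) ptr = buf;
    return ptr;
}
```

A short value such as toy_make("UTF-8") makes toy_get_ptr return NULL, so a caller that skips the fallback ends up with no string at all. That is the situation the HTMLParser code runs into with the charset name.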

Solution

In HTMLParser.m, change the implementation of the

-(id)initWithString:(NSString*)string error:(NSError**)error

method to:

-(id)initWithString:(NSString*)string error:(NSError**)error
{ 
    if (self = [super init])
    {
        _doc = NULL;

        if ([string length] > 0)
        {
            CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(NSUTF8StringEncoding);
            CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
            // May return NULL on iOS 9 when cfencstr is a tagged pointer string.
            const char *enc = CFStringGetCStringPtr(cfencstr, 0);
            // Fix garbled Chinese text on iOS 9 - begin
            // Fall back to copying the charset name into a local buffer.
            char buffer[255];
            if (enc == NULL) {
                if (CFStringGetCString(cfencstr, buffer, sizeof(buffer), kCFStringEncodingUTF8)) enc = buffer;
            }
            // Fix garbled Chinese text on iOS 9 - end
            // _doc = htmlParseDoc((xmlChar*)[string UTF8String], enc);
            int optionsHtml = HTML_PARSE_RECOVER;
            optionsHtml = optionsHtml | HTML_PARSE_NOERROR; // Comment this line out to see HTML errors
            optionsHtml = optionsHtml | HTML_PARSE_NOWARNING;
            _doc = htmlReadDoc ((xmlChar*)[string UTF8String], NULL, enc, optionsHtml);
        }
        else 
        {
            if (error) {
                *error = [NSError errorWithDomain:@"HTMLParserdomain" code:1 userInfo:nil];
            }
        }
    }

    return self;
}

After retesting CoCode, the garbled text problem is gone.