Tidy: Repairing Malformed HTML, XHTML, and XML
Date: 2014-11-29 22:25 Source: 我爱IT技术网 Author: 山风
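I recently noticed that the RSS feeds I fetch sometimes come back with malformed markup.
I asked our resident expert, 部落小波, and he pointed me to the HTML Tidy Library Project.
(部落小波 really does know a lot.)
Tidy repairs markup by adding missing tags and dropping superfluous ones.
Many of the editors people use to publish posts now let you edit the source markup directly,
and the markup sometimes ends up a mess. That is where Tidy comes in handy:
it can fix those errors for you.
Fortunately, PHP supports Tidy too.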
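Installation goes as follows:
(1) Install tidy
I use SuSE, so I tracked down the two RPMs, libtidy and libtidy_devel, and installed them.
(2) Build the tidy extension into PHP
./configure --with-tidy=/path/to/libtidy
That is, add --with-tidy to your existing build options.
(3) Once the build finishes, you are ready to use it.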
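Before running the examples, it can be worth confirming that the rebuilt PHP actually has the extension. The snippet below is a minimal sketch; `tidy_is_available` is a hypothetical helper name of my own, not something the tidy extension provides. It relies only on PHP's built-in `extension_loaded()` and `class_exists()`.

```php
<?php
// Hypothetical helper (not part of the tidy extension): reports whether
// the tidy extension is compiled in or loaded in the running PHP.
function tidy_is_available(): bool
{
    return extension_loaded('tidy') && class_exists('tidy');
}

if (tidy_is_available()) {
    echo "tidy extension is available\n";
} else {
    echo "tidy extension is missing; rebuild PHP with --with-tidy\n";
}
```

Running `php -m` on the command line and looking for `tidy` in the module list is another way to check.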
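Here are some examples.
First, repairing XML.
In this feed, every tag after 【.....丑丑风的老头~~】 has been deleted, so none of the open elements are closed.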
<?php
header("Content-Type: text/xml");
// Capture the deliberately truncated feed below into an output buffer.
ob_start();
?>
<rss version="2.0">
<channel>
<title>My Lief</title>
<link>http://blog.xuite.net/chingwei/blog</link>
<description>爽快过生活!!</description>
<item>
<title>【图】画了个丑丑风的老头</title>
<link>http://blog.xuite.net/chingwei/blog/23691532</link>
<description>丑丑风的老头~~<?php
$buffer = ob_get_clean();
// Treat both input and output as XML; indent the output and disable line wrapping.
$tidy_options = array(
    'input-xml'  => true,
    'output-xml' => true,
    'indent'     => true,
    'wrap'       => false,
);
$tidy = new tidy();
$tidy->parseString($buffer, $tidy_options, 'utf8');
$tidy->cleanRepair();
echo $tidy;
?>
Here is the output. Tidy filled in all the missing closing tags. Nice.
<rss version="2.0">
  <channel>
    <title>My Lief</title>
    <link>http://blog.xuite.net/chingwei/blog</link>
    <description>爽快过生活!!</description>
    <item>
      <title>【图】画了个丑丑风的老头</title>
      <link>http://blog.xuite.net/chingwei/blog/23691532</link>
      <description>丑丑风的老头~~</description>
    </item>
  </channel>
</rss>
Next, append an unterminated <Error tag right after 【.....丑丑风的老头~~】.
<?php
header("Content-Type: text/xml");
// Same truncated feed as before, now ending in an unterminated <Error tag.
ob_start();
?>
<rss version="2.0">
<channel>
<title>My Lief</title>
<link>http://blog.xuite.net/chingwei/blog</link>
<description>爽快过生活!!</description>
<item>
<title>【图】画了个丑丑风的老头</title>
<link>http://blog.xuite.net/chingwei/blog/23691532</link>
<description>丑丑风的老头~~<Error<?php
$buffer = ob_get_clean();
$tidy_options = array(
    'input-xml'  => true,
    'output-xml' => true,
    'indent'     => true,
    'wrap'       => false,
);
$tidy = new tidy();
$tidy->parseString($buffer, $tidy_options, 'utf8');
$tidy->cleanRepair();
echo $tidy;
?>
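This time the result contains an extra Error element.
That is not what I wanted, although Tidy's behavior is arguably correct: it saw the start of a tag and completed it.
I can't ask for too much. It is already very capable.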
<rss version="2.0">
  <channel>
    <title>My Lief</title>
    <link>http://blog.xuite.net/chingwei/blog</link>
    <description>爽快过生活!!</description>
    <item>
      <title>【图】画了个丑丑风的老头</title>
      <link>http://blog.xuite.net/chingwei/blog/23691532</link>
      <description>丑丑风的老头~~<Error></Error></description>
    </item>
  </channel>
</rss>
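Since a repair like the stray Error element above may not be what you want, one option is to look at what Tidy complained about before trusting its output. The following is only a sketch: `repair_xml_with_report` is a hypothetical helper I made up for illustration, and I have not verified which diagnostics Tidy emits for the truncated feed specifically. It uses the extension's `getStatus()` method, the `errorBuffer` property, and `tidy_get_output()`.

```php
<?php
// Hypothetical helper: repairs an XML string and also returns tidy's
// diagnostics, so the caller can decide whether to trust the repair.
function repair_xml_with_report(string $xml): array
{
    $tidy = new tidy();
    $tidy->parseString($xml, array(
        'input-xml'  => true,
        'output-xml' => true,
        'wrap'       => 0,   // 0 disables line wrapping
    ), 'utf8');

    // Read the diagnostics from the parse before repairing.
    // getStatus(): 0 = nothing reported, 1 = warnings, 2 = errors.
    $status   = $tidy->getStatus();
    $messages = (string) $tidy->errorBuffer;

    $tidy->cleanRepair();

    return array(
        'status'   => $status,
        'messages' => $messages,
        'xml'      => tidy_get_output($tidy),
    );
}

$report = repair_xml_with_report('<a><b>text');
echo "status: ", $report['status'], "\n";
echo $report['messages'], "\n";
echo $report['xml'], "\n";
```

A caller could, for instance, keep the repaired feed when the status is 0 and log or discard it otherwise. Where exactly to draw that line is a judgment call that depends on how much you trust the source.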
Below is the sample from the official PHP site, which repairs HTML using tidy_repair_string.
<?php
ob_start();
?>
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</i>
</body>
</html>
<?php
$buffer = ob_get_clean();
$tidy = tidy_repair_string($buffer);
echo $tidy;
?>
In the result, both the missing tag and the wrong one have been fixed.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</p>
</body>
</html>
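The sample above repairs a whole document with default settings. For the case that motivated this post (markup mangled by a post editor), you often only have a fragment. Here is a rough sketch of how that might look; `clean_fragment` is a hypothetical helper name, and the choice of `output-xhtml`, `show-body-only`, and `wrap` (all listed in the options table at the end of this post) is my assumption about what is useful here, not a recommendation from the Tidy project.

```php
<?php
// Hypothetical helper: cleans an HTML fragment, such as a post body coming
// from a rich-text editor, and returns only the repaired body content.
function clean_fragment(string $html): string
{
    $options = array(
        'output-xhtml'   => true,  // emit well-formed XHTML
        'show-body-only' => true,  // omit doctype, <html>, <head>, <body>
        'wrap'           => 0,     // 0 disables line wrapping
    );
    $repaired = tidy_repair_string($html, $options, 'utf8');

    // tidy_repair_string() returns false on failure.
    return $repaired === false ? '' : $repaired;
}

echo clean_fragment('<p>error</i>');
```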
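References:
HTML Tidy Library Project
PHP: Tidy
不太会写程式 ("Not Very Good at Programming")
PHP+Tidy-完美的XHTML纠错+过滤 ("PHP + Tidy: Perfect XHTML Error Correction and Filtering")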
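P.S.
I rewrote this post three times, and this editor nearly drove me crazy. Every time I pressed Ctrl+V, it replaced the < characters in my textarea. So I will not be updating this post again.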
Appendix: HTML Tidy Configuration Options

HTML, XHTML, XML Options

| Option | Type | Default |
| --- | --- | --- |
| add-xml-decl | Boolean | no |
| add-xml-space | Boolean | no |
| alt-text | String | - |
| anchor-as-name | Boolean | yes |
| assume-xml-procins | Boolean | no |
| bare | Boolean | no |
| clean | Boolean | no |
| css-prefix | String | - |
| decorate-inferred-ul | Boolean | no |
| doctype | DocType | auto |
| drop-empty-paras | Boolean | yes |
| drop-font-tags | Boolean | no |
| drop-proprietary-attributes | Boolean | no |
| enclose-block-text | Boolean | no |
| enclose-text | Boolean | no |
| escape-cdata | Boolean | no |
| fix-backslash | Boolean | yes |
| fix-bad-comments | Boolean | yes |
| fix-uri | Boolean | yes |
| hide-comments | Boolean | no |
| hide-endtags | Boolean | no |
| indent-cdata | Boolean | no |
| input-xml | Boolean | no |
| join-classes | Boolean | no |
| join-styles | Boolean | yes |
| literal-attributes | Boolean | no |
| logical-emphasis | Boolean | no |
| lower-literals | Boolean | yes |
| merge-divs | AutoBool | auto |
| merge-spans | AutoBool | auto |
| ncr | Boolean | yes |
| new-blocklevel-tags | Tag names | - |
| new-empty-tags | Tag names | - |
| new-inline-tags | Tag names | - |
| new-pre-tags | Tag names | - |
| numeric-entities | Boolean | no |
| output-html | Boolean | no |
| output-xhtml | Boolean | no |
| output-xml | Boolean | no |
| preserve-entities | Boolean | no |
| quote-ampersand | Boolean | yes |
| quote-marks | Boolean | no |
| quote-nbsp | Boolean | yes |
| repeated-attributes | enum | keep-last |
| replace-color | Boolean | no |
| show-body-only | AutoBool | no |
| uppercase-attributes | Boolean | no |
| uppercase-tags | Boolean | no |
| word-2000 | Boolean | no |

Diagnostics Options

| Option | Type | Default |
| --- | --- | --- |
| accessibility-check | enum | 0 (Tidy Classic) |
| show-errors | Integer | 6 |
| show-warnings | Boolean | yes |

Pretty Print Options

| Option | Type | Default |
| --- | --- | --- |
| break-before-br | Boolean | no |
| indent | AutoBool | no |
| indent-attributes | Boolean | no |
| indent-spaces | Integer | 2 |
| markup | Boolean | yes |
| punctuation-wrap | Boolean | no |
| sort-attributes | enum | none |
| split | Boolean | no |
| tab-size | Integer | 8 |
| vertical-space | Boolean | no |
| wrap | Integer | 68 |
| wrap-asp | Boolean | yes |
| wrap-attributes | Boolean | no |
| wrap-jste | Boolean | yes |
| wrap-php | Boolean | yes |
| wrap-script-literals | Boolean | no |
| wrap-sections | Boolean | yes |

Character Encoding Options

| Option | Type | Default |
| --- | --- | --- |
| ascii-chars | Boolean | no |
| char-encoding | Encoding | ascii |
| input-encoding | Encoding | latin1 |
| language | String | - |
| newline | enum | Platform dependent |
| output-bom | AutoBool | auto |
| output-encoding | Encoding | ascii |

Miscellaneous Options

| Option | Type | Default |
| --- | --- | --- |
| error-file | String | - |
| force-output | Boolean | no |
| gnu-emacs | Boolean | no |
| gnu-emacs-file | String | - |
| keep-time | Boolean | no |
| output-file | String | - |
| quiet | Boolean | no |
| slide-style | String | - |
| tidy-mark | Boolean | yes |
| write-back | Boolean | no |
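
The defaults listed above may differ depending on the libtidy version your PHP was built against, so it can help to read the effective values back at runtime. This is a small sketch using the extension's `getOpt()` method; the input string and the particular options queried are arbitrary choices for illustration.

```php
<?php
// Parse a trivial document with a few options overridden, then read
// option values back to see what tidy is actually using.
$tidy = new tidy();
$tidy->parseString('<p>hello</p>', array(
    'indent-spaces' => 4,
    'wrap'          => 0,
), 'utf8');

// 'indent-spaces' and 'wrap' were set above; 'tidy-mark' was left at
// whatever default the installed libtidy uses.
foreach (array('indent-spaces', 'wrap', 'tidy-mark') as $name) {
    echo $name, ' = ';
    var_dump($tidy->getOpt($name));
}
```

`$tidy->getConfig()` returns the full set of options as an array if you would rather dump everything at once.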
