(Instructions for wikifying the Sources.RU site)
Tidy cleans HTML code.
This method gives even cleaner code than Tidy.
Just strip two parts of the code:
from the beginning up to and including the </h3> tag, and from </td> to the end of the HTML:
Bla-bla-bla
Bla-bla-bla
...
</h3>
----cut here----
clean HTML code
----cut here----
</td>
...
Bla-bla-bla
Bla-bla-bla
Replace the beginning with this:
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251"> </head> <body>
Replace the end with this:
</body> </html>
So the final HTML will look like this:
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251"> </head> <body> clean HTML code </body> </html>
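The stripping and wrapping steps above can be sketched as a small shell function. This is only an illustration, not part of the original scripts: `clean_page` is a hypothetical name, and the sketch assumes that the cut points are the first </h3> and the last </td> in the file, which the text does not state explicitly.

```shell
#!/bin/sh
# Sketch only: clean_page is a hypothetical helper, not one of the original scripts.
# Assumption: cut after the FIRST </h3> and before the LAST </td>.
clean_page() {
    # $1 = input HTML file; the cleaned page is written to stdout
    printf '%s\n' '<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251"> </head> <body>'
    # LC_ALL=C makes awk treat the Windows-1251 text as plain bytes
    LC_ALL=C awk '
        { page = page $0 "\n" }                 # read the whole file into one string
        END {
            body = page
            start = index(page, "</h3>")
            if (start > 0)
                body = substr(page, start + 5)  # drop everything up to and including </h3>
            # locate the last </td> in what is left
            last = 0; off = 0; rest = body
            while ((p = index(rest, "</td>")) > 0) {
                last = off + p
                off = off + p + 4
                rest = substr(rest, p + 5)
            }
            if (last > 0)
                body = substr(body, 1, last - 1) # drop </td> and everything after it
            printf "%s\n", body
        }
    ' "$1"
    printf '%s\n' '</body> </html>'
}
```

A possible use would be `clean_page page.html > page.html.new`, so that the result matches the "*.html.new" pattern expected by the converter scripts; the file names here are examples only.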
We need one of two converters. In general, we use HTML-WikiConverter to convert to Wiki syntax, but this script does not always give the desired results, so we also use HTML-Parser to convert to plain text:
Contents of the wikify.sh shell script, which uses the html2wiki Perl script:
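#!/bin/sh
# Romiras 17/01/2006
# Converter for all html files to DokuWiki syntax
# "*.html.new" -> "*.html.new.wiki.txt"

# Path to 'html2wiki' script
PATH_TO="."
# Encoding of input HTML files
HTML_ENCODING="Windows-1251"
# Log output ($HOME is used because a quoted "~" is not expanded by the shell)
MYLOG="$HOME/wikify.log"

# Read file names line by line so that names containing spaces survive
find . -name "*.html.new" | while IFS= read -r i
do
    # The converted text goes to the .wiki.txt file next to the source file
    "$PATH_TO/html2wiki" --dialect DokuWiki --encoding "$HTML_ENCODING" <"$i" >"$i.wiki.txt"
    # Append the name of each generated file to the log
    echo "$i.wiki.txt" >>"$MYLOG"
done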
Contents of the textify.sh shell script, which uses the html2txt.pl script:
#!/bin/sh
# Romiras 17/01/2006
# Converter for all html files to plain text:
# "*.html.new" -> "*.html.new.plain.txt"

# Read file names line by line so that names containing spaces survive
find . -name "*.new" | while IFS= read -r i
do
    /home/knoppix/html2txt.pl "$i" >"$i.plain.txt"
done
For example, names such as «delphi_xxx.html» are changed to «xxx.html» by the regular-expression replacement:
«delphi_»
→ «»
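A minimal sketch of that replacement applied to file names in one directory follows. `strip_prefix` is a hypothetical helper, and the sketch anchors the pattern at the start of the name (`^delphi_`), which is a slightly stricter assumption than the unanchored expression given above.

```shell
#!/bin/sh
# Sketch only: strip_prefix is a hypothetical helper, not one of the original scripts.
strip_prefix() {
    dir=${1:-.}                               # directory to process (default: current)
    for f in "$dir"/delphi_*.html
    do
        [ -e "$f" ] || continue               # the glob matched nothing
        # «delphi_» → «», anchored at the start of the name (assumption)
        name=$(basename "$f" | sed 's/^delphi_//')
        [ -e "$dir/$name" ] && continue       # never overwrite an existing file
        mv -- "$f" "$dir/$name"
    done
}
```

Calling `strip_prefix some_dir` would rename `some_dir/delphi_xxx.html` to `some_dir/xxx.html` and leave other files alone; links inside the pages would still need the same replacement.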
Some pages on the site are placed chaotically, so we first need to build a tree of links. Once the tree is sorted, the same structure can be recreated as a tree of directories named after the links, with the files given the 'txt' extension.
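The last step could look roughly like the sketch below. It is built on assumptions that are not in the original text: the sorted tree is taken to be a plain list file with one slash-separated link path per line (for example `delphi/strings/format`), and `make_tree` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: make_tree and the list-file format are assumptions.
make_tree() {
    # $1 = file with one slash-separated link path per line
    # $2 = root directory under which the tree is created
    while IFS= read -r link
    do
        [ -n "$link" ] || continue               # skip empty lines
        mkdir -p "$2/$(dirname "$link")"         # directories named after the links
        touch "$2/$link.txt"                     # page file with the 'txt' extension
    done < "$1"
}
```

For instance, `make_tree links.txt wiki` would create `wiki/delphi/strings/format.txt` for the line `delphi/strings/format`; `touch` is used so that an already converted page is not emptied.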