====== HTML to Wiki, step by step ======

**(A guide to wikifying the Sources.RU site)**

===== "Tidify" HTML files =====

[[http://tidy.sourceforge.net/|Tidy]] cleans up HTML code.

==== Another, "tricky" way to clean code ====

This method gives even cleaner code than Tidy. Just strip two parts of the code: everything from the beginning of the file up to the tag where the page content starts, and everything from the tag where it ends to the end of the HTML:

<code>
Bla-bla-bla
Bla-bla-bla
...
----cut here----
clean HTML code
----cut here----
...
Bla-bla-bla
Bla-bla-bla
</code>

Replace footer with this:

Replace end with this:

So, the final HTML will look like this:

<code>
clean HTML code
</code>

===== "Textify"/"Wikify" HTML files =====

We need one of two converters. Normally we use HTML-WikiConverter to convert to wiki syntax, but that script does not always give the desired results, so in those cases we use HTML-Parser to convert to plain text instead:

  * [[http://search.cpan.org/~diberri/HTML-WikiConverter/|HTML-WikiConverter]]
  * [[http://search.cpan.org/~gaas/HTML-Parser-3.48/lib/HTML/TokeParser.pm|HTML-Parser]]

Contents of the **wikify.sh** shell script, which uses the //html2wiki// Perl script:

<code bash>
#!/bin/sh
# Romiras 17/01/2006
# Converts all HTML files to DokuWiki syntax:
# "*.html.new" -> "*.html.new.wiki.txt"

# Path to the 'html2wiki' script
PATH_TO="."
# Encoding of the input HTML files
HTML_ENCODING="Windows-1251"
# Log file ($HOME is used because "~" is not expanded inside quotes)
MYLOG="$HOME/wikify.log"

# Read file names line by line so that names containing spaces survive
find . -name "*.html.new" | while IFS= read -r i
do
    # Record each output file name, plus any converter errors, in the log
    echo "$i.wiki.txt" >>"$MYLOG"
    "$PATH_TO/html2wiki" --dialect DokuWiki --encoding "$HTML_ENCODING" <"$i" >"$i.wiki.txt" 2>>"$MYLOG"
done
</code>

Contents of the **textify.sh** shell script, which uses the //html2txt.pl// script:

<code bash>
#!/bin/sh
# Romiras 17/01/2006
# Converts all HTML files to plain text:
# "*.html.new" -> "*.html.new.plain.txt"

# Read file names line by line so that names containing spaces survive
find . -name "*.new" | while IFS= read -r i
do
    /home/knoppix/html2txt.pl "$i" >"$i.plain.txt"
done
</code>

===== Rename (remove prefixes from) files =====

For example, rename files such as "delphi_xxx.html" to "xxx.html" with a regular-expression replacement: ''"delphi_"'' -> ''""''

===== Build full hierarchy of links and its directory analogues =====

Some pages on the site are placed chaotically, so first we need to build a tree of links. Once that tree is sorted, the same structure can be recreated as a tree of directories whose entries are named after the links, with the extension 'txt'.
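
The directory-building step could be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used for the site: the tab-separated list file, the ''build_tree'' function name, and the destination directory are all hypothetical. Each line of the list is assumed to hold a page's position in the sorted link tree, a tab, and the converted file that belongs there.

```shell
#!/bin/sh
# Sketch only: recreate a sorted link tree as directories of .txt files.
# Assumed (hypothetical) input: a tab-separated list, one page per line:
#   <link path in the tree> TAB <converted file to place there>

build_tree() {
    list=$1    # hypothetical list file described above
    dest=$2    # root directory of the new tree
    tab=$(printf '\t')
    # Sorting groups pages that share a parent link next to each other
    sort "$list" | while IFS="$tab" read -r link src
    do
        # Skip blank or incomplete lines
        [ -n "$link" ] && [ -n "$src" ] || continue
        # The link's parent path becomes nested directories
        mkdir -p "$dest/$(dirname "$link")"
        # The last component of the link becomes the page file name
        cp "$src" "$dest/$link.txt"
    done
}
```

Calling ''build_tree links.tsv ./wiki'' would place the page listed under ''delphi/strings/format'' at ''./wiki/delphi/strings/format.txt''.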