HTML to Wiki, step by step

(A guide to wikifying the Sources.RU site)

"Tidify" HTML files

Tidy cleans up HTML code.

Another, "tricky" way to clean code

This method gives even cleaner code than Tidy.

Just strip two parts of the code:
everything from the beginning up to the </h3> tag, and everything from </td> to the end of the HTML:

Bla-bla-bla
Bla-bla-bla
...
</h3>
----cut here----
clean HTML code
----cut here----
</td>
...
Bla-bla-bla
Bla-bla-bla
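
The cut described above can be sketched as a small shell function. This is only an illustration under assumptions the original does not state: the marker tags sit on lines of their own (as in the picture above), the cut starts after the first </h3>, and it ends before the last </td> in the file. The name extract_body is hypothetical.

```shell
#!/bin/sh
# Sketch: print only the lines between the first "</h3>" and the last "</td>".
# Assumption: each marker tag is on its own line, so whole lines can be dropped.
# LC_ALL=C makes awk treat the Windows-1251 input as plain bytes.
extract_body() {
  LC_ALL=C awk '
    { line[NR] = $0 }                        # remember every line
    start == 0 && /<\/h3>/ { start = NR }    # first </h3>
    /<\/td>/               { stop = NR }     # last </td> seen so far
    END {
      if (start == 0 || stop <= start) exit 1   # markers missing or out of order
      for (n = start + 1; n < stop; n++) print line[n]
    }
  ' "$1"
}
```

For example, `extract_body page.html > body.tmp` would leave just the "clean HTML code" part in body.tmp (both file names are placeholders).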

Replace the stripped beginning (the header) with this:

<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head>
 
<body>

Replace the stripped end (the footer) with this:

</body>
</html>

The final HTML will then look like this:

<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head>
 
<body>
clean HTML code
</body>
</html>
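
Adding the header and footer by hand gets tedious for many files, so here is a minimal sketch of doing it in the shell. It assumes the cleaned fragment is already stored in a file of its own; wrap_html is a hypothetical helper name, and the header and footer are exactly the ones shown above.

```shell
#!/bin/sh
# Sketch: wrap an already cleaned HTML fragment (file given as $1)
# in the minimal header and footer and print the result to stdout.
wrap_html() {
  printf '%s\n' \
    '<html>' \
    '<head>' \
    '  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">' \
    '</head>' \
    '' \
    '<body>'
  cat "$1"                      # the clean HTML code, byte for byte
  printf '%s\n' '</body>' '</html>'
}
```

A possible call is `wrap_html body.tmp > page.html.new`, where both file names are placeholders.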

"Textify"/"Wikify" HTML files

We need one of two converters. Normally we use HTML-WikiConverter to convert to Wiki syntax, but it does not always give the desired result, so in those cases we fall back to HTML-Parser and convert to plain text:

HTML-WikiConverter

HTML-Parser

Contents of the wikify.sh shell script, which uses the html2wiki Perl script:

#!/bin/sh
# Romiras 17/01/2006
# Convert all HTML files to DokuWiki syntax:
# "*.html.new" -> "*.html.new.wiki.txt"

# Path to the 'html2wiki' script
PATH_TO="."

# Encoding of the input HTML files
HTML_ENCODING="Windows-1251"

# Log file ($HOME is used because "~" is not expanded inside quotes)
MYLOG="$HOME/wikify.log"

# Read file names line by line so that names with spaces survive.
find . -name "*.html.new" | while IFS= read -r i
do
  "$PATH_TO/html2wiki" --dialect DokuWiki --encoding "$HTML_ENCODING" <"$i" >"$i.wiki.txt"
  # Print the name of each output file and append it to the log
  echo "$i.wiki.txt" | tee -a "$MYLOG"
done

Contents of the textify.sh shell script, which uses the html2txt.pl script:

#!/bin/sh
# Romiras 17/01/2006
# Convert all HTML files to plain text:
# "*.html.new" -> "*.html.new.plain.txt"

# Read file names line by line so that names with spaces survive.
find . -name "*.html.new" | while IFS= read -r i
do
  /home/knoppix/html2txt.pl "$i" >"$i.plain.txt"
done

Rename (remove prefixes from) files

For example, rename «delphi_xxx.html» to «xxx.html» with a regular expression:

«delphi_» → «»
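
If no renaming tool with regular expressions is at hand, the same substitution can be sketched in plain shell. Note that this uses shell parameter expansion rather than a regular expression, and that strip_prefix is a hypothetical helper name. It takes a directory and a prefix, and it skips a file instead of overwriting when the target name is already taken.

```shell
#!/bin/sh
# Sketch: remove a fixed prefix from the file names in one directory.
# Arguments: $1 = directory, $2 = prefix (e.g. "delphi_").
strip_prefix() {
  dir=$1
  prefix=$2
  for f in "$dir"/"$prefix"*; do
    [ -e "$f" ] || continue            # the glob matched nothing
    name=${f##*/}                      # file name without the directory
    new=$dir/${name#"$prefix"}         # same name with the prefix removed
    if [ -e "$new" ]; then
      echo "skip: $new already exists" >&2
    else
      mv -- "$f" "$new"
    fi
  done
}
```

For example, `strip_prefix . delphi_` renames ./delphi_xxx.html to ./xxx.html and leaves files without the prefix alone.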

Build the full hierarchy of links and its directory analogue

Some pages on the site are placed chaotically, so we first need to build a tree of links. Once that tree is sorted, the same structure can be recreated as a tree of directories named after the links, with the pages stored as files with the 'txt' extension.
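
The second half of this step, turning the sorted tree into directories, might be sketched as follows. The input format is purely an assumption for illustration: a text file with one slash-separated link path per line, such as "programming/delphi/strings". The name build_tree is hypothetical as well. How the tree of links is collected in the first place is not covered here.

```shell
#!/bin/sh
# Sketch: mirror a list of link paths as directories and .txt files.
# Arguments: $1 = list file (one path per line, assumed format),
#            $2 = root directory of the resulting tree.
build_tree() {
  list=$1
  root=$2
  while IFS= read -r link; do
    [ -n "$link" ] || continue                 # ignore empty lines
    mkdir -p "$(dirname "$root/$link")"        # create parent directories
    : >> "$root/$link.txt"                     # create the page file if missing
  done < "$list"
}
```

With a list containing "programming/delphi/strings", the call `build_tree links.lst wiki` creates wiki/programming/delphi/strings.txt; existing files are left untouched because the redirection appends nothing.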
