HTML Conversions

Converting an HTML file to plain text

The following shell script converts every HTML file in a directory to plain text and keeps the links as a list at the end of each text file. It assumes that the *.htm or *.html files to be converted are in the directory where the script (h2t) is executed.

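#!/bin/sh
# h2t: convert all .htm and .html files in the current directory to text.
# File names containing spaces are not handled, because the output of ls
# is split on whitespace.
for file in `ls *.htm`
do
    # basename strips the "htm" suffix and keeps the dot: page.htm -> page.
    new=`basename "$file" htm`
    lynx -dump "$file" > "${new}txt"
done
#####
for file in `ls *.html`
do
    new=`basename "$file" html`
    lynx -dump "$file" > "${new}txt"
done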

I did not send the error messages from ls to /dev/null, so if no file matches one of the patterns, ls will print an error on the screen, such as "No such file or directory". To have all the internal links referenced by a list at the end of the text file, you will need to set Lynx up correctly.
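
If the ls messages or file names with spaces get in the way, one option is to let the shell expand the patterns itself. The following is only a sketch of that idea, not a replacement for h2t. The function name h2t_dir and its optional directory argument are made up for this example, and it still assumes that lynx is installed and on the PATH.

```shell
#!/bin/sh
# h2t_dir [DIR]: hypothetical variant of h2t that globs instead of calling ls.
# DIR defaults to the current directory; lynx must be installed.
h2t_dir() {
    dir=${1:-.}
    for file in "$dir"/*.htm "$dir"/*.html
    do
        # A pattern that matches nothing is left unexpanded by the shell,
        # so skip anything that is not a regular file.
        [ -f "$file" ] || continue
        # ${file%.*} drops the last extension: page.html -> page
        lynx -dump "$file" > "${file%.*}.txt"
    done
}
```

Calling h2t_dir with no argument converts the current directory. Because the file names are never passed through ls and are always quoted, names with spaces survive, and an empty directory produces no messages at all.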

Converting a plain text file to HTML

This is a shell script built around sed. If the text is reasonably well formatted, a fully usable HTML file will result. As set up below, the script (t2h) does the following:

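#!/bin/sh
# t2h {$1}: html-ize a text file and save it as foo.htm
# Steps, in pipeline order:
#   1. replace " at " with " at " (changes nothing as written)
#   2. turn control characters into spaces
#   3. empty out whitespace-only lines
#   4. collapse runs of blank lines into one
#   5. turn each blank line into </UL><P>
#   6. join each <P> line with the line that follows it
#   7. open a <UL> when a paragraph starts with a double quote
#   8. put <BR> before lines that start with a dash
#   9. wrap http:// URLs in <A HREF> links
NL="
"
cat "$1" \
| sed -e 's/ at / at /g' \
| sed -e 's/[[:cntrl:]]/ /g' \
| sed -e 's/^[[:space:]]*$//g' \
| sed -e '/^$/{'"$NL"'N'"$NL"'/^\n$/D'"$NL"'}' \
| sed -e 's/^$/<\/UL><P>/g' \
| sed -e '/<P>$/{'"$NL"'N'"$NL"'s/\n//'"$NL"'}' \
| sed -e 's/<P>[[:space:]]*"/<P><UL>"/' \
| sed -e 's/^[[:space:]]*-/<BR> -/g' \
| sed -e 's/http:\/\/[[:graph:]]*/<A HREF="&">[&]<\/A> /g' \
                                > foo.htm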
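
To see what a single step of the pipeline does, it can help to run it on one line of text. The sketch below isolates a slightly simplified copy of the URL substitution from t2h. The function name linkify and the example.com address are invented for this illustration.

```shell
#!/bin/sh
# linkify: hypothetical filter holding only the URL step of t2h.
# Reads text on stdin and writes it back with http:// URLs wrapped in links.
linkify() {
    # & in the replacement stands for the whole matched URL.
    sed -e 's/http:\/\/[[:graph:]]*/<A HREF="&">[&]<\/A> /g'
}

# Try it on one sample line.
printf '%s\n' 'See http://example.com/page for details' | linkify
```

Because [[:graph:]] matches any visible character, punctuation that directly follows a URL (a trailing period or a closing parenthesis, say) becomes part of the link, and https:// addresses are left untouched.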

Obviously, neither the HEAD section of the HTML file nor the enclosing BODY and HTML tags are written (these tags are optional under HTML 4, although a valid document still needs a DOCTYPE and a TITLE). They could be added with an additional line in the script, like this:

cat header foo.htm tail > bar.htm
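
If you would rather not keep separate header and tail files, the same wrapping can be sketched as a small shell function. The name wrap_html, its arguments, and the default title are assumptions for illustration only; the DOCTYPE shown is the HTML 4.01 Transitional one.

```shell
#!/bin/sh
# wrap_html BODY OUT [TITLE]: hypothetical helper that surrounds the
# fragment in BODY with a minimal HTML 4.01 header and tail, writing OUT.
wrap_html() {
    body=$1
    out=$2
    title=${3:-Untitled}
    {
        printf '%s\n' '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">'
        printf '<HTML>\n<HEAD>\n<TITLE>%s</TITLE>\n</HEAD>\n<BODY>\n' "$title"
        cat "$body"
        printf '</BODY>\n</HTML>\n'
    } > "$out"
}
```

A call such as wrap_html foo.htm bar.htm "My notes" would then produce the same kind of bar.htm as the cat command above, with the header and tail generated on the fly.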

Additionally, you could include other HTML tags within the original text, as long as you do not use anything that the sed script would alter.


URL:http://jnocook.net/geek/htm.htm