HTML to text; text to HTML

HTML Conversions

Converting an HTML file to plain text

The following shell script converts every HTML file in a directory to plain text and keeps the links as a list at the end of each text file. It assumes that the *.htm or *.html files to be converted are in the directory where the script (h2t) is executed.

#!/bin/sh
# h2t, convert all htm and html files of a directory to text
for file in `ls *.htm`
do
    new=`basename $file htm`
    lynx -dump $file > ${new}txt
done
#####
for file in `ls *.html`
do
    new=`basename $file html`
    lynx -dump $file > ${new}txt
done

I did not send the error messages from ls to /dev/null, so if no file matches one of the two patterns, ls prints an error on the screen (typically "No such file or directory"). To have all the internal links referenced by a list at the end of the text file, you will need to set Lynx up correctly.
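
As a variation, the same loop can be written so that the shell expands the file names itself, which avoids the ls error message and copes with spaces in file names. This is only a sketch, not the original h2t: the function name h2t_glob and the H2T_DUMP override are made up for illustration, and the converter still defaults to lynx -dump.

```shell
#!/bin/sh
# h2t_glob: hypothetical variant of h2t that uses shell patterns instead of ls.
# H2T_DUMP is an assumed override for the converter command;
# when it is unset, "lynx -dump" is used as in the original script.
h2t_glob() {
    for file in *.htm *.html
    do
        # A pattern that matches nothing is left as literal text; skip it.
        [ -f "$file" ] || continue
        # Strip the last extension and write to NAME.txt
        new="${file%.*}.txt"
        ${H2T_DUMP:-lynx -dump} "$file" > "$new"
    done
}
```

Calling h2t_glob in the directory holding the files does the conversion. Note that page.htm and page.html would both be written to page.txt, as in the original script.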

Converting a plain text file to HTML

This is a sed script. If the text is reasonably formatted, a fully usable HTML file will result. As set up below, the script (t2h) does the following:

substitute the word " at " for any "@" signs
replace any control characters with a space (the script matches [[:cntrl:]])
reduce lines containing only tabs and blanks to empty lines
remove duplicate blank lines (leaving one between paragraphs)
place </UL><P> on the remaining blank lines (paragraphing)
remove the line break after each </UL><P> (so the tags end up at the start of the next line)
indent any paragraph (not line) starting with a quote mark, removing leading tabs and spaces
introduce a <BR> tag on any line starting with a hyphen, removing leading tabs and spaces
convert http://URL to a link

#!/bin/sh
# t2h {$1} html-ize a text file and save as foo.htm
# NL holds a literal newline, used to build multi-line sed commands
NL="
"
cat "$1" \
| sed -e 's/@/ at /g' \
| sed -e 's/[[:cntrl:]]/ /g' \
| sed -e 's/^[[:space:]]*$//g' \
| sed -e '/^$/{'"$NL"'N'"$NL"'/^\n$/D'"$NL"'}' \
| sed -e 's/^$/<\/UL><P>/g' \
| sed -e '/<P>$/{'"$NL"'N'"$NL"'s/\n//'"$NL"'}' \
| sed -e 's/[[:space:]]*"/<UL>"/' \
| sed -e 's/^[[:space:]]*-/<BR> -/g' \
| sed -e 's/http:\/\/[[:graph:]\.\/]*/<A HREF="&">[&]<\/A> /g' \
> foo.htm
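
The blank-line steps are the least obvious part of the script, so here they are pulled out on their own. This is a sketch under assumptions: the helper name mark_paragraphs is invented, the paragraph marker is taken to be </UL><P> as the step list describes, and the sed commands are passed as separate -e arguments instead of being joined with a newline variable.

```shell
#!/bin/sh
# mark_paragraphs: hypothetical helper covering only the blank-line steps.
# Reads text on standard input and writes the result to standard output.
mark_paragraphs() {
    # 1. empty out lines that hold only tabs and blanks
    # 2. squeeze each run of empty lines down to one
    # 3. put the paragraph marker on the remaining empty lines
    # 4. join each marker line with the line that follows it
    sed -e 's/^[[:space:]]*$//' \
    | sed -e '/^$/{' -e 'N' -e '/^\n$/D' -e '}' \
    | sed -e 's/^$/<\/UL><P>/' \
    | sed -e '/<P>$/{' -e 'N' -e 's/\n//' -e '}'
}

# Two paragraphs separated by two whitespace-only lines.
printf 'First paragraph.\n  \n\t\nSecond paragraph.\n' | mark_paragraphs
# prints:
#   First paragraph.
#   </UL><P>Second paragraph.
```

The second sed is the usual idiom for collapsing blank lines: N pulls in the next line, and D throws away the first of two empty lines and restarts without reading more input.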

Obviously the HEAD section of the HTML file and the enclosing BODY and HTML tags are not written (these tags are also not required under HTML 4). They could be added with an additional line in the script, like:
cat header foo.htm tail > bar.htm
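
For illustration, the two extra files might be created as follows. The contents of header and tail are not given in the original, so the tags and the title text below are assumptions, as is the helper name wrap_html.

```shell
#!/bin/sh
# wrap_html: hypothetical helper that writes minimal "header" and "tail"
# files in the current directory and wraps a converted body with them.
# Usage: wrap_html foo.htm bar.htm
wrap_html() {
    body=$1
    out=$2
    # Opening part of the page (example content only)
    cat > header <<'EOF'
<HTML>
<HEAD><TITLE>Converted text</TITLE></HEAD>
<BODY>
EOF
    # Closing part of the page
    cat > tail <<'EOF'
</BODY>
</HTML>
EOF
    cat header "$body" tail > "$out"
}
```

Running wrap_html foo.htm bar.htm after t2h leaves the complete page in bar.htm.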

Additionally, you could include other HTML tags within the original text - as long as you do not use something which the sed script would alter.
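
One tag that the script would alter is a hand-written link. The sketch below runs only the URL step (under the made-up name linkify) on a bare URL and on an existing anchor, to show the difference.

```shell
#!/bin/sh
# linkify: the URL step of t2h on its own, reading standard input.
# The function name is hypothetical; the sed expression is the one from t2h.
linkify() {
    sed -e 's/http:\/\/[[:graph:]\.\/]*/<A HREF="&">[&]<\/A> /g'
}

# A bare URL becomes a link.
printf 'see http://example.com/a.htm\n' | linkify

# A hand-written anchor is rewritten as well, which breaks it.
printf '<A HREF="http://example.com/">x</A>\n' | linkify
```

In the second case the pattern runs on through the closing quote and the rest of the tag, since those are all printable non-space characters, and the result is an anchor nested inside another anchor. Text containing such links is better added after the conversion.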

Website Provider: Outflux.net, www.Outflux.net
URL:http://jnocook.net/geek/htm.htm