Microsoft Word 2007 produces relatively clean HTML when you use the Publish feature to create a blog post. Although the XHTML purist will still be unhappy with anything they don't write themselves, the HTML far surpasses anything we've seen from previous versions of Word. Unfortunately, this feature is only available for blog posting, as far as I can tell. The "Web Page" and "Web Page, Filtered" options in the "Save As" menu still produce the same mso-littered HTML that makes Word impossible to use as a serious HTML editor. I'd like to use the HTML output by the Publish feature for purposes other than blogging.
I wasn't sure how to get to this Publish feature, but after looking at the MetaWeblog API that Word can consume, I decided it wouldn't be too hard to write something in ASP.NET that would run on localhost and give me exactly what I wanted from Word. Specifically: convert a document into clean HTML and PNG graphics and drop the files into a local directory. The job looked even easier once I snooped around the Subtext Subversion repository and discovered that Cook Computing's XML-RPC library does all the heavy lifting and XML parsing.
It took a bit of debugging to find out, but the IMetaWeblog interface defined in the XML-RPC library (interfaces\MetaWeblogAPI.cs) needs a few tweaks to work with Word. First, Word invokes a blogger.getUsersBlogs method that isn't defined in the interface, but is easy to add:
[XmlRpcMethod("blogger.getUsersBlogs", Description = "...")]
BlogInfo[] getUsersBlogs(string blogid, string username, string password);
Second, Word appears to pass an integer for the blogid parameter of the newMediaObject method, while the service expects a string. I don't know enough about the history of the MetaWeblog API to know who is wrong in this scenario, but it's easy to fix the method definition in the interface:
[XmlRpcMethod("metaWeblog.newMediaObject",
    Description = "Makes a new file to a designated blog using the "
    + "metaWeblog API. Returns url as a string of a struct.")]
MediaObjectInfo newMediaObject(
    int blogid, // this was a string, but that doesn't work with Word...
    string username,
    string password,
    FileData file);
One last change is to Refactor -> Rename the UrlInfo struct in MetaWeblogAPI.cs to MediaObjectInfo. The rename allows Word and the MetaWeblog service to agree on the name of the struct.
public struct MediaObjectInfo // this used to be called UrlInfo
{
    public string url;
}
Once all this is done, it's a simple matter to implement that interface in an HttpHandler (ashx file).
public class MetaWebLogging : XmlRpcService, IMetaWeblog
{
    // ...
}
Each method needs an implementation. For my workflow, I'm moving files around on the hard drive, but here is a sample implementation for the newPost method that will dump the incoming HTML into a file in the root directory of the application.
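public string newPost(string blogid, string username, string password,
    Post post, bool publish)
{
    // Build the output path: <application root>\<post title>.htm
    string fileName = Path.Combine(
        HttpContext.Current.Server.MapPath("~"),
        post.title + ".htm");

    // File.Create truncates an existing file. File.OpenWrite does not,
    // so republishing a shorter document would leave stale bytes at the end.
    using (FileStream fs = File.Create(fileName))
    using (StreamWriter writer = new StreamWriter(fs))
    {
        writer.Write(post.description);
    }

    // The file name doubles as the post id returned to Word.
    return Path.GetFileName(fileName);
}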
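One caveat with the sample above: post.title goes straight into a file name, so a document title containing a character like ? or : will make the write fail. The images Word sends through newMediaObject need the same care, since the name arrives from the client. Here is a minimal sketch of two helpers that could sit behind both methods. The ExportFiles class and its method names are my own invention for illustration, not part of the XML-RPC library, and I'm assuming the image arrives as a name plus a byte array.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, not part of the XML-RPC library.
public static class ExportFiles
{
    // Replace characters that are illegal in a file name with '_'.
    // Falls back to "untitled" when nothing usable is left.
    public static string ToSafeFileName(string title)
    {
        if (string.IsNullOrEmpty(title))
            return "untitled";

        char[] invalid = Path.GetInvalidFileNameChars();
        StringBuilder sb = new StringBuilder(title.Length);
        foreach (char c in title.Trim())
        {
            sb.Append(Array.IndexOf(invalid, c) >= 0 ? '_' : c);
        }

        // Trailing dots and spaces cause trouble on Windows, and this
        // also turns "." and ".." into the fallback name.
        string cleaned = sb.ToString().TrimEnd('.', ' ');
        return cleaned.Length == 0 ? "untitled" : cleaned;
    }

    // Write the image bytes into rootDir and return the bare file name.
    public static string SaveMedia(string rootDir, string mediaName, byte[] bits)
    {
        // Keep only the last path segment so a client-supplied name
        // such as "sub/img.png" cannot land outside rootDir.
        string name = (mediaName ?? "").Replace('\\', '/');
        name = name.Substring(name.LastIndexOf('/') + 1);

        string fileName = ToSafeFileName(name);
        File.WriteAllBytes(Path.Combine(rootDir, fileName), bits);
        return fileName;
    }
}
```

With something like this in place, newPost would combine the root directory with ExportFiles.ToSafeFileName(post.title) + ".htm", and newMediaObject would hand the incoming name and bytes to SaveMedia and put the returned file name into the url field of MediaObjectInfo.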
Now I just point Word 2007's Publish feature to my local metablog.ashx file and export documents as HTML. For what I needed to do, this little hack was a huge time saver. Hopefully, future versions of Word will make this even easier.