Microsoft Word 2007 produces relatively clean HTML when you use the Publish feature to create a blog post. Although XHTML purists will still be unhappy with anything they don't write themselves, the HTML far surpasses anything we've seen from previous versions of Word. Unfortunately, as far as I can tell, this feature is only available for blog posting. The "Web Page" and "Web Page, Filtered" options in the "Save As" menu still produce the same .mso-littered HTML that makes Word impossible to use as a serious HTML editor. I'd like to use the HTML output by the Publish feature for purposes other than blogging.
I wasn't sure how to get to this Publish feature, but after looking at the MetaWeblog API that Word can consume, I decided it wouldn't be too hard to write something in ASP.NET that would run on localhost and give me exactly what I wanted from Word: convert a document into clean HTML and PNG graphics, and drop the files into a local directory. The job looked even easier once I snooped around the SubText Subversion repository and discovered that Cook Computing's XML-RPC library does all the heavy lifting and XML parsing.
The IMetaWeblog interface defined in the XML-RPC library (interfaces\MetaWeblogAPI.cs) needs a few tweaks to work with Word, which took a bit of debugging to discover. First, Word invokes a blogger.getUsersBlogs method that isn't defined in the interface, but is easy to add:
[XmlRpcMethod("blogger.getUsersBlogs", Description = "...")]
BlogInfo[] getUsersBlogs(
    string appKey, // the Blogger API passes an application key here, not a blog id
    string username,
    string password);
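On the handler side, this method can be very small. The sketch below is an illustration, not code from the XML-RPC library: the BlogInfo struct is redeclared with the three members the Blogger API returns (blogid, url, blogName), and the LocalBlogs class and the localhost values are placeholders. It assumes a single fixed entry is enough for a local export target.

```csharp
using System;

// Hypothetical mirror of the struct returned by blogger.getUsersBlogs.
public struct BlogInfo
{
    public string blogid;
    public string url;
    public string blogName;
}

public static class LocalBlogs
{
    // Returns one fixed "blog" for the caller to publish to.
    public static BlogInfo[] GetUsersBlogs(string appKey, string username, string password)
    {
        // The arguments are ignored: this sketch assumes the handler
        // is only reachable from localhost.
        BlogInfo blog = new BlogInfo();
        blog.blogid = "1";                  // placeholder id
        blog.url = "http://localhost/";     // placeholder url
        blog.blogName = "Local HTML export";
        return new BlogInfo[] { blog };
    }
}
```

In the real handler, the interface method would simply return the same array.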
Second, Word appears to pass an integer for the blogid parameter of the newMediaObject method, while the service expects a string. I don't know enough about the history of the MetaWeblog API to know who is wrong in this scenario, but it's easy to fix the method definition in the interface:
[XmlRpcMethod("metaWeblog.newMediaObject",
    Description = "Makes a new file to a designated blog using the "
    + "metaWeblog API. Returns url as a string of a struct.")]
MediaObjectInfo newMediaObject(
    int blogid, // this was a string, but that doesn't work with Word...
    string username,
    string password,
    FileData file);
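One way to confirm a type mismatch like this is to look at the raw XML-RPC request body. The helper below is a standalone sketch, not part of the XML-RPC library; XmlRpcParamTypes and Describe are made-up names. It lists the XML-RPC type of each parameter in a methodCall, treating a value with no type element as a string, which is the XML-RPC default.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlRpcParamTypes
{
    // Returns the XML-RPC type name ("int", "string", "struct", ...) of each
    // <param> in a <methodCall> document, in order.
    public static string[] Describe(string requestXml)
    {
        XDocument doc = XDocument.Parse(requestXml);
        XElement parameters = doc.Root.Element("params");
        if (parameters == null)
        {
            return new string[0]; // a call without parameters
        }

        return parameters
            .Elements("param")
            .Select(p => p.Element("value"))
            .Where(v => v != null)
            // <value>text</value> with no child element is an implicit string.
            .Select(v => v.HasElements ? v.Elements().First().Name.LocalName : "string")
            .ToArray();
    }
}
```

For a request whose first parameter is written as <value><int>1</int></value>, the first entry in the result is "int" (some clients write <i4> instead, which would be reported as "i4").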
One last change is to Refactor -> Rename the UrlInfo struct in MetaWeblogAPI.cs to MediaObjectInfo. The rename allows Word and the MetaWeblog service to agree on the name of the struct.
public struct MediaObjectInfo // this used to be called UrlInfo
{
    public string url;
}
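To show where the renamed struct ends up, here is a minimal sketch of what a newMediaObject implementation might do with an uploaded image. It is an illustration only: FileData and MediaObjectInfo are redeclared so the snippet compiles on its own (the name, type, and bits members are assumed from the MetaWeblog API), and MediaStore, Save, and targetDirectory are hypothetical names.

```csharp
using System;
using System.IO;

// Hypothetical mirrors of the structs in MetaWeblogAPI.cs.
public struct FileData
{
    public byte[] bits;  // raw file content
    public string name;  // file name suggested by the client
    public string type;  // MIME type, e.g. "image/png"
}

public struct MediaObjectInfo
{
    public string url;
}

public static class MediaStore
{
    // Writes the uploaded bytes into targetDirectory and returns the bare
    // file name as the url.
    public static MediaObjectInfo Save(string targetDirectory, FileData file)
    {
        // Keep only the last path segment, so a nested name such as
        // "folder/image.png" lands directly in targetDirectory.
        string safeName = Path.GetFileName(file.name.Replace('\\', '/'));
        string path = Path.Combine(targetDirectory, safeName);
        File.WriteAllBytes(path, file.bits);

        MediaObjectInfo info = new MediaObjectInfo();
        info.url = safeName; // relative url: HTML and images sit side by side
        return info;
    }
}
```

Returning a relative url is a design choice for a local export, where the HTML file and its images are meant to live in the same directory; a real blog engine would return an absolute address.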
Once all this is done, it's a simple matter to implement the interface in an HttpHandler (.ashx file).
public class MetaWebLogging : XmlRpcService, IMetaWeblog
{
    // ...
}
Each method needs an implementation. For my workflow, I'm moving files around on the hard drive, but here is a sample implementation for the newPost method that will dump the incoming HTML into a file in the root directory of the application.
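// Needs: using System.IO; using System.Web;
public string newPost(string blogid, string username, string password,
                      Post post, bool publish)
{
    string fileName = Path.Combine(
        HttpContext.Current.Server.MapPath("~"),
        post.title + ".htm");

    // File.Create truncates an existing file; File.OpenWrite would leave stale
    // bytes behind when a post is re-published with shorter content.
    using (FileStream fs = File.Create(fileName))
    using (StreamWriter writer = new StreamWriter(fs))
    {
        writer.Write(post.description);
    }

    return Path.GetFileName(fileName);
}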
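One caveat with using the post title as the file name: titles can contain characters, such as a slash, that are not valid in file names (the exact set depends on the operating system). The helper below is a sketch of one way to handle that; PostFileNames, FromTitle, and the "untitled" fallback are hypothetical names and choices, not part of the handler above.

```csharp
using System;
using System.IO;
using System.Text;

public static class PostFileNames
{
    // Builds "<title>.htm", replacing every character the current platform
    // rejects in file names with an underscore.
    public static string FromTitle(string title)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        StringBuilder sb = new StringBuilder();
        foreach (char c in title ?? string.Empty)
        {
            sb.Append(Array.IndexOf(invalid, c) >= 0 ? '_' : c);
        }

        string name = sb.ToString().Trim();
        if (name.Length == 0)
        {
            name = "untitled"; // hypothetical fallback for empty titles
        }
        return name + ".htm";
    }
}
```

With this in place, the post.title + ".htm" expression could be swapped for PostFileNames.FromTitle(post.title).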
Now I just point Word 2007's Publish feature to my local metablog.ashx file and export documents as HTML. For what I needed to do, this little hack was a huge time saver. Hopefully, future versions of Word will make this even easier.
Comments
This makes sense because some blogs might use a GUID or some other value other than an int.
SharePoint 2007 has a document conversions feature where you upload a Word doc, and it fires up a workflow that converts the doc and publishes an HTML version of it. That way, you don't have to write ANY code :).
That's 1 step lazier than even this. LOL
SM
In fact, I was looking at MetaWebLogApi myself just yesterday, with a view to moving it into my blog. I hadn't gotten around to it, and my first stop was SubText too <s>. I still haven't moved forward with this, so any 'standalone' application would help...
Oddly for getCategories (which also receives the blogId parm) the value is passed as a string. Somebody was sleeping on the job...
For newPost, another solution is to declare the blogId as an object in both the interface and the method. This lets the same interface work with both string and int values; in both cases it ends up as a string in the code. Actually, every method that takes a BlogId would have to use object instead (newMediaObject in particular).