
This is part of the continuing ebizQ Open Source Software (OSS) “Applications” series that looks at the importance of applications to the growth of OSS within the software market. This article looks specifically at enterprise content management and other content management functionality.



Introduction

It is conventional information-technology (IT) market wisdom that there is four to 10 times as much unstructured data as structured data (the kind typically managed by relational database software) in enterprises and other organizations. There is no estimate for the volume of such unstructured information held by and for individuals; simply think of the typical person’s daily delivery of snail mail plus his or her Hotmail or AOL account, and begin to add it up. According to the same conventional wisdom, the extent to which all such data and information needs to be classified, taxonomized and otherwise organized (and the work of deciding which information needs such treatment and which doesn’t) is an almost immeasurable opportunity for software and IT suppliers.

Therefore, as with all things immeasurable and unpredictable, users, suppliers and investors are taking it slow. IT staffs do not want to attack an unstructured data project until top management in their organization identifies it as a priority. Suppliers are not interested unless users are; investors follow that lead. Even where IT staffs and suppliers show interest, investors hesitate because of the lack of hard measurements.

As a result, over the history of the IT market only a small percentage of software spending has been devoted to content management, despite the conventional wisdom. And almost any company that grows dramatically based on content management functionality (for example Documentum, FileNet and Stellent) is acquired by a larger IT supplier (EMC, IBM and Oracle, respectively), often primarily to use its content management features as underlying technology for some other solution. (That is also why there is crossover from content management software to the business intelligence software arena, which ebizQ covered in an earlier article in this series, and to business process management middleware, which ebizQ will cover early in 2008.)

Open Source Software Will Handle the Content Management Need…Or Prove It Doesn’t Exist

That history of for-profit content management software suppliers and software revenue does not necessarily mean that organizations and individuals are not spending large amounts of money and time managing their unstructured data. One theory holds that the work is simply being done manually and/or with personal productivity software.

That’s what has made content management an intriguing issue in the IT market for decades. And the sheer size of the perceived problem makes content management an ideal challenge for open source software (OSS) communities and the OSS development model. Even if there is only four times as much unstructured data as structured data, that data is likely fragmented and industry centric, reducing the interest of the rapidly consolidating set of large software suppliers. Where an opportunity crosses many industries and is immeasurable and potentially unprofitable, OSS is a likely means of meeting the need. (If, once the need is met, suppliers can make a profit from the community’s development, that is all the better of course.)

OSS even has the advantage of proving conventional wisdom wrong. If the need does not really exist, OSS communities will not form.

Some Definition is Needed…by ebizQ and Users

The complex and immeasurable nature of the content management opportunity is probably the reason that there also seem to be four to 10 times as many acronyms and terms for software that does unstructured data management as for any other single IT concept. Examples include content brokering and content intelligence (which handle data classification and structure), content management system (CMS), content platforms (which expose application programming interfaces to help manage metadata for unstructured data), enterprise content management (ECM), enterprise information management, knowledge management (KM), natural computation, text mining and the honest but un-marketing-like term, unstructured data management (UDM).

This array of terminology meant that this ebizQ research into content management required a taxonomy before it could even begin. The ebizQ definition is that ECM is related to and a superset of CMS. Although similar, ECM products have features more related to the C (content) than to M (management), whereas CMS products are more about management. In addition, many “content-related” projects/products—open and closed source—are really web site development and management tools, with very little “C” or “M” functionality. Such tools were not included in the ebizQ research even though some of the distinctions are blurry as illustrated in the diagram.

Figure 1. The spectrum of content-related software

Two characteristics are important: some form of search is assumed to be part of ECM, and some understanding of context, not just tagging, is required of all the packages. IT staffers will want to watch for both as they look for OSS content management software, depending on their needs. These needs could be single-server and intra-company in nature, or multiple-server and intercompany. Typically the software will be web based, but that is not a criterion either.

The Features that Matter in ECM/CMS

ebizQ identified four key features that seemed to make a difference when looking at OSS ECM and CMS with these characteristics/needs in mind:

  • What the context/search functionality is based on
  • How the software is made available in the market
  • How the UDM software works with structured data
  • How the ECM/CMS works with the underlying mid-stack software

Many of the OSS ECM and CMS projects/products reviewed work with Apache Lucene for search functionality. Metadata is typically used to categorize and search content. In addition, Nuxeo and others have also chosen to use Apache Jackrabbit, an implementation of the Content Repository for Java Technology API (JCR), or to implement the JCR standard (also called Java Specification Request 170) in other ways. In OpenCMS, a template engine enforces a site-wide corporate layout and World Wide Web Consortium (W3C) standards compliance for all content.
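To make the idea of categorizing and searching content by metadata concrete, here is a minimal sketch in Python. It is not Lucene's API or any product's implementation; the class name `MetadataIndex`, the field names and the document ids are all hypothetical, chosen only to show how an inverted index over metadata fields can answer a query.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index over document metadata (illustrative, not Lucene)."""

    def __init__(self):
        # Maps (field, lowercased value) -> set of document ids.
        self._postings = defaultdict(set)

    def add(self, doc_id, metadata):
        """Index one document; metadata maps a field name to a list of values."""
        for field, values in metadata.items():
            for value in values:
                self._postings[(field, value.lower())].add(doc_id)

    def search(self, **criteria):
        """Return sorted ids of documents matching every field=value pair."""
        result = None
        for field, value in criteria.items():
            matches = self._postings.get((field, value.lower()), set())
            result = set(matches) if result is None else result & matches
        return sorted(result or [])


# Hypothetical documents and metadata, for illustration only.
index = MetadataIndex()
index.add("drawing-17", {"type": ["mechanical drawing"], "dept": ["engineering"]})
index.add("memo-03", {"type": ["memo"], "dept": ["engineering"]})
index.add("memo-09", {"type": ["memo"], "dept": ["legal"]})

engineering_memos = index.search(type="memo", dept="engineering")
print(engineering_memos)
```

A production search engine adds full-text analysis, ranking and persistence on top of this basic lookup structure; the sketch covers only the metadata-matching step.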

The way ECM and CMS software gets to market is an important differentiator. As mentioned above, closed source ECM and CMS software has often been underlying technology for other solutions, and that trend seems to be continuing with OSS ECM and CMS. Before an IT staff chooses to directly acquire content-related software, it makes sense to see whether the staff’s ultimate objective (for example, handling mechanical drawings) has already been achieved by another software supplier. Alfresco has a partnership with Formtek to provide such software, and Jahia is OEM’d in several platforms, such as Managed Objects’ Business Dashboard and Conenza’s Social Networking platform.

All the ECM/CMS software that ebizQ looked at works with structured data (such as data stored in an Oracle relational DBMS, or RDBMS). For example, Alfresco and many others use the JBoss-affiliated Hibernate OSS to provide data persistence, supporting underlying RDBMSs including MySQL, PostgreSQL, HSQLDB (the Java effort based on the product/project called the Hypersonic SQL database), Microsoft SQL Server and Oracle. Not all products support all database brands. Both Jahia and Magnolia emphasize their ability to publish to portals following the JSR168 standard. Nuxeo is based on the architecture defined by the OSGi consortium, a Java services platform for component management in secure environments.
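As a rough illustration of what persisting content against an underlying RDBMS involves, the Python sketch below stores content items with structured columns alongside the unstructured body. It does not use Hibernate (a Java library); the standard-library `sqlite3` module stands in for whichever database a product supports, and the table name `content_item` and the helper functions are assumptions made for this example.

```python
import sqlite3

# In-memory SQLite stands in for whichever RDBMS a product supports.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_item ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " mime_type TEXT NOT NULL,"
    " body BLOB NOT NULL)"
)


def save_item(title, mime_type, body):
    """Store one piece of content and return its generated id."""
    cur = conn.execute(
        "INSERT INTO content_item (title, mime_type, body) VALUES (?, ?, ?)",
        (title, mime_type, body),
    )
    conn.commit()
    return cur.lastrowid


def titles_by_type(mime_type):
    """List titles of stored items with the given MIME type, alphabetically."""
    rows = conn.execute(
        "SELECT title FROM content_item WHERE mime_type = ? ORDER BY title",
        (mime_type,),
    )
    return [row[0] for row in rows]


# Hypothetical content items, for illustration only.
save_item("Q3 report", "application/pdf", b"%PDF-...")
save_item("Pump housing", "image/tiff", b"II*\x00")
save_item("Annual report", "application/pdf", b"%PDF-...")

pdf_titles = titles_by_type("application/pdf")
print(pdf_titles)
```

An object-relational mapping layer such as Hibernate generates SQL of this kind for each supported database dialect, which is one reason products can list several RDBMS brands.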

Also important is how ECM/CMS software interfaces with OSS mid-stack software (e.g., Apache).

The array of mid-stack software certified and/or at least supported provides open choice, as described often in ebizQ OSS research. Alfresco runs in J2EE application servers and/or JSR168 portals such as Apache Tomcat, the JBoss application server and/or portal, and the Liferay portal, which is released under the MIT OSS license. Jahia interfaces with the OSS mid-stack software Hibernate, Apache Pluto, Apache Jetspeed 2, EHCache, Lucene, Zimbra’s AJAX libraries, Struts, and Spring, among others. eZ Publish uses Apache and PHP. Of course many but not all of the OSS ECM and CMS products mentioned also interface with closed-source mid-stack software such as BEA WebLogic, the IBM WebSphere Application Server, and Microsoft mid-stack functionality.

OSS Aspects of ECM/CMS Software

Looking at these products for their OSS aspects (in alphabetical order), Alfresco is one of the better known because of over a million downloads and the high profile of its founders, including John Newton, who founded Documentum. The product is based on Java, Spring, and web services, and was built using aspect oriented programming (AOP), which John Newton believes was as important a development decision as the use of OSS. Alfresco is associated with MyFaces and with the Lucene project (the latter for its search functionality), as well as the projects mentioned above in the paragraphs concerning Alfresco’s support for search, mid-stack software and structured data. Alfresco contributes back to all of the above as well as to JBoss.org’s jBPM and the Java in-process cache software called EHCache.

eZ Systems’ eZ Publish claims 3,000 customers and a wide global presence, with 85 employees of 23 nationalities working in offices in Norway, Denmark, Germany, Ukraine, France, Belgium and North America. eZ Publish is built on Apache, PHP and PostgreSQL or MySQL, and eZ Systems participates in those communities.

Jahia software is used by both Fortune 500 companies and major public administrations. As mentioned above, it works with Hibernate, Apache Pluto, Apache Jetspeed 2, EHCache, Lucene, Struts and Spring. The Jahia organization contributes back to Jetspeed 2, Apache Jackrabbit, Struts common validator, and Bedework, among other OSS communities.

Magnolia has recently adopted the GNU General Public License version 3 and is associated with the open source workflow engine project (OpenWFE), the Maven OSS project management application, and many other communities. Outbound, Magnolia contributes to OpenWFE and of course its own community.

Nuxeo has been in business since 2001 and has delivered complex ECM solutions to many large organizations in Europe. Inbound, it uses Lucene, JBoss components (Seam, jBPM, etc.), Sun Metro (JAX-WS), and other OSS, and it contributes back to JBoss Seam and various component efforts (for example, JODConverter). As mentioned above, Nuxeo is built on the OSGi component architecture.

OpenCMS from Alkacon Software is associated with Lucene, FCKeditor (the OSS HTML text editor), Apache Commons, MySQL, and Tomcat, and contributes back to FCKeditor and Lucene.

As for licensing terms and conditions, some providers, most notably Nuxeo and Alfresco, offer only one version of their software. However, most offer both “community editions” and more traditionally licensed “enterprise editions.” For example, OpenCMS Enterprise Edition (OCEE) is a binary-only distribution, but Alkacon is about to release more functionality as a series of additional OSS modules. The first release in this series is the OpenCMS Newsletter Module.

What About the Foundations?

No discussion about OSS can ignore the foundation world. As mentioned above, almost every ECM/CMS project uses two or more Apache components. In addition, Apache has a project called UIMA (for Unstructured Information Management Architecture) in incubation. UIMA is a framework and software development kit for developing such applications. Each UIMA component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++. Apache UIMA is an Apache-licensed OSS implementation of the UIMA specification being developed concurrently by a technical committee within the OASIS standards organization.
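The component model described above can be sketched in a few lines of Python. This is not UIMA's API (UIMA components are written in Java or C++ and described in XML descriptor files); the `Annotator` interface, the dictionary-based `descriptor`, the two sample annotators and the sample text are all hypothetical stand-ins that show the pattern: components implement a common interface, describe their own outputs, and are run in sequence by a framework that passes a shared analysis structure between them.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Shared analysis structure passed from one component to the next."""
    text: str
    annotations: list = field(default_factory=list)  # (type, start, end) tuples


class Annotator:
    """Interface each component implements; `descriptor` is its self-description."""
    descriptor = {"name": "base", "outputs": []}

    def process(self, doc):
        raise NotImplementedError


class EmailAnnotator(Annotator):
    descriptor = {"name": "email", "outputs": ["Email"]}

    def process(self, doc):
        # Mark simple e-mail-like spans in the text.
        for m in re.finditer(r"[\w.]+@[\w.]+\.\w+", doc.text):
            doc.annotations.append(("Email", m.start(), m.end()))


class YearAnnotator(Annotator):
    descriptor = {"name": "year", "outputs": ["Year"]}

    def process(self, doc):
        # Mark four-digit years in the 1900s and 2000s.
        for m in re.finditer(r"\b(?:19|20)\d{2}\b", doc.text):
            doc.annotations.append(("Year", m.start(), m.end()))


def run_pipeline(components, text):
    """Framework role: run each component in order over one shared document."""
    doc = Document(text)
    for component in components:
        component.process(doc)
    return doc


doc = run_pipeline(
    [EmailAnnotator(), YearAnnotator()],
    "Contact jane@example.org about the 2007 archive.",
)
found = [(kind, doc.text[start:end]) for kind, start, end in doc.annotations]
print(found)
```

Because each component declares its outputs in its descriptor, a framework can in principle check that a pipeline is wired sensibly before running it; the sketch leaves that validation out.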

Another related foundation emerged in 2007 out of long-time work at MIT. Supported by HP, it is called the DSpace Foundation. DSpace is an OSS solution for accessing, managing and preserving scholarly works in a digital archive. The foundation says more than 200 projects worldwide are using the software to digitally capture, preserve and share their artifacts, documents, collections and research data.

Author’s Note:

The ebizQ.net OSS product series of articles began in July with an overview of the OSS Taxonomy, explaining the differences ebizQ sees between applications, mid-stack software, and operating infrastructure as well as between open and closed source products. A previous article in this “Application” series covered business intelligence (BI), and the next article in the series will cover OSS enterprise resource planning (ERP). Previous articles in the “Mid-Stack” series have covered the enterprise service bus (ESB), in two parts, and application and web server software, arguably the functionality area where modern OSS began. Future “Mid-Stack” series articles will introduce OSS integration server/business process management (BPM) middleware products and projects as well as industry-specific mid-stack software.


About the Author

Dennis Byron brings three decades of analyst experience to his role as ebizQ's Community Manager for Improving Business Processes. This community covers Business Process Management (BPM), Process Modeling, Process Analysis, and Business Activity Monitoring (BAM), among other topics.

As Community Manager, Byron will blog and podcast to keep the ebizQ community fully informed on the latest news and breakthroughs relevant to enterprise BPM. Byron will be responsible for bringing you breaking news on BPM daily, writing feature articles and sourcing content from other analysts, industry associations and vendors for publication on ebizQ. Finally, each week, Byron will compile the most important news and views in an e-mail newsletter for ebizQ's ever-growing BPM community.

Byron is ideally suited to the job, as he has researched and analyzed all areas of IT and information-systems use for the past 30 years. Byron looks at BPM market dynamics backed up by facts, while taking into account the perspective of the IT and business person. He is a frequent speaker and moderator on business processes, which will also be one of his roles as Community Manager.

Byron was the ERP and Middleware Analyst with the Datapro division of McGraw-Hill and IDC from 1991 to 2006. In these roles, he was the primary analyst for Business Process Management. He has conducted over 500 specific information-systems case studies. He has contributed to Application Development Trends, IT Business Edge, Research 2.0 and other publications.

Byron is also the principal of IT Investment Research, which is aimed at institutional and individual investors in IT, or anyone who enjoys peering under the covers of "the financials," where large companies and emerging IPOs like to bury their most interesting facts. His main area of interest is investment opportunities in enterprise software.

About ebizQ

ebizQ's stable of analysts, columnists and bloggers includes Beth Gold-Bernstein, David Kelly, Dennis Byron, Joe McKendrick, Brenda Michelson, Mike Rothman, Michael Dortch and many others, who are poised to keep you updated on all integration topics of note. Research is geared for business and IT professionals, vendors, and industry analysts. ebizQ's valuable analysis focuses entirely on business integration technologies, problems, challenges and solutions.