Open Source 
How Open Source Software Can Bring Structure to Unstructured Data
By Dennis Byron, Analyst, ebizQ
12/10/2007
This is part of the continuing ebizQ Open Source Software (OSS) “Applications”
series that looks at the importance of applications to the growth of OSS within
the software market. This article looks specifically at enterprise content
management and other content management functionality.
Introduction
It is conventional information-technology (IT) market wisdom that there is
four to 10 times as much unstructured data as structured data (the kind typically
managed by relational database software) in enterprises and other organizations.
There is no estimate for the volume of such unstructured information held by
and for individuals; simply think of the typical person’s daily delivery
of snail mail plus his or her Hotmail or AOL account, and begin to add it up.
The extent to which all such data and information needs to be classified, taxonomized
and otherwise organized—and deciding which information needs such treatment
and which doesn’t—is an almost immeasurable opportunity for software
and IT suppliers, says the same conventional wisdom.
Therefore, as with all things immeasurable and unpredictable, users, suppliers
and investors are taking it slow. IT staffs do not want to attack an unstructured
data project until top management in their organization identifies it as a priority.
Suppliers are not interested unless users are; investors follow that lead. Even
where IT staffs and suppliers show interest, investors hesitate because of the
lack of hard measurements.
As a result, over the history of the IT market only a small percentage of software
spending has been devoted to content management despite the conventional wisdom.
And almost any company that grows dramatically based on content management functionality—for
example Documentum, FileNet and Stellent—is acquired by a larger IT supplier
(EMC, IBM and Oracle in these examples), often primarily to use its content
management features as underlying technology for some other solution. (That
is also why there is crossover from content management software to the
business intelligence software arena, which ebizQ covered in an earlier article
in this series, and to business process management middleware, which ebizQ will
cover early in 2008.)
Open Source Software Will Handle the Content Management Need…Or Prove
It Doesn’t Exist
That history of for-profit content management software suppliers and software
revenue does not necessarily mean that organizations and individuals are not
spending large amounts of money and time managing their unstructured data. One
theory holds that the work is simply being done manually and/or with personal
productivity software.
That is what has made content management an intriguing
issue in the IT market for decades. And the sheer size of the perceived problem
makes content management an ideal challenge for open source software (OSS) communities
and the OSS development model. Even if there is only four times as much unstructured
data as structured data, that data is likely fragmented and industry centric,
reducing the interest of the rapidly consolidating set of large software suppliers.
Where an opportunity crosses many industries and is immeasurable and potentially
unprofitable, OSS is a likely means of meeting the need. (If, once the need is met,
suppliers can make a profit from the community’s development, that is all the better,
of course.)
OSS even has the advantage of proving conventional wisdom wrong. If the need
does not really exist, OSS communities will not form.
Some Definition is Needed…by ebizQ and Users
The complex and immeasurable nature of the content management opportunity is
probably the reason that there also seem to be four to 10 times as many acronyms
and terms for software that does unstructured data management as for any other
single IT concept. Examples include content brokering, content intelligence
(which handle data classification and their structures), content management
system (CMS), content platforms (which expose application programming interfaces
to help manage metadata for unstructured data), enterprise content management
(ECM), enterprise information management, knowledge management (KM), natural
computation, text mining and the honest but un-marketing-like term, unstructured
data management (UDM).
This array of terminology meant that this ebizQ research into content management
required a taxonomy before it could even begin. The ebizQ definition is that
ECM is related to and a superset of CMS. Although similar, ECM products have
features more related to the C (content) than to M (management), whereas CMS
products are more about management. In addition, many “content-related”
projects/products—open and closed source—are really web site development
and management tools, with very little “C” or “M” functionality.
Such tools were not included in the ebizQ research even though some of the distinctions
are blurry, as illustrated in the diagram.

Figure 1. The spectrum of content-related software
One characteristic is important: some form of search is assumed to be part
of ECM, and some understanding of context is required of all the packages. It
should not just be a matter of tagging. These are two characteristics that IT
staffers will want to watch for as they look for OSS content management
software, depending on their needs. These needs could be single-server and intra-company
in nature, or multiple-server and intercompany. Typically the software will be web based,
but that is not a criterion either.
The Features that Matter in ECM/CMS
ebizQ identified four key features that seemed to make a difference when looking
at OSS ECM and CMS with these characteristics/needs in mind:
- What the context/search functionality is based on
- How the software is made available in the market
- How the UDM software works with structured data
- How the ECM/CMS works with the underlying mid-stack software
Many of the OSS ECM and CMS projects/products reviewed work with Apache Lucene
for search functionality. Metadata is typically used to categorize and search
content. In addition, Nuxeo and others have chosen to use Apache Jackrabbit,
an implementation of the Content Repository for Java Technology API (JCR), or
to implement the JCR standard (also called Java Specification Request 170) in
other ways. In OpenCMS, a template engine enforces a site-wide corporate layout
and World Wide Web Consortium (W3C)-standard compliance for all content.
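The idea of using metadata to categorize and search content can be illustrated with a small sketch. This is not Lucene's API (Lucene is a Java library with its own analyzers, scoring and query syntax); it is a minimal pure-Python inverted index, and the class name `MetadataIndex`, the field names and the sample documents are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index mapping (field, term) pairs to document ids.

    Illustrative only: a real search engine such as Lucene adds text
    analysis, relevance scoring and persistent index storage.
    """

    def __init__(self):
        self._postings = defaultdict(set)  # (field, term) -> set of doc ids
        self._docs = {}                    # doc id -> original metadata

    def add(self, doc_id, metadata):
        """Register a document under every term of every metadata field."""
        self._docs[doc_id] = metadata
        for field, value in metadata.items():
            for term in str(value).lower().split():
                self._postings[(field, term)].add(doc_id)

    def search(self, **criteria):
        """Return sorted ids of documents matching ALL given field terms."""
        result = None
        for field, value in criteria.items():
            for term in str(value).lower().split():
                hits = self._postings.get((field, term), set())
                result = set(hits) if result is None else result & hits
        return sorted(result or [])


# Hypothetical sample content, described only by its metadata.
index = MetadataIndex()
index.add("doc-1", {"title": "Pump housing drawing", "category": "engineering"})
index.add("doc-2", {"title": "Quarterly sales memo", "category": "finance"})
index.add("doc-3", {"title": "Valve drawing revision", "category": "engineering"})

matches = index.search(category="engineering", title="drawing")
```

Here the unstructured files themselves are never parsed; only the metadata attached to each one is indexed, which is why consistent categorization matters as much as the search engine underneath.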
The way ECM and CMS software gets to market is an important differentiator.
As mentioned above, closed source ECM and CMS software has often been underlying
technology for other solutions and that trend seems to be continuing with OSS
ECM and CMS. Before an IT staff chooses to directly acquire content-related
software, it makes sense to see if the staff’s ultimate objective—for
example, handling mechanical drawings—has not already been achieved by
another software supplier. Alfresco has a partnership with Formtek to provide
such software, and Jahia is OEM’d in several platforms, such as Managed
Objects’ Business Dashboard and Conenza’s Social Networking platform.
All the ECM/CMS software that ebizQ looked at works with structured data (such
as data stored in an Oracle relational DBMS [RDBMS]). For example, Alfresco
and many others use the JBoss-affiliated Hibernate OSS to provide data persistence
supporting underlying RDBMSs including MySQL, PostgreSQL, HSQLDB (the
Java effort based on the product/project called the Hypersonic SQL database),
Microsoft SQL Server and Oracle. Not all products support all database brands.
Both Jahia and Magnolia emphasize their ability to publish to portals following
the JSR168 standard. Nuxeo is based on the architecture defined by the OSGi consortium,
a Java services platform for component management in secure environments.
Also important is how ECM/CMS software interfaces with OSS mid-stack software
(e.g., Apache).
The array of mid-stack software certified and/or at least supported provides
open choice as described often in ebizQ OSS research. Alfresco runs in J2EE
application servers and/or JSR168 portals such as Apache Tomcat, the JBoss application
server and/or portal, and the Liferay portal, which is released under the MIT
OSS license. Jahia interfaces with OSS mid-stack software including Hibernate, Apache
Pluto, Apache Jetspeed 2, EHCache, Lucene, Zimbra’s AJAX libraries, Struts,
and Spring, among others. eZ Publish uses Apache Tomcat and PHP. Of course many
but not all of the OSS ECM and CMS products mentioned also interface with closed-source
mid-stack software such as BEA WebLogic, the IBM WebSphere Application Server,
and Microsoft mid-stack functionality.
OSS Aspects of ECM/CMS Software
Looking at these products for their OSS aspects (in alphabetical order), Alfresco
is one of the better known because of over a million downloads and the high
profile of its founders, including John Newton, who founded Documentum. The product
is based on Java, Spring, and web services, and was built using aspect-oriented
programming (AOP), which John Newton believes was as important a development
decision as the use of OSS. Alfresco is associated with MyFaces and with
the Lucene projects (the latter for its search functionality) as well as the projects
mentioned above in paragraphs concerning Alfresco’s support for search,
mid-stack software and structured data. Alfresco contributes back to all of
the above as well as to JBoss.org’s jBPM and the Java in-process cache
software called EHCache.
eZ Systems’ eZ Publish claims 3,000 customers and a wide global presence, with
85 employees of 23 nationalities working in offices in Norway, Denmark, Germany,
Ukraine, France, Belgium and North America. eZ Publish is built on Apache, PHP
and PostgreSQL or MySQL, and eZ Systems participates in those communities.
Jahia software is used by both Fortune 500 companies and major public
administrations. As mentioned above, it works with Hibernate, Apache Pluto,
Apache Jetspeed 2, EHCache, Lucene, Struts and Spring, and the Jahia organization
contributes back to Jetspeed 2, Apache Jackrabbit, Struts common validator,
and Bedework, among other OSS communities.
Magnolia has recently adopted the GNU General Public License version 3 and
is associated with the open source workflow engine project (OpenWFE), the Maven
OSS project management application, and many other communities. Outbound, Magnolia
contributes to OpenWFE and of course its own community.
Nuxeo has been in business since 2001 and has delivered complex ECM solutions
to many large organizations in Europe. Inbound, it uses Lucene, JBoss components
(SEAM, jBPM, etc.), Sun Metro (JAXWS), and other OSS. Outbound, it contributes back to
JBoss SEAM and various component efforts (for example, JODConverter). As mentioned
above, Nuxeo is built on the OSGi component architecture.
OpenCMS from Alkacon Software is associated with Lucene, FCKeditor (the OSS
HTML text editor), Apache commons, MySQL, and Tomcat and contributes back to
FCKeditor and Lucene.
As for licensing terms and conditions, some providers, most notably Nuxeo
and Alfresco, offer only one version of their software. However, most offer
both “community editions” and more traditionally licensed “enterprise
editions.” For example, OpenCMS Enterprise Edition (OCEE) is a binary-only
distribution, but Alkacon is about to release more functionality as a series
of additional OSS modules. The first release in this series is the OpenCMS Newsletter
Module.
What About the Foundations?
No discussion about OSS can ignore the foundation world. As mentioned above,
almost every ECM/CMS project uses two or more Apache components. In addition,
Apache has a project called UIMA (for unstructured information management) in
incubation. UIMA is a framework and software developer kit for developing such
applications. Each UIMA component must implement interfaces defined by the framework
and must provide self-describing metadata via XML descriptor files. The framework
manages these components and the data flow between them. Components are written
in Java or C++. Apache UIMA is an Apache-licensed OSS implementation of the
UIMA specification being developed concurrently by a technical committee within
the OASIS standards organization.
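The component model described above (each component implements a framework-defined interface, describes itself through an XML descriptor, and lets the framework manage the data flow) can be sketched as follows. This is not the Apache UIMA API, which is written in Java and C++; every name here (`Component`, `Tokenizer`, `CapitalizedWordTagger`, `run_pipeline`) and the descriptor layout are hypothetical and far simpler than real UIMA descriptors.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Component(ABC):
    """Interface every component must implement (hypothetical, not UIMA's)."""

    # Self-describing metadata as XML; this layout is invented for illustration.
    DESCRIPTOR = "<component><name/><produces/></component>"

    @abstractmethod
    def process(self, document: dict) -> None:
        """Read from and add annotations to the shared document."""

    @classmethod
    def describe(cls) -> dict:
        """Parse the XML descriptor into a plain dictionary."""
        root = ET.fromstring(cls.DESCRIPTOR)
        return {"name": root.findtext("name"),
                "produces": root.findtext("produces")}


class Tokenizer(Component):
    DESCRIPTOR = ("<component><name>tokenizer</name>"
                  "<produces>tokens</produces></component>")

    def process(self, document):
        # Split the raw text on whitespace.
        document["tokens"] = document["text"].split()


class CapitalizedWordTagger(Component):
    DESCRIPTOR = ("<component><name>capitalized</name>"
                  "<produces>capitalized</produces></component>")

    def process(self, document):
        # Depends on the tokens produced by an earlier component.
        document["capitalized"] = [t for t in document["tokens"]
                                   if t[:1].isupper()]


def run_pipeline(components, text):
    """The 'framework': passes one shared document through each component."""
    document = {"text": text}
    for component in components:
        component.process(document)
    return document


result = run_pipeline([Tokenizer(), CapitalizedWordTagger()],
                      "Alfresco uses Lucene for search")
```

Because each component declares what it produces, a framework built this way could in principle check that a pipeline is ordered sensibly before running it; that check is left out of this sketch.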
Another related foundation emerged in 2007 out of long time work at MIT. Supported
by HP, it is called the DSpace Foundation. DSpace is an OSS solution for accessing,
managing and preserving scholarly works in a digital archive. The foundation
says more than 200 projects worldwide are using the software to digitally capture,
preserve and share their artifacts, documents, collections and research data.
Author’s Note:
The ebizQ.net OSS product series of articles began in July with an overview
of the OSS Taxonomy, explaining the differences ebizQ sees between applications,
mid-stack software, and operating infrastructure as well as between open and
closed source products. A previous article in this “Application” series
covered business intelligence (BI) and the next article in the series will cover
OSS enterprise resource planning (ERP). Previous articles in the “Mid-Stack”
series have covered the enterprise service bus (ESB), in two parts, and
application and web server software, arguably the functionality area where modern
OSS began. Future “Mid-Stack” series articles will introduce OSS integration
server/business process management (BPM) middleware products and projects as
well as industry-specific mid-stack software.
About the Author
Dennis Byron brings three decades of analyst experience to his role as
ebizQ's Community Manager for Improving Business Processes. This
community covers Business Process Management (BPM), Process Modeling,
Process Analysis, and Business Activity Monitoring (BAM), among other
topics.
As Community Manager, Byron will blog and podcast to keep the ebizQ
community fully informed on the latest news and breakthroughs relevant
to enterprise BPM. Byron will be responsible for bringing you breaking
news on BPM daily, writing feature articles and sourcing content from
other analysts, industry associations and vendors for publication on
ebizQ. Finally, each week, Byron will compile the most important news
and views in an e-mail newsletter for ebizQ's ever-growing BPM
community.
Byron is ideally suited to the job, as he has researched and analyzed
all areas of IT and information-systems use for the past 30 years.
Byron looks at BPM market dynamics backed up by facts, while taking
into account the perspective of the IT and business person. He is a
frequent speaker and moderator on business processes, which will also
be one of his roles as Community Manager.
Byron was the ERP and Middleware Analyst with the Datapro division of
McGraw-Hill and IDC from 1991 to 2006. In these roles, he was the
primary analyst for Business Process Management. He has conducted
over 500 specific information-systems case studies. He has contributed
to Application Development Trends, IT Business Edge, Research 2.0 and
other publications.
Byron is also the principal of IT Investment Research, which is aimed
at institutional and individual investors in IT, or anyone who enjoys
peering under the covers of "the financials," where large companies
and emerging IPOs like to bury their most interesting facts. His main
area of interest is investment opportunities in enterprise software.
About ebizQ
ebizQ's stable of analysts, columnists and bloggers includes Beth Gold-Bernstein, David Kelly, Dennis Byron, Joe McKendrick, Brenda Michelson, Mike Rothman, Michael Dortch and many others, who are poised to keep you updated on all integration topics of note. Research is geared for business and IT professionals, vendors, and industry analysts. ebizQ's valuable analysis focuses entirely on business integration technologies, problems, challenges and solutions.