April 22, 2008
Is Your Database Enterprise-Strength?
Regular readers of this blog will be used to my promoting free and/or open source solutions to enterprise software problems. However, there is one area in which I struggle to do so - namely, databases.
Given the ubiquity and importance of relational technology in the workplace, and the array of features offered by open source databases such as MySQL, this may seem a bizarre statement. Yet for many organizations, the primary concern is no longer flexibility or performance, but security:
Many organizations struggle to find a sustainable way to meet global GRC requirements around financial reporting, data security, records retention, risk management, and more.
FACT: Industry analysts AMR Research expect organizations to spend nearly $30 billion this year alone, grappling with questions such as:
- How can we stay on top of increasing regulatory demands while controlling cost?
- How can we better manage risk to prevent business and compliance failures?
- How do we achieve better performance while ensuring accountability and integrity?
Oracle Security Solutions, p.4
In my experience, such concerns are becoming more and more important to CIOs. Yet, in this area, there are few offerings, and little indeed that is free and/or open source.
In particular, if you wish to secure data at row level - so that each row has different access permissions, a normal enough requirement in an enterprise environment - options are few. The best approach appears to be an optional Oracle database feature known as Oracle Label Security (aka OLS). Here is how OLS works:
- First, security policies are established to identify how the data needs to be secured by specification of security components for the policies.
- Next, user labels are established that define what row-level security policies are possible for each user.
- For each table that needs to enforce row-level security, a special column called a label column is built and populated.
- During data access, a process called access mediation determines which permissions are required to access the row, and what actions can be performed on the row once it's accessed.
OLS uses three sets of criteria to define both the set of user's permissions to access data in a row as well as the row's accessibility: levels, compartments, and groups.
Levels. As the first security dimension's name implies, a level defines increasing data sensitivity. A typical example includes the standard security levels (Unclassified, Classified, Secret, and Top Secret). Another example for most companies is human resources information. Just about everyone needs to know everyone else's first and last name and e-mail address (i.e. company-wide access). However, only the employee, her supervisor, and the Human Resources department should know salary information about the employee, and (hopefully!) only the human resources coordinator should know about an employee's participation in a company-sponsored anger-management class.
Compartments. The second security dimension, a compartment defines the areas to which data access is restricted. In other words, compartments can be used to classify data. Typical examples of compartments include functional divisions within a company (Sales, Accounting, Human Resources, Information Technology).
Groups. A group is the third security dimension. It typically defines who is the owner of the data and provides yet another way to classify what type of access is permitted. However, groups have one important difference: They can be used to restrict access to data based on the owning organization's hierarchical structure. Business rules appropriate for group enforcement within a group include geographical areas (localities within states/provinces, and states/provinces within countries) and sales forces (regions that encompass several districts that themselves encompass territories). What's really great about this feature is that OLS allows me to restrict row-level access to specific nodes of the hierarchy. For example, I can grant a sales force's regional manager access to only sales generated within his region's districts; a district manager access to sales generated only within her district's territories; and a salesperson to only the sales generated within his territory.
Security Component Combinations. For each of the label security components, up to 10,000 different values may be established. OLS requires that, at a minimum, one value for the security level must be stored in each label column, even if it indicates unrestricted access is permitted. Note, however, that compartments and groups need not be included in the label column's value. Also, each row and each user can be assigned multiple access permissions for compartments and groups.
Oracle Label Security, Part 1: Overview, By Jim Czuprynski
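To make the three dimensions concrete, here is a minimal sketch of how a read check over levels, compartments and groups might work. It is an illustration only, not Oracle's API: the `Label` class, the `can_read` function and all the level, compartment and group names are hypothetical. It assumes the commonly described read rule, namely that the user's level must be at least the row's level, the user must hold every compartment on the row, and the user must hold at least one of the row's groups or a group above it in the hierarchy.

```python
from dataclasses import dataclass, field

# Hypothetical sensitivity levels; a higher number means more sensitive data.
LEVELS = {"UNCLASSIFIED": 10, "CLASSIFIED": 20, "SECRET": 30, "TOP_SECRET": 40}

# Hypothetical group hierarchy (child -> parent): territory -> district -> region.
GROUP_PARENT = {
    "TERRITORY_1": "DISTRICT_A",
    "DISTRICT_A": "REGION_WEST",
    "REGION_WEST": None,
}


@dataclass(frozen=True)
class Label:
    """A label carried by a user or a row: one level, optional compartments and groups."""
    level: str
    compartments: frozenset = field(default_factory=frozenset)
    groups: frozenset = field(default_factory=frozenset)


def group_and_ancestors(group):
    """Return the group together with every group above it in the hierarchy."""
    result = set()
    while group is not None:
        result.add(group)
        group = GROUP_PARENT.get(group)
    return result


def can_read(user: Label, row: Label) -> bool:
    """Illustrative read check: level first, then compartments, then groups."""
    if LEVELS[user.level] < LEVELS[row.level]:
        return False
    if not row.compartments <= user.compartments:
        return False
    if not row.groups:
        return True  # a row without groups is not restricted by group
    return any(user.groups & group_and_ancestors(g) for g in row.groups)


# A sales row owned by one territory, and two users with different group membership.
sales_row = Label("CLASSIFIED", frozenset({"SALES"}), frozenset({"TERRITORY_1"}))
regional_manager = Label("SECRET", frozenset({"SALES"}), frozenset({"REGION_WEST"}))
other_salesperson = Label("SECRET", frozenset({"SALES"}), frozenset({"TERRITORY_2"}))

print(can_read(regional_manager, sales_row))
print(can_read(other_salesperson, sales_row))
```

In this sketch the regional manager can read the row because the territory sits beneath his region, while a salesperson from a different territory cannot, even though both hold the same level and compartment.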
Functionality such as OLS should really be part of every database that claims to be enterprise-strength. Perhaps I have missed something, but I cannot see how to achieve equivalent results using (say) DB2, let alone open source alternatives such as MySQL. I believe DB2 has some sort of equivalent to Oracle's Virtual Private Database (VPD - the technology underpinning OLS) in its mainframe edition. But, to my knowledge, that's it, although I have not done proper comparative research in this area.
Further, OLS has been around since 2003 and still has major weaknesses - for instance in its support for J2EE, since use of OLS via TopLink is currently broken.
TAKE AWAY
I am bemused by the weakness of database offerings with regard to security - especially given the current worldwide focus on combating terrorist threats, the rise of cyber-crime, and the general acknowledgement that the most common threat to organizational security is from insiders.
Your comments are very welcome on this topic! If you have expertise in this area, please share it. I'd be very interested to know your thoughts.
Posted by keithhb in
Open Source
December 18, 2006
Open source and the short-term memory of the IT industry
Just back from speaking at Javapolis - a conference I can recommend to any readers with an interest in Java development. Extremely well organized by the Belgian Java User Group (no wonder Brussels is the administrative capital of Europe), multiple tracks all full of solid content, and enough sponsorship to be very reasonably priced. I should think that the value of the freebies and meals provided to attendees must be close to the registration cost of 200 Euros. Book early for next year, as the venue is getting too small for the 3000-odd attendees, so in future it may be first come first served.
My personal reflections after the conference focused on open source. Regular readers of this blog will be expecting the third instalment in the current series on the real-world business issues of SOA. However, it's the week before Christmas, so I will defer the heavy stuff until the New Year, and instead share with you some thoughts on what is happening at the leading edge of software production. And these thoughts are appropriate for the time of year, Christmas being a traditional time to take stock of how some things change and other things stay the same.
I am now aged 43, and have been working in the software industry for nearly 20 years. Many people who enter the industry as developers have moved on by this stage into less technically-oriented roles - consultancy, management, or writing, for example. In fact, I have played such roles myself for a long time now, but make sure also to periodically return to the coal face to refresh my development skills and actually build software, as I am doing at present. However, this is actually quite unusual - by their mid-forties, most people who started out by programming are no longer developing software in their daily work.
The result of this is that developers have, as a community, quite a short memory - and so tend to be unaware of long-term cycles in best practice. For instance, few developers now remember why the 1980s saw a general move away from procedural programming towards object-oriented programming. Hence, they are unaware of the deficiencies in the procedural programming techniques underpinning SOA and BPM, as evinced by languages such as BPEL. Similarly, few developers now seem aware of potential issues with weakly-typed programming languages such as JavaScript, or consider the implications of using an old-fashioned mainframe client-server programming model such as AJAX.
Further, a corollary of the youth of programmers is that they tend to take little account of such warnings, when issued by greybeards! This is healthy in some ways, but it is also worth remembering Santayana's famous words on how those who don't remember history are condemned to repeat it.
An interesting example of such cycles is the current stir around open source software. During the Javapolis conference I met and talked with various people engaged in producing and/or using open source software. And it was fascinating to hear not only their evangelical enthusiasm for open source, but also their concerns. These concerns centred around issues such as the following:
- Growing and retaining a developer team
- Growing and retaining a user base
- Maintaining code consistency and quality
- Preventing feature cherry-picking by competitors
- Monetizing products
- Retaining control over products
Sound familiar? There is nothing at all in the above list to differentiate open from closed source. All software vendors have these issues. And therefore, by association, so do their customers.
Many people seem to view open sourcing software as a solution in itself - both a solution for vendors (to gain a community) and a solution for customers (to lower costs). I don't see this happening. In every area of life, you get what you pay for, and enterprise software is no exception. Complex issues of software development and use don't magically go away - they just pop up in slightly different forms.
In fact, the discussions I had at Javapolis made me wonder even what it means to say software is "open source". Is it just about access to program code? Hardly. Enterprise customers of commercial software vendors have always been able to get hold of program source code if/when they need it, either by licensing it or by arranging for it to be held in escrow against the supplier going out of business. Further, most enterprise customers view access to the source of third party applications as a last-ditch option they hope and pray never to need.
So is it about price? Well, again no. Free software may well be closed source. Further, enterprises always have to pay for any software they use, if only in service costs - which have always been far greater than license costs, as Eric Raymond pointed out so clearly in his seminal work on open source, "The Cathedral and the Bazaar".
How about licensing? A tricky one. There are many variants of open source license, and much debate. However, it is fair to say that most producers of commercially significant open source code are very careful about the license they use - and about the way in which their software is released into the marketplace. Of the concerns listed above, the number one (it seemed to me) was retaining control. Few people really want to let go of their baby, especially if they have invested serious time into it, and that baby is now worth serious money.
Finally, is it about a difference in the way software is written and/or released? Here again, I see no real difference any more. Most significant open source projects and commercial software packages are gravitating towards a halfway house between the agile coder's "release early, release often" mantra and the "seasonal release, periodic patch" approach characteristic of traditional closed source. It is in neither the supplier's nor the customer's interests to adopt either extreme. OK, you can download a nightly drop of Eclipse, but only the diehards will install and try using this for their daily work - and very few people indeed would do so with an unstable release of JBoss application server.
In fact, I would argue that what "open source" is (or was) truly about is the community model of development, in which people from outside the boundary of a single organization actively contribute to the application, and engage with the developers to test it. This model is what I now see breaking down. Most successful open source applications these days are entirely controlled by a single commercial vendor - Sun (Java), IBM (Eclipse), Red Hat (JBoss), and so on. Nearly all, if not all, "committers" to such open source projects work for the company concerned. So how are such applications genuinely different from Windows or WebSphere?
I think we may be over halfway through a long-term cycle. The pendulum is already swinging back, away from open source and back towards more old-fashioned models of software production. It is quite possible that as we go forward, it will mean less and less to label a software application as "open source".
Just some seasonal reflections - no TAKE AWAY section this week! Your comments are welcome as always. And have a very Merry Christmas.
Posted by keithhb in
Open Source
March 28, 2006
A simple way to evaluate open source
In my consultancy work I have seen many situations in which people who knew they could deliver real business value by utilizing open source tools found their efforts strangled by corporate insistence on lengthy evaluation processes. Apart from the time and effort involved in going through such processes, which in itself is a serious deterrent, the cost of carrying out such processes is so high that you might as well buy commercial software from a leading brand.
Consultants and lawyers are the ones who really love this kind of thing - it keeps them in work, and they can make a good case for such an approach to senior executives by stressing risks. In fact it is partly the FUD (Fear, Uncertainty and Doubt) generated in this way that holds back the adoption of open source tools. The criteria typically listed for evaluating open source tools are in fact equally applicable to commercial software, but are rarely applied with anything like the same force to organizations that have an "Inc" or "plc" in their name. It is harder to do so anyway, not only because commercial companies are much less transparent than the open source community, but also because they may have less of an issue with honesty. Large manufacturing companies may think it normal to hire a business integrity consultant to check out a new parts supplier, but few CIOs do so with their software vendors.
As I commented in the first entry in this blog series, the hazards of open source are not those typically voiced as objections by consultants anyway – unstable or insecure software, availability of support, and legal issues. The open source projects I have discussed, for example, are perfectly viable from all these perspectives. In general, major open source software applications are written at least as well as leading commercial products (often by the same people), enthusiastically supported by expert and helpful developers (as opposed to knowledge-free call center staff), and transparently licensed (via industry-standard agreements). More to the point, anyone with enough experience in IT knows that leading, expensive commercial products are often deeply buggy, poorly supported and legally vulnerable.
So, how should one evaluate an open source system? Is there an easier way to determine the risk?
In my own software development work, I have used many different open source products, and always go through the same 3-stage process when selecting one.
First, I evaluate my own need as clearly as possible. What is it that I am really looking for? People often work under false assumptions when evaluating new software - assuming, for instance, that they need a database when all they really need is a means of saving objects to permanent storage, or that they need Web service tools that support the full WS-* stack when all they really need is a means of enabling communication between distributed components. Conversely, you may be looking for a tool to help write HTML or XML when you would be better served by a higher-level system that generated such low-level code automatically for you, concealing all the complex details.
This step is the fundamental one if you have any interest in open source. The reason so many open source projects exist is that people start them to meet real-world needs, of which there are an infinite and varied number. Somewhere out there, there is probably someone who has had exactly the same problem as you, who has solved it by writing the tools they needed, and who has then decided they may as well make them available to others via open source. If you are considering open source at all, you may well find you can get exactly what you need, not only what is offered by a limited number of commercial vendors.
So let's assume you have applied such lateral thinking and found a set of tools that meet your needs as closely as possible. Which ones are safe for enterprise adoption?
The second step is to tick some basic boxes - those above, for software stability, availability of support and legality. This is the work of minutes, not days. A project with 1 committer who last posted an update 3 years ago, that has no stated users, and whose license consists of "use this as you wish" is not a good bet - but you can feel secure in choosing a project that:
- Has a number of committers who post regular updates
- Can demonstrate a user base
- Is backed by VCs or major companies
- Uses one of the standard open source licenses (assuming that the conditions of the license do not preclude your intended use of the software).
For projects that fall somewhere between these two extremes, use your common sense - as you would when selecting commercial software.
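The tick-box step above is simple enough to write down as a few lines of code, which can be handy if you want to record why a project made the shortlist. This is only a sketch under stated assumptions: the `Project` fields, the list of "standard" licenses and the 180-day idle threshold are all hypothetical choices, not rules from any authority, and you should substitute your own.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical shortlist of "standard" licenses; adjust to your own policy.
STANDARD_LICENSES = {
    "Apache-2.0", "GPL-2.0", "LGPL-2.1", "MIT", "BSD-3-Clause", "EPL-1.0", "MPL-1.1",
}


@dataclass
class Project:
    name: str
    committers: int
    last_update: date
    has_known_users: bool
    has_backing: bool  # backed by VCs or major companies
    license_id: str


def basic_checks(project: Project, today: date, max_idle_days: int = 180) -> dict:
    """Return each tick-box from the list above with a pass/fail result."""
    idle_days = (today - project.last_update).days
    return {
        "active committers": project.committers > 1 and idle_days <= max_idle_days,
        "demonstrable user base": project.has_known_users,
        "commercial backing": project.has_backing,
        "standard license": project.license_id in STANDARD_LICENSES,
    }


def verdict(checks: dict) -> str:
    """All boxes ticked is a safe bet, none is a poor bet, anything else needs judgment."""
    passed = sum(checks.values())
    if passed == len(checks):
        return "safe bet"
    if passed == 0:
        return "poor bet"
    return "use common sense"


# The abandoned one-committer project described above fails every check.
abandoned = Project("abandoned", 1, date(2003, 3, 1), False, False, "use this as you wish")
print(verdict(basic_checks(abandoned, today=date(2006, 3, 28))))
```

The code deliberately does no scoring or weighting: a project either ticks every box, ticks none, or lands in the middle ground where common sense has to take over.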
Having made a shortlist of tools that are both suitable and viable, the third and final step is to ask yourself a single question - and again it is the same question you should ask when evaluating commercial software. What would we do if this project ended? In other words, is it possible either to maintain the open source software in question in-house, or replace it by another product?
Most enterprises would look for the second option, since the current trend in IT is to divest oneself of responsibilities rather than incur new ones. If you are of this mind, here are the things you should bear in mind when answering the question:
- The replacement product does not have to be open source. If you eventually have to switch to a commercial product, you have simply gained some license-free time from initial use of an open source product, which should be more than enough to cover the cost of the migration.
- The replacement product does not have to work in the same way as the original product, or provide the same functionality - it just has to support the usage you intend to make of the open source product. In a simple case, a replacement for your email server only has to support the configuration you have implemented, not every possible configuration available in the product you currently use. For a more complex case, consider JBoss jBPM, recommended as a free BPM solution in the previous post to this blog. This is packaged as a Java library that can be used standalone in any Java program or with any application server, and that can export suitable processes to BPEL. Though the programming model of JBoss jBPM is very powerful (see the last post), it is based on a standard Petri-net paradigm dating back to the early 1960s. So let's suppose the JBoss project ends for some reason, though this is highly unlikely. Existing jBPM processes can be supported for as long as desired in any other Java execution environment, those based on Web service orchestration can be transferred when desired to any BPEL engine, and it is essentially a matter of legwork to port other processes to another workflow system that is likewise based on Petri nets. If you are using the more advanced features of jPDL, you will have to make the odd change here and there, but in the end this is a completely conventional trade-off between functionality and portability - a trade-off that enterprises make every day when utilizing software products, whether open source or commercial.
TAKE AWAY
Open source tools represent a major source of advantage for the enterprise - not just because they are free of license charges (since it is often wise to pay for support anyway), but because in general the open source marketplace offers a range of functionality that commercial vendors struggle to match, hampered as they are by the need to support a legacy product range and provide a consistent offering. All enterprises are well-advised to consider open source when maintaining any aspect of their IT and software development infrastructure, and the good news is that you do not have to pay consultants through the nose to do so - just follow the 3 steps above, and use your own common sense to make practical decisions about the software you adopt.
Posted by keithhb in
Open Source
March 23, 2006
BPM for free
This is the 3rd entry in a blog series on open source - this time I will introduce another major open source project of which every enterprise should be aware, and next time I will discuss a quick and simple way to evaluate open source projects in general.
This time it is the turn of JBoss jBPM. There are a number of open source workflow/BPM projects, ranging from simple BPEL engines to fully-fledged suites. However, the JBoss offering is particularly interesting, for 3 reasons:
- JBoss is, these days, about as blue-chip as you can get in the open source world. Not only do they offer a complete range of middleware, but they have download rates that must be the envy of other products (typically hundreds of thousands per month) and are venture-capital backed. Implementing JBoss now is like buying IBM used to be.
- jBPM supports BPEL, but is based on a more powerful programming model - what they call Graph Oriented Programming - with its own process definition language, jPDL. This language was designed from the ground up to support the workflow patterns developed by Prof. Wil van der Aalst and his group.
Why is this important? Because the workflow patterns attempt to represent all possible modes of behaviour in a "programmatic" business process - preset workflows in which human involvement is limited to key decision and data entry points. So if your process definition language can represent all the patterns, you know you can automate just about any such business behaviour using it. As far as I know, jBPM is the only workflow system that claims to support all the patterns.
- jBPM is closely tied to the component architecture of the entire JBoss suite. In my recent blog series on BPM Futures, I discussed how process automation is effectively becoming a graphical interface to component tooling. This approach is very much in line with the architecture of jBPM. Web services, for example, can be employed as desired, but there is no need to move program control flow into a weak language such as BPEL unless it is necessary - you can do as much or as little as you want in Java, for instance.
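For readers who have not met the idea of a process defined as a graph, the following sketch shows the general shape: nodes joined by named transitions, with a single token moving from node to node and ordinary code deciding which way to go. It is a toy written for this post, not jBPM's API or jPDL; the `Node` class, the `run` function and the expense-claim process are all hypothetical, and a single token covers only the simplest of the workflow patterns (sequence and exclusive choice), not parallel splits or joins.

```python
class Node:
    """A step in the process graph with named outgoing transitions."""

    def __init__(self, name, action=None):
        self.name = name
        self.action = action      # callable(context) -> transition name, or None
        self.transitions = {}     # transition name -> destination Node

    def connect(self, transition, destination):
        """Add an outgoing transition and return self so calls can be chained."""
        self.transitions[transition] = destination
        return self


def run(start, context):
    """Move a single token through the graph until it reaches a node with no way out."""
    node, path = start, []
    while node is not None:
        path.append(node.name)
        chosen = node.action(context) if node.action else None
        if not node.transitions:
            break  # end node reached
        if chosen is None:
            # With no explicit choice, take the first transition that was defined.
            chosen = next(iter(node.transitions))
        node = node.transitions[chosen]
    return path


# Hypothetical expense-claim process: large claims need a manager's sign-off.
submit = Node("submit")
review = Node("review", lambda ctx: "escalate" if ctx["amount"] > 1000 else "approve")
manager = Node("manager sign-off")
pay = Node("pay")

submit.connect("next", review)
review.connect("approve", pay).connect("escalate", manager)
manager.connect("next", pay)

print(run(submit, {"amount": 250}))
print(run(submit, {"amount": 5000}))
```

A claim for 250 passes through submit, review and pay, while a claim for 5000 is routed through the manager's sign-off first. The point of the sketch is that the control flow lives in the graph while the decisions live in ordinary code, which is the division of labour described above.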
TAKE AWAY
There is a lot of discussion about BPM at the moment - the debate around jBPM is typically lively - and a lot of this debate is still along the lines of: so what is BPM and why do we need it?
As I have commented before in this blog, despite the views presented by analysts and vendors, anyone working at the coalface in a range of companies will know that by far the bulk of enterprise process automation is still done using ERP packages and component technologies rather than using dedicated workflow/BPM tools.
I discussed in previous entries how new approaches may well change that, as "declarative" techniques enable the automated generation and application of certain types of executable process. In future entries, I will be looking in more detail at the implications of this for enterprise architecture. For now, however, anyone seeking an enterprise-strength process automation system would be well-advised to look at jBPM. Not only is it a powerful open source solution, but it fits naturally with emerging techniques and tools (coming with its own Eclipse plugin, for example).
In the next and concluding entry in this blog series on open source, I will discuss how to decide when you should not adopt an open source approach - without needing to spend a fortune on consultancy advice to find out!
Posted by keithhb in
Business Process Management
• Open Source
March 20, 2006
Eclipse and the end of the Microsoft monopoly
This is the second entry in a blog series looking at major open source projects of which every enterprise should be aware. The series will conclude with a discussion of open source adoption issues, providing a quick and simple way to understand when and where to avoid open source.
The subject of this blog entry is the extraordinary software framework known as the Eclipse Platform. Originally developed in-house by IBM, Eclipse was open sourced in 2001 via the creation of a supervisory body known as the Eclipse Foundation, a not-for-profit consortium that now boasts 115 member organizations. IBM still put a lot of money and development time into Eclipse, but for years they have been supported in this effort by some of the largest companies in the world.
So what is Eclipse? In a nutshell, the first software product with the potential to remove Microsoft's monopoly in desktop computing.
IBM set out in the late 90's to create a common software framework to underpin all their Java-based middleware products. Further, the framework had to be extensible via plugin modules. They wanted to establish a standard user interface not only for their own product range, but also for any custom applications or supporting third party tools created to go with products in that range.
To say IBM succeeded would be an understatement. The intellectual effort that went into Eclipse was first-class, with direction from such luminaries as Grady Booch, and taking inspiration from the pattern-based approach to software design that is only now becoming standard practice. There are already many hundreds of plugin modules for Eclipse (open source and otherwise), and the framework is becoming the de facto standard software development environment, not only in the Java world but for other languages too - especially as Eclipse runs on most modern operating systems, in each case with a native look-and-feel. The writing is probably on the wall for competing, non-specialized "IDEs" for software development - even those few (like IntelliJ IDEA) that have managed to retain a loyal user base will inevitably find it eroded as the massive vendor support behind Eclipse and the growing number of plugins make it harder and harder to compete.
So why does this threaten Microsoft? Because of something called the Rich Client Platform.
Eclipse was originally developed as a tools framework - you were supposed to use it to create applications for use by IT folk, for example to write software or configure application servers. It took a surprisingly long time to realize that there was no reason to restrict Eclipse in this way. You can use Eclipse to write any kind of application! And doing so makes a lot of sense. Eclipse has a very well-architected plugin model that means you can leverage the framework to get off the ground very quickly with a new application, simply by building it as an Eclipse plugin. And once you have done so, your own application is immediately compatible with all the hundreds of other Eclipse plugins out there. So with release 3, Eclipse acquired a set of standard features (known as the Rich Client Platform) that make it easy to create any kind of standalone application in this way.
Now let's look back at the history of Microsoft. DOS may have been the company's original means of establishing dominance, but in the last 10 years it has retained it because of something else: applications. In particular, the ubiquity of the Office suite has meant that organizations standardized on Windows in order that their documents would be compatible with those from other sources. But that is only part of the story. For a long time now, it has been almost inconceivable for a software vendor to release a desktop product that did not run on Windows. Hence, by using Windows you were guaranteed that all the software you might want at some time to use would in fact be available to you.
Companies like Google realize this very well, which is why they snapped up the Web-based word processor Writely only shortly after its launch. The more people that use a Web browser as an application platform, the less the importance of choosing Windows for your desktop operating system as opposed to (say) Linux. But whatever the "Web 2.0" evangelists may say, the Web browser is never going to be fully-capable, or even durable, as a platform for building applications. The software development model (AJAX) is:
- Built on sand - JavaScript is a scripting language that was never designed to take the strain of heavyweight software.
- Badly architected - not only is AJAX a kludge on top of HTTP right from the start, but it tends to lead to poor design, breaking the Model 2 MVC practices that were finally becoming mainstream. This is changing as better practices start to appear for AJAX, but there is no layering implicit in AJAX itself, so there will always be the temptation to write poor code.
- Server- and browser-dependent - you cannot run an AJAX application without connecting to the server in question using an appropriately capable browser. What about applications that need to run offline and with more choice of device?
AJAX is unlikely ever to seriously challenge Windows as an application platform - but Eclipse may. Unlike AJAX, Eclipse is founded on an enterprise strength language (Java), based on a soundly architected plugin model, and independent of almost any network and device restrictions. As more and more software vendors port legacy products to Eclipse, and release new products in the form of Eclipse plugins, the importance of choosing Windows for your desktop operating system can only erode, a trend which will be accelerated by the recent emergence of Eclipse sub-projects that provide system-level features such as single sign-on (Higgins) and communications (ECF). The tipping point will probably come when someone releases word processing and spreadsheet Eclipse plugins that can (like Writely) use the emerging open standard for document format as well as the document formats for Microsoft Office.
TAKE AWAY
If you are considering development or maintenance of any desktop software application, whether for in-house use or supply to others, look hard at Eclipse before deciding on a framework. It may well be sensible to build the application as, or port it to, a set of Eclipse plugins. Just so that you know I practice what I preach, we are building our own next generation toolset for collaborative work using Eclipse - and have found it to be very productive, especially the embedded facilities for generating both business logic and diagramming code.
In fact, the same consideration should be given even to server-side applications. Eclipse can run "headless" - without an interactive user interface - and by developing your application as Eclipse plugins you gain access to all the rich functionality of the framework and its existing plugins (many of which are open source). For some reason, the Eclipse Foundation to date have not highlighted this capability of Eclipse, but no doubt they will realize in due course what an opportunity they have been missing and cater to it via a "Rich Server Platform" feature.
Finally (and this is where we came in), the rise of Eclipse has serious implications for any enterprise considering future desktop strategy. There may be fewer obstacles than you thought to replacing your desktop operating systems with (for example) a Linux variant. Look at what is available now in Eclipse, what is coming, and consider again how much money you will really need to spend on Microsoft licenses in the next few years. Such considerations have a direct impact not only on the desktop machines and licenses you purchase, but on your development strategy - the need for .NET capability along with J2EE may be less than you thought, for example.
Tune in again to this blog for a discussion of another major open source project with the potential to change the enterprise computing landscape. In the next entry I will pick up again on the discussion in the last blog series, BPM Futures, and show how open source tools can be used to implement a forward-thinking BPM strategy.
Posted by keithhb in Open Source • Operating Systems
March 15, 2006
Make the most of your intranet and extranet
In the next few blog entries I will be discussing open source projects that have reached a certain stage – that of being at least as mature as their commercial competitors. Many organizations are still reluctant to use open source software for mission-critical applications – and in some cases there are good reasons to be suspicious. I will conclude this series of blog entries with a discussion of the potential hazards of adopting certain types of open source software in an enterprise infrastructure.
However, these hazards are not those typically voiced as objections by IT management – unstable or insecure software, availability of support, and legal issues. The open source projects I will be discussing in the next few entries are perfectly viable from all these perspectives. In general, major open source software applications are written at least as well as leading commercial products (often by the same people), enthusiastically supported by expert and helpful developers (as opposed to knowledge-free call center staff), and transparently licensed (via industry-standard agreements).
Similarly, the advantages of open source are not those typically quoted, either. For example, open source evangelists, being technical folk, make a big deal out of being able to correct or enhance the code yourself if you need to. This is exactly the opposite of what an enterprise is looking for - the very last thing they want is the headache of untangling some huge and complex application when there is a problem. The real benefit of open source is that it is generally more attuned to real-world needs, since open source projects get started precisely in order to meet such needs. While software vendors are grappling with legacy products and market positioning, the open source community just goes off and builds useful stuff.
Stay tuned to this blog for a discussion of the true hazards of open source as regards enterprise adoption, which are to do with the type of project. For now, I will be looking at projects that are not only free from such hazards, but that offer significant advantages over their commercial rivals.
First up is an offering in a space of increasing importance – information retrieval. The success of Google is a measure of how valuable this technique has become in modern life, yet despite the continuing advance of Web search, the facilities that most organizations offer for searching their intranet, or even their public-facing Web site, are little more than pitiful. The ability to retrieve a particular document from the sprawling Web presence (internal or external) of a large company is often more a matter of art than science, and may well depend on knowing in advance the likely path to the data concerned.
This is, of course, a well-known problem, to which knowledge management techniques are often touted as the answer. Indeed, such advanced solutions can be employed to unlock the information archives of a company – but you do not always need the sort of high-end, and very expensive, tools sold for this purpose. For one thing, there are alternative approaches to knowledge management that can be leveraged to expose the knowledge hidden inside information, some of which I discussed in previous blog entries, and I will be returning to this topic in future posts. However, there are also far simpler ways by which the hundreds of thousands of documents available via HTTP on a large company’s servers can be made available – and thus turned into an asset rather than a liability.
I was struck recently by the marketing puff for a commercial search engine, which offers as a “step forward” the ability not only to search structured data (requesting specific values for specific fields) but also to provide flexibility via assigning weights to the different terms in a search. Surprisingly, whoever wrote this PR spiel – and possibly the vendor itself – does not seem to be aware that such “advanced” features have been available for years in the leading open source search engine.
The search engine in question is Lucene, an Apache project. Lucene is not only very well-established but can do some very cool things. For example, Lucene can search via named fields - a feature that in itself offers a step on the way to full knowledge management – and offers wildcard, fuzzy, proximity and range searches. Terms in any of these types of search can be boosted, grouped, and controlled via the use of logical operators.
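To give a flavour of these query types, here is a minimal sketch in Python that simply assembles a query string in the classic Lucene query syntax. Nothing here calls Lucene itself. The helper functions are my own invention for illustration, and the field names (title, body, modified) are assumptions about how a document collection might have been indexed.

```python
# Illustrative only: these helpers build strings in the classic Lucene
# query syntax. They are hypothetical and do not call Lucene itself.


def field(name: str, clause: str) -> str:
    """Restrict a term or phrase to a named field, e.g. title:nutch."""
    return f"{name}:{clause}"


def fuzzy(term: str) -> str:
    """Fuzzy search: match terms with a similar spelling."""
    return f"{term}~"


def proximity(text: str, distance: int) -> str:
    """Proximity search: the words must occur within `distance` words."""
    return f'"{text}"~{distance}'


def inclusive_range(name: str, low: str, high: str) -> str:
    """Range search on a field, including both end points."""
    return f"{name}:[{low} TO {high}]"


def boost(clause: str, factor: int) -> str:
    """Boost a clause so that matches on it count for more."""
    return f"{clause}^{factor}"


def group(*clauses: str, operator: str = "OR") -> str:
    """Group clauses in parentheses, joined by a logical operator."""
    return "(" + f" {operator} ".join(clauses) + ")"


# Hypothetical field names: title, body, modified.
clauses = [
    boost(field("title", "nutch"), 2),                    # boosted field term
    group(field("body", "crawl*"),                        # wildcard
          field("body", fuzzy("indexing"))),              # fuzzy
    field("body", proximity("open source", 5)),           # proximity
    inclusive_range("modified", "20050101", "20051231"),  # range
]
query = " AND ".join(clauses)
print(query)
# title:nutch^2 AND (body:crawl* OR body:indexing~) AND body:"open source"~5 AND modified:[20050101 TO 20051231]
```

In a real application, a string like this would be handed to Lucene's query parser rather than built by hand. The point is only that field restriction, wildcards, fuzziness, proximity, ranges, boosts and grouping can all be combined in a single query.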
For a proper explanation of these features, see the links referenced above. The point is that Lucene is very full-featured. However, Lucene is not a search application – it is simply a search engine, a code library that can be used to interrogate any body of text. To use Lucene as the search tool on a Web site, for example, it must be embedded into a product designed for the purpose.
Fortunately, since June 2005 there has been a sub-project of Lucene aimed at doing precisely this. The Web crawler and search facility that incorporates Lucene is Nutch. Nutch operates somewhat like Google - in fact Nutch incorporates some technology that Google itself put into the public domain, technology that permits Nutch to crawl, index and search enormously large collections of documents. Moreover, the user interface of the Nutch search facility is, if anything, more helpful than that provided by Google, providing an analysis of how the page ranking was generated along with the results themselves.
If I were a search engine vendor, I would be worried now that Nutch is getting off the ground properly. I have used Nutch in anger, and cannot see any reason to buy commercial software for this purpose now. Simple to install, robust, scalable, configurable - what more could one want?
TAKE AWAY
If you work for a large organization, ask yourself whether its public Web site and intranet provide genuinely acceptable search facilities. If not, consider implementing Nutch. It takes – literally – minutes to do the initial installation, and configuration even for a large document base is not complex. And Nutch, like Lucene, comes from the Apache Software Foundation, one of the most (if not the most) well-established and reliable homes for open source projects.
Tune in next time for a discussion of other open source software with the potential to transform your enterprise IT environment for only a small investment of time and effort.
Posted by keithhb in Internet • Knowledge Management • Open Source