March 15, 2006
Make the most of your intranet and extranet
In the next few blog entries I will be discussing open source projects that have reached a certain stage – that of being at least as mature as their commercial competitors. Many organizations are still reluctant to use open source software for mission-critical applications – and in some cases there are good reasons to be suspicious. I will conclude this series of blog entries with a discussion of the potential hazards of adopting certain types of open source software in an enterprise infrastructure.
However, these hazards are not those typically voiced as objections by IT management – unstable or insecure software, availability of support, and legal issues. The open source projects I will be discussing in the next few entries are perfectly viable from all these perspectives. In general, major open source software applications are written at least as well as leading commercial products (often by the same people), enthusiastically supported by expert and helpful developers (as opposed to knowledge-free call center staff), and transparently licensed (via industry-standard agreements).
Similarly, the advantages of open source are not those typically quoted, either. For example, open source evangelists, being technical folk, make a big deal out of being able to correct or enhance the code yourself if you need to. This is exactly the opposite of what an enterprise is looking for - the very last thing they want is the headache of untangling some huge and complex application when there is a problem. The real benefit of open source is that it is generally more attuned to real-world needs, since open source projects get started precisely in order to meet such needs. While software vendors are grappling with legacy products and market positioning, the open source community just goes off and builds useful stuff.
Stay tuned to this blog for a discussion of the true hazards of adopting open source in the enterprise, which have to do with the type of project. For now, I will be looking at projects that are not only free from such hazards, but that offer significant advantages over their commercial rivals.
First up is an offering in a space of increasing importance – information retrieval. The success of Google is a measure of how valuable this technique has become in modern life, yet despite the continuing advance of Web search, the facilities that most organizations offer for searching their intranet, or even their public-facing Web site, are often pitiful. The ability to retrieve a particular document from the sprawling Web presence (internal or external) of a large company is often more a matter of art than science, and may well depend on knowing in advance the likely path to the data concerned.
This is, of course, a well-known problem, to which knowledge management techniques are often touted as the answer. Indeed, such advanced solutions can be employed to unlock the information archives of a company – but you do not always need the sort of high-end, and very expensive, tools sold for this purpose. For one thing, there are alternative approaches to knowledge management that can be leveraged to expose the knowledge hidden inside information, some of which I discussed in previous blog entries, and I will be returning to this topic in future posts. However, there are also far simpler ways by which the hundreds of thousands of documents available via HTTP on a large company’s servers can be made available – and thus turned into an asset rather than a liability.
I was struck recently by the marketing puff for a commercial search engine, which offers as a “step forward” the ability not only to search structured data (requesting specific values for specific fields) but also to provide flexibility via assigning weights to the different terms in a search. Surprisingly, whoever wrote this PR spiel – and possibly the vendor itself – does not seem to be aware that such “advanced” features have been available for years in the leading open source search engine.
The search engine in question is Lucene, an Apache project. Lucene is not only very well-established but can do some very cool things. For example, Lucene can search via named fields - a feature that in itself offers a step on the way to full knowledge management – and offers wildcard, fuzzy, proximity and range searches. Terms in any of these types of search can be boosted, grouped, and controlled via the use of logical operators.
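To make the idea of fielded search with boosted terms concrete, here is a minimal sketch in Python. It is not Lucene and does not use Lucene's scoring formula; it is a toy inverted index, with made-up sample documents and hypothetical function names (`build_index`, `search`), that scores each document as the sum of term frequency times boost. In Lucene's own query syntax, the query used in the usage note below would be written along the lines of `title:lucene^2 body:lucene`.

```python
from collections import defaultdict

# Hypothetical sample documents, each with named fields.
DOCS = {
    "doc1": {"title": "Lucene in action", "body": "indexing and search with lucene"},
    "doc2": {"title": "Intranet search", "body": "why intranet search is often poor"},
    "doc3": {"title": "Web crawling", "body": "nutch uses lucene to index crawled pages"},
}


def build_index(docs):
    """Build a toy inverted index: (field, term) -> {doc_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)][doc_id] += 1
    return index


def search(index, clauses):
    """Score documents against a list of (field, term, boost) clauses.

    The score is simply the sum of term frequency * boost over all
    matching clauses. This is an illustrative assumption, not the
    formula Lucene uses.
    """
    scores = defaultdict(float)
    for field, term, boost in clauses:
        # .get() avoids creating empty entries for unknown terms.
        for doc_id, tf in index.get((field, term.lower()), {}).items():
            scores[doc_id] += tf * boost
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `search(build_index(DOCS), [("title", "lucene", 2.0), ("body", "lucene", 1.0)])` ranks `doc1` above `doc3`, because a match in the title counts double, and leaves `doc2` out altogether. A real engine adds wildcard, fuzzy, proximity and range matching on top of this basic structure.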
For a proper explanation of these features, see the Lucene documentation. The point is that Lucene is very full-featured. However, Lucene is not a search application – it is simply a search engine, a code library that can be used to interrogate any body of text. To use Lucene as the search tool on a Web site, for example, it must be embedded into a product designed for the purpose.
Fortunately, since June 2005 there has been a subproject of Lucene aimed at doing precisely this. The Web crawler and search facility that incorporates Lucene is Nutch. Nutch operates somewhat like Google - in fact Nutch incorporates implementations of techniques that Google itself described in published papers, which allow Nutch to crawl, index and search enormously large collections of documents. Moreover, the user interface of the Nutch search facility is, if anything, more helpful than that provided by Google, since it offers an analysis of how the page ranking was generated along with the results themselves.
If I were a search engine vendor, I would be worried now that Nutch is getting off the ground properly. I have used Nutch in anger, and cannot see any reason to buy commercial software for this purpose now. Simple to install, robust, scalable, configurable - what more could one want?
TAKE AWAY
If you work for a large organization, ask yourself whether its public Web site and intranet provide genuinely acceptable search facilities. If not, consider implementing Nutch. It takes – literally – minutes to do the initial installation, and configuration even for a large document base is not complex. And Nutch, like Lucene, comes from the Apache Software Foundation, one of the best-established and most reliable homes for open source projects.
Tune in next time for a discussion of other open source software with the potential to transform your enterprise IT environment for only a small investment of time and effort.
Posted by keithhb in Internet • Knowledge Management • Open Source
Very good site, congratulations!
Posted by: cashmere at April 13, 2006 08:51 PM
Hi Nice information about open source search
Best regards,
Manisekaran
Posted by: Manisekaran at March 27, 2007 05:43 AM
