Robert Muir is a software engineer for Lucid Imagination and a Lucene/Solr committer & PMC member.

Apache Lucene and Solr 3.6 Release! New Language Analysis, Joins, and Finite-State APIs

04.13.2012

Lucene / Solr 3.6 has been released and is available for download.

As release manager, here’s my take on the new features:

  • Language analysis:
    • Newly added morphological analysis and part-of-speech tagger for the Japanese language, geared for search, contributed by Christian Moen.
    • CJK analysis improvements inspired by the folks at HathiTrust, who are indexing terabytes of text across hundreds of languages with Apache Solr. I encourage you to investigate their blog if you are interested in reading about large-scale search challenges.
    • Lucene/Solr analysis for many languages was tuned and simplified. For instance, to get started with the new Japanese capabilities described above, simply use the text_ja field type defined in Solr’s example schema.xml. We configured field types like this for 30 languages out of the box.
  • Joins:
    • Ability to do index-time block-joins in the opposite direction, useful when you have indexed a parent-child relationship already but sometimes want ungrouped child documents as the result.
    • Addition of query-time joins: an alternative when index-time joins are not feasible.
    • Important bugfixes to index-time joins.
  • Auto-suggest and finite-state APIs:
    • New Weighted FST suggester that offers more fine-grained ranking for suggestions.
    • FST APIs were extended to support reverse-lookups for monotonically increasing outputs, and support n-shortest-path algorithms by weight.
    • Improved suggester API that exploits our incremental automata construction to build suggester FSTs from huge amounts of data.
    • FST compression support, based on research by Lucene/Solr committer Dawid Weiss.
    • Additions to Apache Solr for easier integration of phrase-based auto-suggest, e.g. for phrases previously recorded in query logs.
  • Miscellaneous:
    • A new index pruning module with configurable policies supports faster and smaller indexes that give similar relevance to a complete index.
    • A new phonetic analysis module for sounds-like search, supporting a range of algorithms and languages from the Apache Commons Codec project.
    • Performance improvements for index splitter tools.
  • Solr improvements:
    • Better defaults and configuration for multi-term queries. Queries such as wildcard queries have better interaction with the analysis chain, especially regarding case- or accent-insensitivity.
    • Distributed date and number range-faceting support.
    • Improved concurrency control for distributed search.
    • SolrJ support for latest HttpComponents release.
    • Clustering improvements: new support for clustering multilingual search results and for clustering on multiple fields.
    • Upgraded Tika integration to 1.0, with improved RTF, Word, and PDF parsing support.
  • Highlighting improvements:
    • A new HTMLStripCharFilter implementation, faster and more reliable for matching result snippets to the underlying raw HTML.
    • Performance improvements for FastVectorHighlighter.
    • Bugfixes to many analysis components that would cause corner-case highlighting bugs.
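
To give a rough feel for what "ranking suggestions by weight" means in the auto-suggest items above, here is a minimal plain-Java sketch. It is only an illustration under stated assumptions: a sorted map stands in for the FST, and the class name WeightedSuggestSketch, the suggest method, and the sample terms and weights are all made up for this example. It does not use or mirror Lucene's actual suggester API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: a sorted map stands in for the FST.
// This is NOT Lucene's suggester API; all names here are hypothetical.
class WeightedSuggestSketch {

    // Returns up to k keys that start with the prefix, highest weight first.
    static List<String> suggest(NavigableMap<String, Long> weights, String prefix, int k) {
        List<Map.Entry<String, Long>> matches = new ArrayList<>();
        // tailMap starts at the first key >= prefix; keys sharing the
        // prefix are contiguous from there in sorted order.
        for (Map.Entry<String, Long> e : weights.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            matches.add(e);
        }
        // Rank by descending weight; break ties alphabetically.
        matches.sort((a, b) -> {
            int byWeight = Long.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, matches.size()); i++) {
            top.add(matches.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up weights, e.g. how often each term appeared in a query log.
        NavigableMap<String, Long> weights = new TreeMap<>();
        weights.put("lucene", 90L);
        weights.put("solr", 50L);
        weights.put("solrcloud", 80L);
        weights.put("solrj", 20L);
        System.out.println(suggest(weights, "solr", 2)); // prints [solrcloud, solr]
    }
}
```

A real FST-based suggester avoids collecting and sorting every match; the sketch only shows the input/output behavior: a prefix goes in, the highest-weighted completions come out.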



If you want to hear more about these features, many of the committers who worked on them will be giving talks at Lucene Revolution in Boston, including:

  • Mark Miller will explain the SolrCloud architecture for distributed indexing.
  • Grant Ingersoll will tie together Solr, Hadoop, and Mahout.
  • Martijn van Groningen will be giving a talk about grouping and join features.
  • Erick Erickson will talk about SolrCloud from the user perspective.
  • Christian Moen will be giving a talk introducing Lucene/Solr’s new Japanese language capabilities.
  • Andrzej Bialecki will share adventures into Lucene 4.0's codec APIs, including updateable fields.
  • Simon Willnauer will discuss some of the challenges of implementing high-performance search in Java.
  • Uwe Schindler will be talking about refactoring of the upcoming Lucene 4.0 IndexReader API.
  • Mike McCandless and I will discuss current and future improvements related to finite-state technology.
  • Chris Hostetter will play the chump; please try to stump him with your questions!


Hope to see you there!

Published at DZone with permission of its author, Robert Muir. (source)
