Search Settings

Index Structure

The structure of the Solr index is configured in the following file:

$SOLR_HOME/{core}/conf/schema.xml

This controls how the text in different fields is handled, both during indexing and search.

Field Types

Four text-oriented fieldTypes are currently in active use:

string - this field is indexed and stored verbatim.

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

text_ws - this is a text field that only splits on whitespace, for exact matching of words.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

text_general - this is a general text field with generic cross-language defaults. It tokenizes, removes stopwords and converts to lowercase; at query time it also expands synonyms.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

text_en - this is a text field designed for English free text. It tokenizes, removes stopwords, converts to lowercase and applies English language stemming.

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory" />
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
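
The field listing further down also refers to double and date types, whose definitions are not reproduced on this page. A minimal sketch of what they might look like, assuming a Solr 4.x schema and the Trie-based classes from the stock example schema (the definitions actually used in this installation may differ):

```xml
<!-- Assumed definitions, modelled on the Solr 4.x example schema; not taken from this installation -->
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
```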

Fields

Each field can be set to a particular type, which determines how the field can be searched.

Any field that is not explicitly defined in schema.xml is given the text_en field type by a catch-all dynamicField:

    <dynamicField name="*" type="text_en" indexed="true" stored="true" multiValued="true"/>

rels.* and fedora.* fields also have dynamicField configurations, setting them to string fields, since these fields should be preserved as-is.

   <dynamicField name="rels.*" type="string" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="fedora.*" type="string" indexed="true" stored="true" multiValued="false"/>
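
To illustrate how these patterns resolve, here is a sketch of a Solr update message. The PID, the field values and the names fedora.state and dc.title are hypothetical and chosen only for illustration. Where a name matches more than one pattern, Solr is expected to prefer the longer, more specific pattern over the catch-all.

```xml
<add>
  <doc>
    <!-- Explicitly defined field: uses its own definition (string) -->
    <field name="PID">demo:1</field>
    <!-- Matches rels.* : stored verbatim as a string -->
    <field name="rels.isMemberOfCollection">info:fedora/demo:collection</field>
    <!-- Matches fedora.* : single-valued string (hypothetical field name) -->
    <field name="fedora.state">Active</field>
    <!-- Matches only the catch-all * : analyzed as text_en (hypothetical field name) -->
    <field name="dc.title">Happy families are all alike</field>
  </doc>
</add>
```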

The full field configuration is listed below. In general, fields with controlled vocabularies and fields that list proper names are kept as string, and their searchable counterparts (mods.identifierQuery, mods.nameQuery, mods.topicQuery and all.terms) use text_general to avoid false positives due to stemming. Free text fields such as mods.title and mods.abstract, along with all.text, use text_en, which applies a more aggressive stemming policy.

<fields>
    <field name="PID" type="string" indexed="true" stored="true" required="true" /> 
 
    <!-- MODS Fields -->
    <field name="mods.identifier"   type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.title"        type="text_en"  indexed="true"  stored="true" multiValued="true" termVectors="true"/>
    <field name="mods.abstract"     type="text_en"  indexed="true"  stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <field name="mods.subtitle"     type="text_en"  indexed="true"  stored="true" multiValued="true" termVectors="true"/>
    <field name="mods.note"         type="text_en"  indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.extent"       type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.typeOfResource" type="string" indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.name"         type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.dateIssued"   type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.dateOther"    type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.dateCreated"  type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.country"      type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.place"        type="string"   indexed="true"  stored="true" multiValued="true" termVectors="true"/>
    <field name="mods.issuance"     type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.topic"        type="string"   indexed="true"  stored="true" multiValued="true" termVectors="true"/>
    <field name="mods.continent"    type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.form"         type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.city"         type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.genre"        type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.geographic"   type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="mods.publisher"    type="string"   indexed="true"  stored="true" multiValued="true"/>
    <field name="mods.physicalLocation" type="string" indexed="true" stored="true" multiValued="true"/>
 
    <field name="mods.identifierQuery" type="text_general" indexed="true" stored="false" multiValued="true" termVectors="true"/>
    <field name="mods.nameQuery" type="text_general" indexed="true" stored="false" multiValued="true" termVectors="true"/>
    <field name="mods.topicQuery" type="text_general" indexed="true" stored="false" multiValued="true" termVectors="true"/>
 
    <dynamicField name="mods.*"     type="string"   indexed="true"  stored="true" multiValued="true"/>
 
    <!-- VRA Core Fields -->
    <field name="vra.measurements" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
    <field name="vra.material" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
    <field name="vra.location" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
    <field name="vra.technique" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
    <field name="vra.inscription" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
    <field name="vra.stylePeriod" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
    <dynamicField name="vra.*" type="string" indexed="true" stored="true" multiValued="true"/>
 
    <!-- Access Control Fields -->
    <field name="access.user"   type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="access.group"  type="string" indexed="true" stored="false" multiValued="true"/>
 
    <field name="ds.obj.text"  type="text_general" indexed="false" stored="false" multiValued="true"/>
    <!-- These ds.*Date fields are used for filtering queries by date range. Because Solr
            doesn't appear to support BCE, i.e. negative, dates, we are using 'double' fields.
            These fields, furthermore, are never exposed to users, and therefore only need to
            preserve a consistent sort order. Plus, the input data from MODS is not typically
            in YYYY-MM-DDTHH:MM:SSZ format; but rather, in YYYY format. -->
    <field name="ds.startDate" type="double" indexed="true" stored="false" multiValued="true"/>
    <field name="ds.endDate"   type="double" indexed="true" stored="false" multiValued="true"/>
 
    <field name="all.text"  type="text_en"        indexed="true"  stored="false" multiValued="true"/>
    <field name="all.terms" type="text_general"   indexed="true"  stored="false" multiValued="true"/>
 
   <field name="fedora.createdDate"      type="date" indexed="true" stored="true" multiValued="false"/>
   <field name="fedora.lastModifiedDate" type="date" indexed="true" stored="true" multiValued="false"/>
 
   <dynamicField name="rels.*" type="string" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="fedora.*" type="string" indexed="true" stored="true" multiValued="false"/>
   <dynamicField name="*" type="text_en" indexed="true" stored="true" multiValued="true"/>
</fields>
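
This page does not say how the stored="false" fields mods.identifierQuery, mods.nameQuery and mods.topicQuery are populated. One possibility, shown here purely as an assumption, is a set of copyField directives that feed each verbatim string field into its text_general counterpart, so that the same value can be displayed as-is and searched as tokenized text:

```xml
<!-- Assumption: the ...Query fields are filled by copyField.
     They could equally be written directly by the indexing transformation. -->
<copyField source="mods.identifier" dest="mods.identifierQuery"/>
<copyField source="mods.name" dest="mods.nameQuery"/>
<copyField source="mods.topic" dest="mods.topicQuery"/>
```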

Note that, while Solr supports date fields, I have not had success using these field types in the context of Fedora objects.

Other settings

The schema.xml file defines a number of additional settings. This includes identifying a unique key for each document:

<uniqueKey>PID</uniqueKey>

It also defines a default search field:

<defaultSearchField>all.text</defaultSearchField>

Note that relevant fields are copied into all.text and all.terms like so:

<copyField source="mods.topic" dest="all.text"/>
<copyField source="mods.topic" dest="all.terms"/>

The all.terms field is available for implementing an auto-complete feature, though no such feature exists at present.

The schema also sets a default query operator:

<solrQueryParser defaultOperator="OR"/>

Finally, it names the class used to compute similarity scores:

<similarity class="org.apache.lucene.search.DefaultSimilarity"/>

Solr Operation

The main operation of Solr is defined in

$SOLR_HOME/{core}/conf/solrconfig.xml

For the most part, this file has been left unchanged. Below are some of the configuration directives that are particular to this setup.

Search Handlers

Request Handlers are used to process search queries. The one below has been defined specifically for Islandora and uses the Extended DisMax (edismax) query parser:

    <requestHandler name="/search" class="solr.SearchHandler" default="true">
        <lst name="defaults">
            <str name="echoParams">explicit</str>
            <int name="rows">10</int>
            <str name="defType">edismax</str>
            <str name="qf">mods.nameQuery^5 mods.topicQuery^2 mods.abstract^2 mods.title^2 mods.identifierQuery^2 all.text^0.5</str>
            <str name="fl">
                PID, mods.title, mods.abstract, mods.name, 
                mods.dateCreated, rels.isMemberOfCollection
            </str>
            <str name="q.alt">PID:0</str>
            <str name="facet.mincount">2</str>
        </lst>
    </requestHandler>

Some of the parameters above, and others that could be added, work as follows:

q.alt identifies an alternate query to run if no value is passed to the q param. This is set to PID:0 so that an empty query returns no results. Others set this to *:* in order to return all documents. I thought this seemed imprudent.

The mm (minimum should match) parameter sets the minimum number of optional query clauses that must match. This is described further here.

The pf parameter can be used to “boost” the score of documents where the terms in the q param appear in close proximity.

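The ps and qs parameters set the amount of “slop” (their word) allowed in phrase matching, that is, how far apart the terms of a phrase may be. The ps parameter applies to the phrase boost built from the pf fields: for instance, with pf set, q=happy families alike&ps=10 boosts documents in which “happy”, “families” and “alike” appear within roughly 10 positions of each other. The qs parameter applies instead to phrases that the user explicitly wraps in double quotes in the q param, and so affects which documents match rather than how they are boosted.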
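
As a sketch, the mm, pf, ps and qs parameters could be added to the defaults of the /search handler as shown below. All of the values and boosts are hypothetical and would need tuning against real queries; none of them are part of the current configuration.

```xml
<!-- Hypothetical additions; these entries would go inside the existing <lst name="defaults"> -->
<lst name="defaults">
    <!-- Require at least 75% of the optional query clauses to match -->
    <str name="mm">75%</str>
    <!-- Boost documents where the query terms occur close together in these fields -->
    <str name="pf">mods.title^3 mods.abstract^2</str>
    <!-- Slop for the pf phrase boost -->
    <int name="ps">10</int>
    <!-- Slop for phrases the user puts in double quotes -->
    <int name="qs">2</int>
</lst>
```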

Faceted search defaults could also be defined here, though they are already added by Islandora to the query.
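
If facet defaults were ever moved into the handler, they might look like the fragment below. This is only an illustration: the choice of mods.topic and mods.genre as facet fields is an assumption, and Islandora may override these values with the parameters it sends.

```xml
<!-- Hypothetical facet defaults; these entries would go inside the existing <lst name="defaults"> -->
<lst name="defaults">
    <str name="facet">true</str>
    <str name="facet.field">mods.topic</str>
    <str name="facet.field">mods.genre</str>
    <str name="facet.limit">20</str>
</lst>
```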

Update Handlers

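An autoCommit section has been added to the update handler so that pending updates are committed automatically once 10,000 documents have accumulated or 5 seconds have passed since the first uncommitted update, whichever comes first: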

<updateHandler class="solr.DirectUpdateHandler2">
 <autoCommit>
   <maxDocs>10000</maxDocs>
   <maxTime>5000</maxTime> <!-- 5 seconds -->
 </autoCommit>
</updateHandler>

Indexing FOXML documents

We are using our application messaging system to transform Fedora objects into a SolrDoc via various XSL transformations.
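
The transformations themselves are not reproduced on this page. As a rough illustration of the approach, the following minimal XSLT 1.0 sketch maps a FOXML object with an inline MODS datastream to a Solr update message. It assumes the standard FOXML and MODS v3 namespaces and covers only three of the fields defined in schema.xml; the real stylesheets are likely to be considerably more involved.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#"
    xmlns:mods="http://www.loc.gov/mods/v3"
    exclude-result-prefixes="foxml mods">

  <xsl:output method="xml" indent="yes"/>

  <!-- One Solr document per Fedora object -->
  <xsl:template match="/foxml:digitalObject">
    <add>
      <doc>
        <!-- The object's PID becomes the uniqueKey -->
        <field name="PID"><xsl:value-of select="@PID"/></field>

        <!-- Each MODS title becomes a value of the multiValued mods.title field -->
        <xsl:for-each select=".//mods:mods/mods:titleInfo/mods:title">
          <field name="mods.title"><xsl:value-of select="normalize-space(.)"/></field>
        </xsl:for-each>

        <!-- Each subject topic becomes a value of mods.topic -->
        <xsl:for-each select=".//mods:mods/mods:subject/mods:topic">
          <field name="mods.topic"><xsl:value-of select="normalize-space(.)"/></field>
        </xsl:for-each>
      </doc>
    </add>
  </xsl:template>

  <!-- Suppress the default copying of text nodes -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```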

search.txt · Last modified: 2014/09/22 14:02 by acoburn