The Problem
I maintain a proprietary Drupal distribution that was originally built on Drupal 7 but has recently been rebuilt from scratch on Drupal 8 (and will soon be ported to Drupal 9). As part of building the new platform, one of our stakeholders asked for a feature that we didn’t have in the old version – better handling of searches for date values.
In our platform, we have a field called “Asset date”. This field is stored as a
text field – rather than a traditional Drupal date – because we support
variable precision. That is to say, the “Asset date” could be 1995
, 1995-05
,
or 1995-05-04
, depending upon what data is available about the item that’s
being saved to the database. To make things consistent, we enforce that the
value in this field must be in ISO 8601 format. In other words, it must be in
the format YYYY-MM-DD
, YYYY-MM
, or YYYY
– no other date formats are
accepted.
The unfortunate result of this scheme was that the only way someone could locate
records in our system by its date was to search for the exact date as it
appeared in the record (e.g. 1995-05-04
). The stakeholder wanted more
flexibility – he wanted the ability to search for 1995
or "May 1995"
and
have it return results that started with 1995-05
. In the old system, the way
we used to address requests like this was to provide a date range filter on the
search page, but we felt it might be possible to do better this time around.
Ultimately, we knew that what we needed was for the search index to contain all
the different permutations of the date value. So, for a date like 1995-05-04
,
we wanted the index to contain 1995-05
and 1995
as alternate matches. Then,
to handle searches like May 1995
we wanted to have all the variations on date
formats indexed as well. A naive approach to accomplish this would be for our
team to just specify all the different variations in another indexed field –
like descriptions or notes – but that’s a lot of unnecessary labor for a very
specific search use case; not to mention the fact that such a convention is easy
to forget when adding several records.
The Solution
How could we get Search API to do this for us? Enter Search API “processors”. Processors allow you to apply transformations to the way that data is indexed and queried on your site. So, for our use case, we figured we could write one or more processors that alter each date at indexing time to add all the different permutations.
The Processors
In the end, we actually created two processors. One converts ISO 8601 dates into
their various alternative/synonymous representations according to US and
European standards (e.g. dates like May 1995
, 05/1995
, and 05/04/1995
).
The other converts ISO 8601 dates into their various ISO 8601 permutations
(1995-05-04
becomes 1995-05-04
, 1995-05
, and 1995
).
Below is the code for each processor. Since we leverage Apache Solr provided by
Pantheon, we’re unfortunately limited to Solr v3.6 which means we’re using the
soon-to-be-EOL Search API Solr version 8.x-1.x
. If you are using a newer
version of Search API Solr, your mileage may vary, but hopefully the interface
for processors should still be the same on the Search API side (we wrote these
while using Search API 8.x-1.17
).
Processor #1 - ISO 8601 Date Synonymizer
This processor converts the date format our team enters into the “Asset date” field into their more common, colloquial formats used in the United States and Europe.
The code for this processor can be found below. If you want to use this code,
you’ll want to
create a new custom module
and then define the processor in a file called
src/Plugin/search_api/processor/Iso8601DateSynonymizer.php
inside that module.
<?php
namespace Drupal\YOUR_MODULE\Plugin\search_api\processor;
use Drupal\search_api\Processor\FieldsProcessorPluginBase;
/**
* Adds synonyms/alternate date formats for ISO 8601 dates.
*
* @SearchApiProcessor(
* id = "iso8601_date_synonymizer",
* label = @Translation("ISO 8601 date synonymizer"),
* description = @Translation("Adds alternative representations of ISO 8601-formatted dates at index time. For example, <em>1986-04-01</em> gets indexed as both <em>1986-04-01</em> and <em>April 1, 1986</em>. If using the 'ISO 8601 Date tokenizer', make sure this is invoked before it."),
* stages = {
* "pre_index_save" = -1,
* "preprocess_index" = -1,
* }
* )
*/
class Iso8601DateSynonymizer extends FieldsProcessorPluginBase {
/**
* Matches ISO 8601 dates in the format YYYY-MM-DD, YYYY-MM, or YYYY.
*
* Each date component is captured in a named group.
*/
const REGEX_ISO_8601_DATE_WITH_CAPTURE =
'/^(?<year>[0-9]{4})(?>-(?<month>1[0-2]|0[1-9])(?>-(?<day>3[01]|0[1-9]|[12][0-9]))?)?$/';
/**
* {@inheritdoc}
*/
protected function testType($type) {
// Only apply to "text" types since we return the synonyms as distinct,
// space-separated "words" (e.g. "1986-Apr-01 Apr-1986"), so in order for
// searches to match, the index needs to treat this as full-text. Otherwise,
// the user would have to search for *exactly* "1986-Apr-01 Apr-1986",
// which isn't helpful.
return $this->getDataTypeHelper()->isTextType($type);
}
/**
* {@inheritdoc}
*/
protected function process(&$value) {
// In the absence of the tokenizer processor, this ensures split words.
$words =
preg_split('/\s/', strip_tags($value), -1, PREG_SPLIT_NO_EMPTY);
$new_words = [];
foreach ($words as $word) {
$date_synonyms = $this->getDateSynonyms($word);
// Keep original, unmodified value.
$new_words[] = $word;
if (!empty($date_synonyms)) {
$new_words[] = implode(' ', $date_synonyms);
}
}
$value = implode(' ', $new_words);
}
/**
* Gets all the alternative ways to format the given date string.
*
* @param string $value
* A date string for which synonyms/alternate formats are desired.
*
* @return string[]
* An array containing all the alternate ways to format the given date. For
* example:
* @code
* [
* '04/01/1986',
* '04/01/86',
* '01/04/1986',
* '01/04/86',
* '01-Apr-1986',
* '01 Apr 1986',
* 'April 01, 1986',
* 'Apr-1986',
* 'Apr 1986',
* 'April 1986',
* ]
* @endcode
*/
protected function getDateSynonyms(string $value): array {
$synonyms = [];
if (preg_match(self::REGEX_ISO_8601_DATE_WITH_CAPTURE, $value, $matches)) {
$month_number = $matches['month'] ?? NULL;
$day_number = $matches['day'] ?? NULL;
if ($day_number !== NULL) {
$date = \DateTime::createFromFormat('!Y-m-d', $value);
// US - "04/01/1986".
$synonyms[] = $date->format('m/d/Y');
// US - "04/01/86".
$synonyms[] = $date->format('m/d/y');
// EU - "01/04/1986".
$synonyms[] = $date->format('d/m/Y');
// EU - "01/04/86".
$synonyms[] = $date->format('d/m/y');
// "01-Apr-1986".
$synonyms[] = $date->format('d-M-Y');
// "01 Apr 1986".
$synonyms[] = $date->format('d M Y');
// "April 01, 1986".
$synonyms[] = $date->format('F d, Y');
}
elseif ($month_number !== NULL) {
$date = \DateTime::createFromFormat('!Y-m', $value);
}
if (isset($date)) {
// "Apr-1986".
$synonyms[] = $date->format('M-Y');
// "Apr 1986".
$synonyms[] = $date->format('M Y');
// "April 1986".
$synonyms[] = $date->format('F Y');
}
}
return $synonyms;
}
}
Processor #2 - ISO 8601 Date Tokenizer
This processor converts each date into the various alternate search tokens that should match on the date value.
The code for this processor can be found below. This should go in a file called
src/Plugin/search_api/processor/Iso8601DateTokenizer.php
inside your custom
module.
<?php
namespace Drupal\YOUR_MODULE\Plugin\search_api\processor;
use Drupal\search_api\Processor\FieldsProcessorPluginBase;
/**
* Tokenizes ISO 8601 dates into YYYY-MM-DD, YYYY-MM, and YYYY formats.
*
* @SearchApiProcessor(
* id = "iso8601_date_tokenizer",
* label = @Translation("ISO 8601 date tokenizer"),
* description = @Translation("Tokenizes ISO 8601-formatted dates at index time. For example, <em>1986-04-01</em> gets indexed as <em>1986-04-01</em>, <em>1986-04</em>, and <em>1986</em>."),
* stages = {
* "pre_index_save" = 0,
* "preprocess_index" = 0,
* }
* )
*/
class Iso8601DateTokenizer extends FieldsProcessorPluginBase {
/**
* Matches ISO 8601 dates in the format YYYY-MM-DD, YYYY-MM, or YYYY.
*/
const REGEX_ISO_8601_DATE =
'/^([0-9]{4})(?>-(1[0-2]|0[1-9])(?>-(3[01]|0[1-9]|[12][0-9]))?)?$/';
/**
* {@inheritdoc}
*/
protected function testType($type) {
// Only apply to "text" types since we return the permutations as distinct,
// space-separated "words" (e.g. "1986-04-01 1986-04 1986"), so in order for
// searches to match, the index needs to treat this as full-text. Otherwise,
// the user would have to search for *exactly* "1986-04-01 1986-04 1986",
// which isn't helpful.
return $this->getDataTypeHelper()->isTextType($type);
}
/**
* {@inheritdoc}
*/
protected function process(&$value) {
// In the absence of the tokenizer processor, this ensures split words.
$words =
preg_split('/\s/', strip_tags($value), -1, PREG_SPLIT_NO_EMPTY);
$new_words = [];
foreach ($words as $word) {
$date_permutations = $this->getDatePermutations($word);
if (empty($date_permutations)) {
// Pass through unmodified.
$new_words[] = $word;
}
else {
$new_words[] = implode(' ', $date_permutations);
}
}
$value = implode(' ', $new_words);
}
/**
* Gets all the substrings that should match on the given date string.
*
* @param string $value
* A date string for which token permutations are desired.
*
* @return string[]
* All token permutations for the given date. For example:
* @code
* [
* '1986-04-01',
* '1986-04',
* '1986',
* ]
* @endcode
*/
protected function getDatePermutations(string $value): array {
$permutations = [];
if (preg_match(self::REGEX_ISO_8601_DATE, $value, $matches)) {
$date_parts = array_slice($matches, 1);
$total_part_count = count($date_parts);
for ($perm_index = 0; $perm_index < $total_part_count; ++$perm_index) {
$permutations_parts =
array_slice($date_parts, 0, $total_part_count - $perm_index);
$permutations[] = implode('-', $permutations_parts);
}
}
return $permutations;
}
}
Enabling the Processors
After creating a custom module, adding code to the appropriate files, and adjusting the namespace names in the files to match the name of our module, we are ready to test out the new processors.
Step 1: Enable Your Custom Module or Clear Caches
Before you can set them up, you need to
enable your custom module.
If you added the processors to a module that was already installed (or you
later modify any of the information in the SearchApiProcessor
annotation),
you’ll want to
clear caches/rebuild Drupal’s registry.
Step 3: Configure Date Text Fields for Indexing
Next, you will want to ensure that your date field(s) are indexed as “Fulltext
Unstemmed” on the “Fields” tab of your search index
(/admin/config/search/search-api/index/YOUR_INDEX/fields
). If you index them
as “String” values, searches won’t work properly.
Also note: although these processors were written to work with single-value “Text” fields in ISO 8601 date format, we have tested them with standard Drupal date fields, and they work for those as well as long as the index is configured to index the date fields as “Fulltext Unstemmed” and not “Date” values. In our testing, it appears that when Search API is configured to index a Drupal date as text, it automatically converts the dates into ISO 8601 date strings under the hood.
Step 2: Enable and Configure the New Processors
Now, visit the “Processors” tab of your search index
(e.g. /admin/config/search/search-api/index/YOUR_INDEX/processors
)
and enable both the “ISO 8601 date synonymizer” and “ISO 8601 date tokenizer”
processors. Make sure that the synonymizer runs BEFORE the tokenizer (i.e. in
the “Preprocess index” under “Processor order”, the “ISO 8601 date synonymizer”
should appear above the “ISO 8601 date tokenizer” in the list). Then, under
“Processor settings”, visit the vertical tab for each processor and uncheck all
fields except your date field(s).
Step 4: Re-index content
After making all these changes, remember to re-index your content.
Step 5: Test!
It’s finally time to try out a few searches. For a record containing a date like
1986-04-01
, all of the following searches should match it:
1986-04-01
1986-04
1986
04/01/1986
04/01/86
01/04/1986
01/04/86
01-Apr-1986
01 Apr 1986
April 01, 1986
Apr-1986
Apr 1986
April 1986
You may notice that formats like “1/4/1986” and “4/1/1986” aren’t supported. This is intentional – the synonymizer doesn’t expose one-digit months or days because Solr only indexes words or numbers that are at least two digits.