69

Gulf of Maine Research Institute

Learning Ecosystems Northeast Website

PostgreSQL Full-Text Search Engine for NASA-Funded Climate Education Platform

Website →
Learning Ecosystems Northeast Website
Date: Jul 2025 - Sep 2025
Project Types:#Upswell#Web
Tech Stacks:#Javascript#Vue.js#TailwindCSS#Python#Django#Wagtail#AWS ECS

📌The Challenge

NASA awarded GMRI a grant to build Learning Ecosystems Northeast (LENE), a new climate education platform serving 50+ Community Learning Ecosystems across 12 states. Requirements:

  • Unified search across 3 content types (Articles, Events, CLE Pages) with sub-100ms response times
  • Faceted filtering by CLE location (50 U.S. states), Article Type, and Climate Topics
  • Headless architecture to serve a Next.js frontend hosted on Cloudflare Pages
  • Related content recommendations (e.g., show 3 related articles on each CLE page)
  • Mixed date handling: Articles have published_date, Events have event_start_datetime, some content has neither
  • Budget constraint: No Elasticsearch—keep infrastructure simple with single PostgreSQL database
Body Image

🛠️The Engineering

1. Database Architecture: DRY with Abstract Base Models

Designed BaseWorkDetailPage as an abstract base to avoid code duplication across Articles, Events, and News:

class BaseWorkDetailPage(LENEBasePage):
    # Add custom search manager
    objects = LENEWorkDetailPageManager()
    search_vector = SearchVectorField(null=True, blank=True)
    
    # Shared fields across all content types
    description = models.TextField(...)
    published_date = models.DateField(null=True, ...)
    event_start_datetime = models.DateTimeField(null=True, ...)
    event_end_datetime = models.DateTimeField(null=True, ...)
    
    # Tag relationships (many-to-many via ClusterTaggableManager)
    cle = ClusterTaggableManager(through=TaggedCLE, ...)
    article_type = ClusterTaggableManager(through=TaggedArticleType, ...)
    collections = ClusterTaggableManager(through=TaggedClimateTopic, ...)
    
    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='work_detail_search_vector_idx')
        ]

This is an illustrative snippet and does not represent the production code.

Architectural decision: Single table inheritance would cause sparse columns. Abstract base model lets Django create separate tables (Articles, Events) with only relevant fields while sharing search logic.

2. PostgreSQL Full-Text Search with Weighted Vectors

Chose PostgreSQL FTS over Elasticsearch to avoid infrastructure complexity:

class LENEWorkDetailPageManager(PageManager):
    def search(self, query, tags=None):
        qs = self.get_queryset()
        
        if query:
            search_query = SearchQuery(query, config='english')
            qs = qs.annotate(
                rank=SearchRank(F('search_vector'), search_query)
            ).filter(search_vector=search_query).order_by('-rank')
        
        if tags:
            # AND across tag types, OR within each type's values
            for tag_type, tag_slugs in tags.items():
                if tag_slugs:
                    qs = qs.filter(**{f"{tag_type}__slug__in": tag_slugs})
        
        return qs.distinct()

This is an illustrative snippet and does not represent the production code.

Technical implementation:

  • GIN index on search_vector for 40-60ms queries
  • Weighted vectors: Title (A weight) > Description (B weight) via SearchVector()
  • Auto-update: Post-save signals trigger update_search_vector() on content changes
  • English config: Stemming support (e.g., "ocean" matches "oceanic", "acidification" matches "acidify")

3. Tag-Based Faceted Search with Dynamic Counts

Designed a 3-tier tag taxonomy with real-time facet count updates:

def list(self, request, *args, **kwargs):
    queryset = self.get_queryset()
    page = self.paginate_queryset(queryset)
    
    # Calculate facet counts based on current result set
    facets = {}
    for field in ['cle', 'article_type', 'collections']:
        facets[field] = list(
            getattr(queryset.model, field).through.objects
            .filter(content_object__in=queryset)
            .values('tag__slug')
            .annotate(count=Count('tag__slug'))
            .order_by('-count')
        )
    
    response = self.get_paginated_response(serializer.data)
    response.data['facets'] = facets
    return response

This is an illustrative snippet and does not represent the production code.

How it works: After filtering by "Maine", facets recalculate to show only article types/topics available for Maine content. Prevents showing "10 articles" for tags with 0 actual results.

4. Multi-Field Ordering with NULLS LAST

Designed conditional ordering logic to handle mixed content types:

# Don't reorder when search ranking applies
if not search_query:
    qs = qs.annotate(
        sort_date=Coalesce(
            'published_date',      # Articles use this
            'event_start_datetime', # Events use this
            output_field=DateTimeField()
        )
    ).order_by(F('sort_date').desc(nulls_last=True))

This is an illustrative snippet and does not represent the production code.

Edge case handling: Content without dates appears at the end, not the top (PostgreSQL default behavior for DESC ordering).

5. Optimized Related Content Queries

Built bidirectional tag relationships for "you might also like" features:

@cached_property
def related_pages(self):
    # Find CLETags pointing to this CLE page
    cle_tag = CLETag.objects.filter(cle_page__pk=self.pk)
    
    # Find articles tagged with those CLETags
    return (
        LENEArticlePage.objects
            .filter(cle__in=cle_tag)
            .live()
            .distinct()
            .order_by('-published_date')[:3]
    )

This is an illustrative snippet and does not represent the production code.

Performance optimization: @cached_property prevents repeated queries during API serialization. Without this, serializing 10 CLE pages would execute 10 separate SQL queries.

6. Headless CMS API Design

Built REST API endpoints using Django REST Framework:

# API routes
gmri_edu_api.register(r'lene/work-detail', LENEWorkDetailViewSet)
gmri_edu_api.register(r'lene/tag/climate-topic', ClimateTopicTagViewSet)
gmri_edu_api.register(r'lene/tag/cle', CLETagViewSet)
gmri_edu_api.register(r'lene/tag/article-type', ArticleTypeTagViewSet)

This is an illustrative snippet and does not represent the production code.

API features:

  • Pagination: 9 items per page with configurable ?size= parameter
  • Filtering: ?cle=maine&article_type=lesson-plan&search=ocean
  • Facet counts: Returned in every response for filter UI
  • CORS-enabled for Cloudflare Pages frontend

7. Tag Architecture with Search Vector Indexing

Designed hierarchical tag system with auto-slugification:

class BaseLENETag(TagBase):
    free_tagging = False  # Admin-curated only
    type = models.CharField(...)
    slug = models.SlugField(...)
    sv = SearchVectorField(...)  # Tags are searchable too
    
    def save(self, *args, **kwargs):
        self.type = type(self).__name__
        self.slug = slugify(self.name)
        super().save(*args, **kwargs)
    
    class Meta:
        indexes = [GinIndex(fields=['sv'])]

This is an illustrative snippet and does not represent the production code.

Tag hierarchy:

  • CLETag: 50 U.S. states with location field + reference to CLE detail page
  • ArticleTypeTag: Lesson plans, activities, case studies
  • ClimateTopicTag: Ocean acidification, sea level rise, etc.

Body Image

🚀The Impact

  • Built from scratch in 3 months
  • 40-60ms search queries for 500+ content items using PostgreSQL GIN indexes
  • Zero Elasticsearch costs—single PostgreSQL database handles all search needs
  • Headless architecture: Django REST API serves Next.js frontend on Cloudflare Pages
  • 3 content types unified under single search interface with tag-based faceting
  • 50+ CLE communities across 12 U.S. states served via location-tagged content
  • Sub-100ms API response times for paginated search with facet counts

✨Key Architectural Decisions

PostgreSQL FTS vs Elasticsearch:

At 500-item scale, Elasticsearch adds complexity (separate cluster, sync logic, monitoring) without performance gains. PostgreSQL GIN indexes provide sub-100ms queries while keeping all data in one database. Trade-off: If we hit 100k+ items, we'd need to migrate to Elasticsearch.

Abstract Base Models vs Single Table Inheritance:

Django supports STI, but it creates sparse columns (e.g., every Event would have unused Article fields). Abstract base models generate separate tables per content type while sharing the search manager and tag relationships. Cleaner schema, no wasted storage.

Custom Manager Pattern:

Encapsulated all search logic in LENEWorkDetailPageManager instead of putting it in views. Benefits: (1) Search logic reusable across API and admin, (2) Easier to test, (3) Follows Django best practices for queryset encapsulation.

Weighted Search Vectors:

Title gets "A" weight, description gets "B" weight. This ensures searching "ocean acidification" ranks articles with that exact title higher than articles mentioning it in body text. Implemented via SearchVector(Value(self.title), weight='A').

Post-save Signal Updates:

Search vectors auto-update on save via Django signals. Alternative (manual updates via management command) would risk stale search indexes. Trade-off: slight save latency (~20ms) for guaranteed index freshness.

Tag Filtering Logic (AND across types):

Implemented chained .filter() calls instead of Q() objects. Selecting "Maine" + "Climate Change" returns items with BOTH tags, not items with either. This makes filtering more restrictive as users add filters (expected UX behavior).