Custom Algorithms
=================

OpenIMC provides base classes for integrating new segmentation, clustering, and
feature extraction algorithms into the framework. Each base class defines an
interface with standardized input and output formats, so a new method stays
compatible with the existing OpenIMC pipeline.

Overview
--------

The base classes are located in ``openimc.processing.base`` and provide:

- **Clear interface definitions**: Standardized input/output formats
- **Input validation**: ``validate_inputs()`` checks inputs before processing
- **Output validation**: ``validate_output()`` checks results after processing
- **Documentation**: Comprehensive docstrings explaining expected formats
- **Error handling**: Consistent error messages and exception types

Base Classes
------------

BaseSegmenter
~~~~~~~~~~~~~

Abstract base class for segmentation algorithms.

**Location**: ``openimc.processing.base.BaseSegmenter``

**Expected Inputs**:

- ``nuclear_image``: ``np.ndarray``, shape ``(H, W)``, dtype ``float32``

  - Preprocessed nuclear channel image (0-1 normalized)

- ``cyto_image``: ``np.ndarray``, shape ``(H, W)``, dtype ``float32``, optional

  - Preprocessed cytoplasm channel image (0-1 normalized)

- ``**kwargs``: Additional algorithm-specific parameters

**Expected Output**:

- ``mask``: ``np.ndarray``, shape ``(H, W)``, dtype ``uint32``

  - Segmentation mask where each cell has a unique integer label
  - ``0`` = background, ``1+`` = cell labels
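
The output contract above can be expressed as a few NumPy checks. The sketch below is a standalone helper, not part of OpenIMC: the name ``check_label_mask`` is hypothetical, and the real ``validate_output()`` may check more or less than this.

```python
import numpy as np


def check_label_mask(mask, expected_shape):
    """Raise ValueError if `mask` breaks the label-mask contract described above."""
    if not isinstance(mask, np.ndarray):
        raise ValueError(f"mask must be a NumPy array, got {type(mask).__name__}")
    if mask.ndim != 2 or mask.shape != tuple(expected_shape):
        raise ValueError(f"mask shape {mask.shape} does not match {tuple(expected_shape)}")
    if mask.dtype != np.uint32:
        raise ValueError(f"mask dtype must be uint32, got {mask.dtype}")


# Hypothetical 4x4 mask with two cells (labels 1 and 2) on a background of 0.
example = np.zeros((4, 4), dtype=np.uint32)
example[0:2, 0:2] = 1
example[2:4, 2:4] = 2
check_label_mask(example, (4, 4))  # passes silently
```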

**Example Implementation**:

.. code-block:: python

    from openimc.processing.base import BaseSegmenter
    import numpy as np
    from scipy import ndimage

    class MyCustomSegmenter(BaseSegmenter):
        def __init__(self):
            super().__init__(name="my_custom_segmenter")

        def segment(self, nuclear_image, cyto_image=None, **kwargs):
            # Validate inputs (optional, but recommended)
            self.validate_inputs(nuclear_image, cyto_image)

            # Your segmentation algorithm here
            # ...

            # Create mask (example: threshold, then give each connected
            # foreground region its own integer label)
            threshold = kwargs.get('threshold', 0.5)
            labeled, _ = ndimage.label(nuclear_image > threshold)
            mask = labeled.astype(np.uint32)

            # Validate output (optional, but recommended)
            self.validate_output(mask, nuclear_image.shape)

            return mask
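
A label mask often contains objects too small to be cells, which a segmenter may want to drop before returning. The sketch below shows one NumPy-only way to do that while keeping labels consecutive; ``remove_small_labels`` and its ``min_area`` parameter are hypothetical names, not OpenIMC API.

```python
import numpy as np


def remove_small_labels(mask, min_area):
    """Zero out labels covering fewer than `min_area` pixels, then relabel 1..N."""
    # areas[k] is the pixel count of label k; minlength guards against empty input.
    areas = np.bincount(mask.ravel().astype(np.intp), minlength=1)
    keep = areas >= min_area
    keep[0] = False  # background never becomes a cell
    # Lookup table: kept labels get consecutive ids, everything else maps to 0.
    lookup = np.zeros(areas.size, dtype=np.uint32)
    lookup[keep] = np.arange(1, keep.sum() + 1, dtype=np.uint32)
    return lookup[mask]


# Hypothetical mask: label 1 is a single pixel, label 2 is a 3x3 block.
mask = np.zeros((5, 5), dtype=np.uint32)
mask[0, 0] = 1
mask[1:4, 1:4] = 2
cleaned = remove_small_labels(mask, min_area=4)
print(np.unique(cleaned))  # [0 1]
```

The lookup-table approach relabels the whole image in one indexing operation, so the cost does not grow with the number of cells removed.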

BaseClusterer
~~~~~~~~~~~~~

Abstract base class for clustering algorithms.

**Location**: ``openimc.processing.base.BaseClusterer``

**Expected Inputs**:

- ``features_df``: ``pd.DataFrame``

  - Feature matrix with one row per cell and one column per feature
  - Required columns: None (all numeric columns are used)
  - Excluded columns: ``'cell_id'``, ``'acquisition_id'``, ``'acquisition_name'``, ``'well'``, ``'cluster'``, ``'label'``, ``'source_file'``, etc.

- ``columns``: ``List[str]``, optional

  - Specific feature columns to use for clustering
  - If ``None``, auto-detects all numeric columns

- ``**kwargs``: Additional algorithm-specific parameters
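
The column selection rule described above can be sketched with pandas alone. This is an illustration under assumptions, not the OpenIMC implementation: ``select_feature_columns`` and ``EXCLUDED`` are hypothetical names, and the real exclusion list may contain more entries than the ones named here.

```python
import pandas as pd

# Metadata columns named in the list above; the real exclusion list may be longer.
EXCLUDED = {"cell_id", "acquisition_id", "acquisition_name", "well",
            "cluster", "label", "source_file"}


def select_feature_columns(features_df, columns=None):
    """Return `columns` if given, else every numeric column that is not metadata."""
    if columns is not None:
        missing = [c for c in columns if c not in features_df.columns]
        if missing:
            raise ValueError(f"Columns not found in features_df: {missing}")
        return list(columns)
    numeric = features_df.select_dtypes(include="number").columns
    return [c for c in numeric if c not in EXCLUDED]


# Hypothetical feature table with two metadata columns and two features.
df = pd.DataFrame({
    "cell_id": [1, 2, 3],
    "source_file": ["a.mcd", "a.mcd", "a.mcd"],
    "area": [40.0, 52.5, 61.0],
    "CD45_mean": [0.1, 0.7, 0.3],
})
print(select_feature_columns(df))  # ['area', 'CD45_mean']
```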

**Expected Output**:

- ``features_df``: ``pd.DataFrame``

  - Same DataFrame as input with ``'cluster'`` column added
  - ``'cluster'`` column: ``int``, 1-based cluster labels (``0`` = unassigned/noise)
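
Many clustering libraries return 0-based labels and mark noise points with ``-1`` (scikit-learn's ``DBSCAN`` does this). Adding 1 maps both cases onto the convention above. The helper name ``to_one_based`` is hypothetical and the sketch assumes no raw label is below ``-1``.

```python
import numpy as np


def to_one_based(raw_labels):
    """Shift 0-based labels to 1-based; a raw label of -1 (noise) becomes 0."""
    raw_labels = np.asarray(raw_labels)
    if raw_labels.size and raw_labels.min() < -1:
        raise ValueError("raw labels below -1 are not supported")
    return (raw_labels + 1).astype(int)


print(to_one_based([-1, 0, 1, 1]))  # [0 1 2 2]
```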

**Example Implementation**:

.. code-block:: python

    from openimc.processing.base import BaseClusterer
    import pandas as pd
    from sklearn.cluster import KMeans
    
    class MyCustomClusterer(BaseClusterer):
        def __init__(self):
            super().__init__(name="my_custom_clusterer")
        
        def cluster(self, features_df, columns=None, **kwargs):
            # Validate and prepare inputs
            data, column_names = self.validate_inputs(features_df, columns)
            original_shape = features_df.shape
            
            # Your clustering algorithm here
            n_clusters = kwargs.get('n_clusters', 5)
            seed = kwargs.get('seed', 42)  # fixed seed keeps runs reproducible
            kmeans = KMeans(n_clusters=n_clusters, random_state=seed)
            cluster_labels = kmeans.fit_predict(data.values)
            
            # Convert to 1-based labels
            cluster_labels = (cluster_labels + 1).astype(int)
            
            # Add cluster column
            result_df = features_df.copy()
            result_df['cluster'] = cluster_labels
            
            # Validate output
            self.validate_output(result_df, original_shape)
            
            return result_df

BaseFeatureExtractor
~~~~~~~~~~~~~~~~~~~~

Abstract base class for feature extraction algorithms.

**Location**: ``openimc.processing.base.BaseFeatureExtractor``

**Expected Inputs**:

- ``mask``: ``np.ndarray``, shape ``(H, W)``, dtype ``uint32``

  - Segmentation mask with cell labels (``0`` = background, ``1+`` = cells)

- ``image_stack``: ``np.ndarray``, shape ``(H, W, C)``, dtype ``float32``

  - Image stack with ``C`` channels

- ``channel_names``: ``List[str]``, length ``C``

  - Names of each channel in ``image_stack``

- ``**kwargs``: Additional algorithm-specific parameters

**Expected Output**:

- ``features_df``: ``pd.DataFrame``

  - Feature matrix with one row per cell
  - Required columns:

    - ``'cell_id'``: ``int``, unique identifier for each cell (1-based)
    - ``'label'``: ``int``, cell label from mask (1-based)

  - Additional feature columns (algorithm-specific)

**Example Implementation**:

.. code-block:: python

    from openimc.processing.base import BaseFeatureExtractor
    import numpy as np
    import pandas as pd
    
    class MyCustomFeatureExtractor(BaseFeatureExtractor):
        def __init__(self):
            super().__init__(name="my_custom_extractor")
        
        def extract(self, mask, image_stack, channel_names, **kwargs):
            # Validate inputs
            self.validate_inputs(mask, image_stack, channel_names)
            
            # Get unique cell labels (exclude background = 0)
            unique_labels = np.unique(mask)
            unique_labels = unique_labels[unique_labels > 0]
            
            features_list = []
            for idx, label in enumerate(unique_labels):
                cell_id = idx + 1  # 1-based
                features = {'cell_id': cell_id, 'label': int(label)}
                
                # Create binary mask for this cell
                cell_mask = (mask == label)
                
                # Extract your custom features here
                # ...
                
                features_list.append(features)
            
            # Create DataFrame
            features_df = pd.DataFrame(features_list)
            
            # Validate output
            expected_n_cells = len(unique_labels)
            self.validate_output(features_df, expected_n_cells)
            
            return features_df
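
The placeholder in the loop above could be filled in many ways. As one possibility, the standalone sketch below computes pixel area and per-channel mean intensity for each cell. The function name ``basic_cell_features`` and the ``<channel>_mean`` column naming are assumptions made for illustration, not OpenIMC conventions.

```python
import numpy as np
import pandas as pd


def basic_cell_features(mask, image_stack, channel_names):
    """One row per cell: pixel area plus the mean intensity of every channel."""
    labels = np.unique(mask)
    labels = labels[labels > 0]  # drop background
    rows = []
    for idx, label in enumerate(labels):
        # Boolean (H, W) index on an (H, W, C) stack gives shape (n_pixels, C).
        cell_pixels = image_stack[mask == label]
        row = {"cell_id": idx + 1, "label": int(label),
               "area": int(cell_pixels.shape[0])}
        for name, value in zip(channel_names, cell_pixels.mean(axis=0)):
            row[f"{name}_mean"] = float(value)
        rows.append(row)
    # Fix the column order so an empty mask still yields the required columns.
    columns = ["cell_id", "label", "area"] + [f"{n}_mean" for n in channel_names]
    return pd.DataFrame(rows, columns=columns)


# Hypothetical 2x2 image with one channel and two cells.
mask = np.array([[1, 1], [0, 2]], dtype=np.uint32)
stack = np.array([[[1.0], [3.0]], [[0.0], [5.0]]], dtype=np.float32)
print(basic_cell_features(mask, stack, ["DNA1"]))
```

Passing ``columns`` explicitly means a mask with no cells produces an empty table that still has ``'cell_id'`` and ``'label'``, which covers the empty-mask edge case.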

Integration with OpenIMC
------------------------

Once you've implemented a custom algorithm, you can integrate it into OpenIMC
by adding a branch for it to the relevant core function. The snippets below are
abbreviated: ``...`` stands for existing code and arguments.

Segmentation Integration
~~~~~~~~~~~~~~~~~~~~~~~~

Modify ``openimc.core.segment()`` to add your segmenter:

.. code-block:: python

    def segment(loader, acquisition, method, ...):
        # ... existing code ...
        
        elif method == 'my_custom_segmenter':
            from my_module import MyCustomSegmenter
            
            # Preprocess channels (same as other methods)
            nuclear_img, cyto_img = _preprocess_channels_for_segmentation(...)
            
            # Create segmenter instance
            segmenter = MyCustomSegmenter()
            
            # Run segmentation
            mask = segmenter.segment(
                nuclear_img,
                cyto_image=cyto_img,
                threshold=0.5,  # Your custom parameters
                min_cell_area=50
            )
        
        # ... rest of code ...

Clustering Integration
~~~~~~~~~~~~~~~~~~~~~~

Modify ``openimc.core.cluster()`` to add your clusterer:

.. code-block:: python

    def cluster(features_df, method='leiden', ...):
        # ... existing code ...
        
        elif method == 'my_custom_clusterer':
            from my_module import MyCustomClusterer
            
            # Create clusterer instance
            clusterer = MyCustomClusterer()
            
            # Run clustering
            result_df = clusterer.cluster(
                features_df,
                columns=columns,
                n_clusters=5,  # Your custom parameters
                seed=42
            )
            
            return result_df
        
        # ... rest of code ...

Feature Extraction Integration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modify ``openimc.processing.feature_worker.extract_features_for_acquisition()``
to add your extractor:

.. code-block:: python

    def extract_features_for_acquisition(..., feature_extractor=None):
        # ... existing code ...
        
        if feature_extractor is not None:
            # Use custom extractor
            from my_module import MyCustomFeatureExtractor
            extractor = MyCustomFeatureExtractor()
            features_df = extractor.extract(
                mask,
                img_stack,
                channel_names,
                morphological=True,
                intensity=True
            )
        else:
            # Use default extractor
            # ... existing code ...

Example Implementations
------------------------

Complete example implementations are available in ``openimc.processing.examples``:

- ``ExampleThresholdSegmenter``: Simple thresholding-based segmentation
- ``ExampleKMeansClusterer``: K-means clustering implementation
- ``ExampleBasicFeatureExtractor``: Basic morphological and intensity features

These examples demonstrate:

- Proper input validation
- Correct output format
- Error handling
- Integration patterns

Best Practices
--------------

1. **Always validate inputs**: Use the ``validate_inputs()`` method before processing
2. **Always validate outputs**: Use the ``validate_output()`` method after processing
3. **Handle edge cases**: Empty masks, no cells, missing channels, etc.
4. **Document parameters**: Clearly document all ``**kwargs`` parameters
5. **Preserve data types**: Ensure outputs match expected dtypes (uint32, float32, etc.)
6. **Use 1-based indexing**: Cell IDs and labels should start at 1 (0 = background/unassigned)
7. **Handle memory efficiently**: For large datasets, process in chunks if needed
8. **Provide informative errors**: Raise ``ValueError`` with clear messages for invalid inputs

Common Pitfalls
---------------

1. **Wrong dtype**: Segmentation masks must be ``uint32``, not ``uint8`` or ``int32``
2. **Wrong indexing**: Cell IDs and labels must be 1-based (1, 2, 3, ...), not 0-based
3. **Missing required columns**: Feature DataFrames must have ``'cell_id'`` and ``'label'`` columns
4. **Shape mismatches**: A mask must have the same ``(H, W)`` shape as the input image, and a clustered DataFrame must keep one row per input row
5. **NaN values**: Cluster labels cannot contain NaN (use 0 for unassigned cells)
6. **Background handling**: Background pixels should be labeled as 0 in masks
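
Two of these pitfalls can be guarded against with small conversion helpers. The sketch below is illustrative only; ``as_uint32_mask`` and ``fill_unassigned`` are hypothetical names. The negative-value check matters because casting a negative integer to ``uint32`` wraps around to a very large label instead of failing.

```python
import numpy as np
import pandas as pd


def as_uint32_mask(mask):
    """Cast an integer label mask to uint32, refusing values that would wrap."""
    mask = np.asarray(mask)
    if not np.issubdtype(mask.dtype, np.integer):
        raise ValueError(f"expected an integer mask, got dtype {mask.dtype}")
    if mask.size and mask.min() < 0:
        raise ValueError("negative labels would wrap around when cast to uint32")
    return mask.astype(np.uint32)


def fill_unassigned(features_df):
    """Return a copy whose 'cluster' column is int, with NaN replaced by 0."""
    result = features_df.copy()
    result["cluster"] = result["cluster"].fillna(0).astype(int)
    return result


mask = as_uint32_mask(np.array([[0, 1], [2, 2]], dtype=np.int32))
df = fill_unassigned(pd.DataFrame({"cell_id": [1, 2, 3],
                                   "cluster": [2.0, np.nan, 1.0]}))
print(mask.dtype, df["cluster"].tolist())  # uint32 [2, 0, 1]
```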

Testing Your Implementation
----------------------------

Before integrating your algorithm, test it with the base class validation:

.. code-block:: python

    import numpy as np
    from my_module import MyCustomSegmenter
    
    # Create test data
    nuclear_img = np.random.rand(100, 100).astype(np.float32)
    
    # Test segmenter
    segmenter = MyCustomSegmenter()
    mask = segmenter.segment(nuclear_img, threshold=0.5)
    
    # segment() already calls validate_output(), but explicit checks
    # make the contract visible in the test:
    assert mask.dtype == np.uint32
    assert mask.shape == nuclear_img.shape
    assert mask.max() >= 1  # at least one cell was found

For more complex testing, use the OpenIMC test fixtures and integration tests.