Computer Vision for Drug Discovery: Where Are We and What is Beyond?

Cell line transferability challenge

Overview

In this challenge, you will develop machine learning models for learning representations of molecular perturbations in cellular systems using microscopy imaging data.

More details are available on Kaggle.

Description

Learning representations of molecular perturbations in cells is a crucial task in drug discovery. Robust representations that capture how a given perturbation affects cells would allow us to rapidly find compounds that produce similar effects. However, experimental variability and other sources of variation unrelated to the perturbation's biological effect make this a challenging task. We therefore present a compound classification task in which models are evaluated on both unseen plates and unseen cell lines.

Dataset

The competition uses the publicly available 2020_11_04_CPJUMP1 dataset (see description). The dataset contains images of two cell lines, U2OS and A549, treated with chemical and genetic perturbations in multiple replicates. Images and the corresponding CellProfiler features are available.

Task

Develop a method for transferring information between cell lines, using either CellProfiler features or raw images (deep learning methods). The representations will be evaluated on a set of 24 plates that contain compound-perturbed cells.

The file metadata_test.csv contains the column ID, which identifies each sample for which participants should submit a representation, along with the sample's plate and well.

The submission file should be a CSV file containing a column called ID, which matches the plate and well from metadata_test.csv, and a set of float columns holding the features. There can be any number of feature columns, and they will be used in the given order. Each feature column should hold a single value per sample. Features should not have any missing or infinite values.
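
As a rough illustration of these format rules, the following Python sketch checks a submission before upload. It is not the organizers' validator: the function name check_submission and the sample IDs are hypothetical, and the real IDs come from metadata_test.csv.

```python
import csv
import io
import math


def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission; an empty list means none were found.

    Hypothetical helper: it only encodes the rules stated in the task description.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if "ID" not in columns:
        return ["missing ID column"]

    problems = []
    feature_cols = [c for c in columns if c != "ID"]
    if not feature_cols:
        problems.append("no feature columns")

    seen = []
    for row in reader:
        seen.append(row["ID"])
        # DictReader stores surplus cells under the key None.
        if row.get(None):
            problems.append(f"{row['ID']}: more values than header columns")
        for col in feature_cols:
            try:
                value = float(row[col])  # fails on empty or non-numeric cells
            except (TypeError, ValueError):
                problems.append(f"{row['ID']}/{col}: missing or non-numeric value")
                continue
            if not math.isfinite(value):
                problems.append(f"{row['ID']}/{col}: NaN or infinite value")

    if len(seen) != len(set(seen)):
        problems.append("duplicate IDs")
    if set(seen) != set(expected_ids):
        problems.append("IDs do not match the expected set")
    return problems


# Toy example with made-up IDs and two feature columns.
example = "ID,feature_1,feature_2\nsample_a,0.12,-1.5\nsample_b,0.40,2.0\n"
print(check_submission(example, ["sample_a", "sample_b"]))
```

In practice, the expected IDs would be read from the ID column of metadata_test.csv rather than typed by hand.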

Files

metadata_test.csv - Identifiers of the samples that should be submitted, along with their corresponding plates and well positions

sample_submission.csv - A sample submission file in the correct format

Columns

ID - Sample identifier that corresponds to the plate and well in metadata_test.csv

feature_1 - First feature

…

feature_N - Last feature

References

Chandrasekaran, S.N., Cimini, B.A., Goodale, A. et al. Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations. Nat Methods 21, 1114–1121 (2024). https://doi.org/10.1038/s41592-024-02241-6

Evaluation

The submitted representations will be evaluated on a compound classification task using the kNN algorithm with 5-fold cross-validation. Metrics will be calculated using both inter- and intra-cell-line matching. The train/test split is undisclosed but is plate-wise. Evaluation will use only the compound plates of the 2020_11_04_CPJUMP1 dataset and will omit DMSO wells, but any data can be used to develop and train a method. Participants will also be asked to send a paper describing their methodology (up to 4 pages, not counting references).

Inter- and intra-cell-line matching:

Train using U2OS data, evaluate using U2OS data.
Train using U2OS data, evaluate using A549 data.
Train using A549 data, evaluate using U2OS data.
Train using A549 data, evaluate using A549 data.
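
The four pairings above can be sketched as a small kNN evaluation on toy data. This is only an illustration under stated assumptions, not the organizers' evaluation code: the value of k, the use of cosine similarity, the single held-out plate split (standing in for one fold of the 5-fold cross-validation), and all feature values, plate names, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(train, query, k=1):
    """train is a list of (features, compound); return the majority compound among the k nearest."""
    ranked = sorted(train, key=lambda item: cosine_similarity(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train, test, k=1):
    hits = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return hits / len(test)


def cell_line_matrix(samples, held_out_plates, k=1):
    """Accuracy for every (train cell line, evaluation cell line) pairing.

    Training uses plates outside held_out_plates; evaluation uses only held-out
    plates, so the split is plate-wise.
    """
    lines = sorted({s["cell_line"] for s in samples})
    results = {}
    for train_line in lines:
        train = [(s["features"], s["compound"]) for s in samples
                 if s["cell_line"] == train_line and s["plate"] not in held_out_plates]
        for eval_line in lines:
            test = [(s["features"], s["compound"]) for s in samples
                    if s["cell_line"] == eval_line and s["plate"] in held_out_plates]
            results[(train_line, eval_line)] = knn_accuracy(train, test, k)
    return results


# Made-up two-dimensional representations for two compounds, A and B.
samples = [
    {"cell_line": "U2OS", "plate": "P1", "compound": "A", "features": [1.0, 0.0]},
    {"cell_line": "U2OS", "plate": "P1", "compound": "B", "features": [0.0, 1.0]},
    {"cell_line": "U2OS", "plate": "P2", "compound": "A", "features": [0.9, 0.1]},
    {"cell_line": "U2OS", "plate": "P2", "compound": "B", "features": [0.1, 0.9]},
    {"cell_line": "A549", "plate": "P3", "compound": "A", "features": [1.0, 0.2]},
    {"cell_line": "A549", "plate": "P3", "compound": "B", "features": [0.2, 1.0]},
    {"cell_line": "A549", "plate": "P4", "compound": "A", "features": [0.8, 0.1]},
    {"cell_line": "A549", "plate": "P4", "compound": "B", "features": [0.0, 0.7]},
]

for pairing, accuracy in cell_line_matrix(samples, held_out_plates={"P2", "P4"}).items():
    print(pairing, accuracy)
```

A representation transfers well between cell lines when the two off-diagonal pairings (train on one cell line, evaluate on the other) score close to the two within-cell-line pairings.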

The evaluation system will be opened on February 15th 2024.

Outcomes

Teams that obtain the best results may be invited to collaborate on a joint publication.

Timeline:

Start (call for participation): January 20, 2025, 07:59 AM UTC

Registration deadline: February 22, 2025, 07:59 AM UTC

Submission deadline: April 1, 2025, 07:59 AM UTC

Decisions to participants: May 1, 2025, 07:59 AM UTC

Challenge organizers:

	Adriana Borowa, Ardigen SA
	Ana Sanchez-Fernandez, Johannes Kepler University Linz
