New Projects
Latest revision as of 16:20, 9 April 2026
The goal of this guide is to help you set up every new project so that it is well-organized from day one, fully reproducible, and ready to be made public at publication. We follow the philosophy of the Good Research Code Handbook by Patrick Mineault, but extend it to cover the full scope of a research project: experimental data, analysis code, lab notes, figures, posters, and papers.
The core principle is simple: one project = one paper = one repository. Everything related to a project lives together, is version-controlled, and is structured so that anyone (including future you) can understand and reproduce the work.
Prerequisites: Learn the Tools
Before you begin, make sure you have a working knowledge of the following. If you don't, work through the linked tutorials first.
Git and GitHub are the foundation of everything below. Git tracks changes to your files over time; GitHub hosts your repository online and enables collaboration.
- Software Carpentry: Version Control with Git
- GitHub Skills (interactive, browser-based tutorials)
- Pro Git Book (comprehensive reference)
The command line is needed for most of the setup steps below. You don't need to be an expert, but you should be comfortable navigating directories and running commands.
Python packaging and environments are essential for making your code portable and reproducible.
- The Good Research Code Handbook (read the whole thing, it's short and very good)
- Conda documentation
Step 1: Name Your Project and Create the Repository
Pick a short, descriptive name. This name will be used for the folder, the GitHub repository, and the installable Python package, so keep it lowercase with no spaces (use hyphens or underscores if needed). For example: `pupil-dynamics`, `pursuit-acceleration`, `meg-connectivity`.
Go to GitHub and create a new repository:
- Initialize it with a README and a .gitignore (select the Python template).
- Choose a licence. For open science, we recommend MIT for code-heavy projects or CC-BY 4.0 for data/content-heavy projects.
- Clone the repository to your local machine:
```bash
cd ~/projects
git clone https://github.com/YOUR-USERNAME/project-name.git
cd project-name
```
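As a quick sanity check, the naming rule above (lowercase, no spaces, hyphens or underscores allowed) can be written down as a small helper; the function name below is just an illustration, not part of any tool:

```python
import re

def is_valid_project_name(name):
    """Check the Step 1 naming convention: lowercase letters, digits,
    hyphens, or underscores, starting with a letter."""
    return re.fullmatch(r"[a-z][a-z0-9_-]*", name) is not None
```

For example, `is_valid_project_name("pupil-dynamics")` passes, while `"Pupil Dynamics"` does not.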
Step 2: Set Up the Directory Structure
Create the following folder structure inside your repository:
```
project-name/
├── code/                # reusable Python modules (your installable package)
│   └── __init__.py
├── data/
│   ├── raw/             # raw, untouched data (never modify these files)
│   └── processed/       # cleaned or transformed data
├── docs/
│   ├── labnotebook/     # electronic lab notes (Markdown files, dated)
│   └── protocols/       # experimental protocols and SOPs
├── results/
│   ├── figures/         # publication-quality figures
│   └── intermediate/    # checkpoints, intermediate outputs
├── scripts/             # analysis scripts and notebooks
├── outputs/
│   ├── papers/          # manuscript drafts (LaTeX or Markdown source)
│   └── posters/         # poster source files
├── tests/               # unit tests for your code
├── environment.yml      # conda environment specification
├── setup.py             # makes your code pip-installable
└── README.md            # project overview and instructions
```
You can create this in one command:
```bash
mkdir -p code data/{raw,processed} docs/{labnotebook,protocols} \
    results/{figures,intermediate} scripts outputs/{papers,posters} tests
touch code/__init__.py
```
A few things to note about this structure. The code/ directory holds reusable Python modules that you import (equivalent to src/ in the Good Research Code Handbook). The scripts/ directory holds analysis scripts and Jupyter notebooks that call functions from code/. The data/raw/ directory is sacred: raw data go in, but nothing ever comes back out modified. All transformations produce new files in data/processed/.
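If you prefer to create the skeleton from Python (for example on Windows, where the brace expansion above is not available), a short standard-library script can do the same job; the folder list simply mirrors the tree above:

```python
from pathlib import Path

# Directories from the Step 2 tree
SUBDIRS = [
    "code",
    "data/raw", "data/processed",
    "docs/labnotebook", "docs/protocols",
    "results/figures", "results/intermediate",
    "scripts",
    "outputs/papers", "outputs/posters",
    "tests",
]

def create_skeleton(root="."):
    """Create the project directory structure under `root`."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # An empty __init__.py makes code/ an importable package
    (root / "code" / "__init__.py").touch()

if __name__ == "__main__":
    create_skeleton()
```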
Step 3: Set Up a Virtual Environment
Every project gets its own conda environment. This ensures that your dependencies are documented and that the project can be reproduced on any machine.
```bash
conda create --name project-name python=3.11
conda activate project-name
conda install numpy scipy matplotlib pandas seaborn jupyter
```
Export the environment specification and commit it:
```bash
conda env export > environment.yml
git add environment.yml
git commit -m "Add conda environment specification"
```
Keep environment.yml up to date as you add packages. Anyone can recreate your environment with:
```bash
conda env create --file environment.yml
```
For more details on managing conda environments (including mixing pip and conda packages), see the Good Research Code Handbook: Setup.
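If you want to check which top-level packages your environment.yml pins (for example, to keep the README in sync), a rough standard-library sketch is below. It assumes the simple layout that `conda env export` produces; a real tool should use a YAML parser such as PyYAML instead:

```python
def conda_dependency_names(yml_text):
    """Naively list top-level dependency names from environment.yml text.
    Entries exported by conda look like '  - numpy=1.26.4=py311h...'.
    Nested pip sub-lists and other YAML subtleties are ignored."""
    names, in_deps = [], False
    for line in yml_text.splitlines():
        if line.rstrip() == "dependencies:":
            in_deps = True
        elif in_deps and line.startswith("  - "):
            names.append(line[4:].strip().split("=")[0])
        elif in_deps and line and not line.startswith(" "):
            in_deps = False  # reached the next top-level key
    return names
```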
Step 4: Make Your Code Pip-Installable
This step avoids the mess of sys.path hacks and makes your modules importable from anywhere in the project. Create a minimal setup.py in the project root:
```python
from setuptools import find_packages, setup

setup(
    name='project-name',
    packages=find_packages(),
)
```
Then install your package in editable mode:
```bash
pip install -e .
```
Now you can `import code.my_module` from any script or notebook in the project without worrying about paths. If you change the code, the changes are picked up automatically. See the Good Research Code Handbook: Setup for a more detailed walkthrough.
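To make the split between code/ and scripts/ concrete: a hypothetical module such as code/metrics.py (the file name and function below are invented for illustration) holds reusable logic, and your scripts import it after the editable install:

```python
# code/metrics.py (hypothetical example module)

def mean_abs_velocity(positions, dt):
    """Mean absolute velocity of a 1-D position trace sampled every dt seconds."""
    diffs = [b - a for a, b in zip(positions, positions[1:])]
    return sum(abs(v) for v in diffs) / (len(diffs) * dt)
```

A script in scripts/ would then simply start with `from code.metrics import mean_abs_velocity`, with no path manipulation.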
Step 5: Configure .gitignore and Data Tracking
Not everything belongs in Git. Add the following to your .gitignore:
```
# Data (tracked separately, see below)
data/
results/intermediate/

# Python
*.egg-info/
__pycache__/
*.pyc
.ipynb_checkpoints/

# OS files
.DS_Store
Thumbs.db

# Environment
.env
```
For large or sensitive data, Git is not the right tool. You have two main options depending on your situation:
- DataLad (recommended for neuroscience): built on Git and git-annex, it version-controls arbitrarily large files and integrates with BIDS and OpenNeuro. See the DataLad Handbook for a thorough tutorial.
- DVC (Data Version Control): a lightweight alternative that works well with remote storage backends (S3, Google Drive, etc.).
If your data are small enough to share directly (< 50 MB of text-based files, e.g., behavioural CSVs), you can keep them in the Git repository and remove data/ from .gitignore.
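For a quick sense of which files the patterns above will exclude, here is a deliberately simplified matcher. It handles only bare directory patterns and filename globs; the authoritative answer always comes from `git check-ignore -v <path>`:

```python
from fnmatch import fnmatch

def is_ignored(path, patterns):
    """Very rough .gitignore-style check: patterns ending in '/' match a
    directory anywhere in the path, everything else is matched as a glob
    against the file name. Not a substitute for git's own matching."""
    for pat in patterns:
        if pat.endswith("/"):
            if ("/" + path).find("/" + pat) != -1:
                return True
        elif fnmatch(path.rsplit("/", 1)[-1], pat):
            return True
    return False
```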
Step 6: Organize Your Data
Follow community standards for data organization wherever possible. For neuroimaging data (MRI, EEG, MEG), use the Brain Imaging Data Structure (BIDS). For behavioural and psychophysics data, adopt a consistent naming convention:
```
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_beh.csv
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_eyetrack.edf
```
The key principles are: use subject and session identifiers consistently, include the task name, separate metadata from data, and never modify raw files. Write a data/README.md that documents the naming convention, variable definitions, and any relevant acquisition parameters.
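Because the key-value naming scheme is regular, analysis scripts can parse filenames instead of hard-coding paths. A minimal parser is sketched below (illustrative only; for real BIDS datasets, use a dedicated library such as pybids):

```python
def parse_entities(filename):
    """Split a name like 'sub-01_ses-01_task-pursuit_beh.csv' into its
    key-value entities plus the trailing suffix and extension."""
    stem, _, ext = filename.rpartition(".")
    entities = {"extension": ext}
    for part in stem.split("_"):
        if "-" in part:
            key, _, value = part.partition("-")
            entities[key] = value
        else:
            entities["suffix"] = part  # e.g. 'beh' or 'eyetrack'
    return entities
```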
Step 7: Keep an Electronic Lab Notebook
Use the docs/labnotebook/ directory for dated Markdown entries. A simple naming convention works well:
```
docs/labnotebook/2026-03-30_pilot-data-collection.md
docs/labnotebook/2026-04-02_initial-analysis.md
```
Each entry should briefly note what you did, what you observed, any decisions you made, and links to relevant scripts or results. These notes are version-controlled along with everything else and provide a timestamped record of the project's evolution.
You can also use tools like Obsidian or Logseq for richer note-taking and link them to your repository.
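Creating entries by hand works fine, but a small helper keeps the date prefix and slug consistent; the section headings in the template below are only a suggestion:

```python
from datetime import date
from pathlib import Path

TEMPLATE = "# {title} ({day})\n\n## What I did\n\n## Observations\n\n## Decisions\n"

def new_entry(title, notebook_dir="docs/labnotebook", day=None):
    """Create a dated Markdown lab-notebook entry and return its path."""
    day = day or date.today()
    slug = title.lower().replace(" ", "-")
    path = Path(notebook_dir) / f"{day.isoformat()}_{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # never clobber an existing entry
        path.write_text(TEMPLATE.format(title=title, day=day.isoformat()))
    return path
```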
Step 8: Write a Good README
Your README.md is the front door to the project. Write it early and update it as the project evolves. It should include:
- Project title and one-paragraph summary of the scientific question.
- How to set up the environment (`conda env create --file environment.yml`).
- How to reproduce the results (which scripts to run, in what order).
- Directory structure (copy and paste the tree from Step 2 and annotate it).
- Data availability (where the data live if not in the repository).
- Authors and contact information.
- Licence.
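To start from a consistent skeleton covering the points above, you can generate the first draft programmatically; every placeholder in the template below is meant to be replaced by hand:

```python
README_SKELETON = """\
# {title}

{summary}

## Setup

    conda env create --file environment.yml
    conda activate {env_name}
    pip install -e .

## Reproducing the results

Describe which scripts to run, in what order.

## Directory structure

Paste and annotate the tree from the repository root.

## Data availability

State where the data live if they are not in this repository.

## Authors and contact

## Licence
"""

def draft_readme(title, summary, env_name):
    """Render a first-draft README.md covering the Step 8 checklist."""
    return README_SKELETON.format(title=title, summary=summary, env_name=env_name)
```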
Step 9: Commit Early, Commit Often
A good rule of thumb is to commit every meaningful unit of work: a new analysis function, a cleaned dataset, a draft of a figure. Each commit should have a short, informative message. Aim for several commits per day when you are actively working.
```bash
git add scripts/01_preprocess.py
git commit -m "Add preprocessing pipeline for eye-tracking data"
git push
```
If you are not comfortable with the Git command line, the Git panel in VS Code is an excellent GUI alternative.
Step 10: Prepare for Publication from Day One
The reason we set all of this up at the start is so that sharing is effortless when the paper is ready. At publication time, you should be able to:
1. Make the GitHub repository public (or archive it on Zenodo for a citable DOI).
2. Deposit the data on a public repository such as OpenNeuro (for BIDS neuroimaging data), OSF, Figshare, or Dryad.
3. Link everything in the paper: point readers to the code repository, the data repository, and specify the exact environment needed to reproduce the results.
If you have followed this guide, all three steps should take minutes rather than days.
Quick-Reference Checklist
When starting a new project, work through the following:
- [ ] Create a GitHub repository with README, licence, and `.gitignore`
- [ ] Clone it locally and set up the directory structure
- [ ] Create and export a conda environment
- [ ] Create `setup.py` and run `pip install -e .`
- [ ] Configure `.gitignore` (and DataLad/DVC if needed for large data)
- [ ] Write a `data/README.md` documenting your data conventions
- [ ] Start your lab notebook with a first entry
- [ ] Write a draft `README.md` with setup and reproduction instructions
- [ ] Make your first commit and push
Further Reading
- Mineault, P. J. (2021). The Good Research Code Handbook (https://goodresearch.dev/). The essential guide to writing clean, maintainable research code.
- Wilson, G. et al. (2017). Good enough practices in scientific computing. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1005510
- Gorgolewski, K. J. et al. (2016). The Brain Imaging Data Structure. Scientific Data. https://doi.org/10.1038/sdata.2016.44
- Halchenko, Y. O. et al. (2021). DataLad: distributed system for joint management of code, data, and their relationship. Journal of Open Source Software. https://doi.org/10.21105/joss.03262
- The DataLad Handbook (https://handbook.datalad.org/) for a comprehensive tutorial on data version control.
- The Software Carpentry lessons (https://software-carpentry.org/lessons/) for foundational skills in the Unix shell, Git, and Python.