# New Projects
The goal of this guide is to help you set up every new project so that it is well-organized from day one, fully reproducible, and ready to be made public at publication. We follow the philosophy of the Good Research Code Handbook by Patrick Mineault, but extend it to cover the full scope of a research project: experimental data, analysis code, lab notes, figures, posters, and papers.
The core principle is simple: one project = one paper = one repository. Everything related to a project lives together, is version-controlled, and is structured so that anyone (including future you) can understand and reproduce the work.
## Prerequisites: Learn the Tools
Before you begin, make sure you have a working knowledge of the following. If you don't, work through the linked tutorials first.
**Git and GitHub** are the foundation of everything below. Git tracks changes to your files over time; GitHub hosts your repository online and enables collaboration.
- Software Carpentry: Version Control with Git
- GitHub Skills (interactive, browser-based tutorials)
- Pro Git Book (comprehensive reference)
**The command line** is needed for most of the setup steps below. You don't need to be an expert, but you should be comfortable navigating directories and running commands.
**Python packaging and environments** are essential for making your code portable and reproducible.
- The Good Research Code Handbook (read the whole thing, it's short and very good)
- Conda documentation
## Step 1: Name Your Project and Create the Repository
Pick a short, descriptive name. This name will be used for the folder, the GitHub repository, and the installable Python package, so keep it lowercase with no spaces (use hyphens or underscores if needed). For example: `pupil-dynamics`, `pursuit-acceleration`, `meg-connectivity`.
Go to GitHub and create a new repository:
- Initialize it with a README and a `.gitignore` (select the Python template).
- Choose a licence. For open science, we recommend MIT for code-heavy projects or CC-BY 4.0 for data/content-heavy projects.
- Clone the repository to your local machine:
```bash
cd ~/projects
git clone https://github.com/YOUR-USERNAME/project-name.git
cd project-name
```
## Step 2: Set Up the Directory Structure
Create the following folder structure inside your repository:
```
project-name/
├── code/                  # reusable Python modules (your installable package)
│   └── __init__.py
├── data/
│   ├── raw/               # raw, untouched data (never modify these files)
│   └── processed/         # cleaned or transformed data
├── docs/
│   ├── labnotebook/       # electronic lab notes (Markdown files, dated)
│   └── protocols/         # experimental protocols and SOPs
├── results/
│   ├── figures/           # publication-quality figures
│   └── intermediate/      # checkpoints, intermediate outputs
├── scripts/               # analysis scripts and notebooks
├── outputs/
│   ├── papers/            # manuscript drafts (LaTeX or Markdown source)
│   └── posters/           # poster source files
├── tests/                 # unit tests for your code
├── environment.yml        # conda environment specification
├── setup.py               # makes your code pip-installable
└── README.md              # project overview and instructions
```
You can create this structure with two commands:

```bash
mkdir -p code data/{raw,processed} docs/{labnotebook,protocols} \
    results/{figures,intermediate} scripts outputs/{papers,posters} tests
touch code/__init__.py
```
A few things to note about this structure. The code/ directory holds reusable Python modules that you import (equivalent to src/ in the Good Research Code Handbook). The scripts/ directory holds analysis scripts and Jupyter notebooks that call functions from code/. The data/raw/ directory is sacred: raw data go in, but nothing ever comes back out modified. All transformations produce new files in data/processed/.
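The raw-data discipline above can be sketched in code. In this hypothetical example (the file name, column name, and function are invented for illustration), a cleaning step reads a file from `data/raw/` and writes a *new* file to `data/processed/`, leaving the raw file untouched:

```python
# A minimal sketch of the raw -> processed discipline: a cleaning step reads
# a file from data/raw/ and writes a *new* file under data/processed/.
# The file name and 'rt' column are invented for illustration.
import csv
from pathlib import Path

def clean_trials(raw_file: Path, processed_dir: Path) -> Path:
    """Drop trials with a missing reaction time and write the result to a new CSV."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    out_file = processed_dir / raw_file.name.replace(".csv", "_clean.csv")
    with raw_file.open(newline="") as src, out_file.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["rt"]:  # keep only trials with a recorded reaction time
                writer.writerow(row)
    return out_file
```

The point is the pattern, not the particular cleaning rule: every transformation has a raw input and a distinct processed output, so the provenance of any file is always reconstructible.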
## Step 3: Set Up a Virtual Environment
Every project gets its own conda environment. This ensures that your dependencies are documented and that the project can be reproduced on any machine.
```bash
conda create --name project-name python=3.11
conda activate project-name
conda install numpy scipy matplotlib pandas seaborn jupyter
```
Export the environment specification and commit it:
```bash
conda env export > environment.yml
git add environment.yml
git commit -m "Add conda environment specification"
```
Keep environment.yml up to date as you add packages. Anyone can recreate your environment with:
```bash
conda env create --file environment.yml
```
For more details on managing conda environments (including mixing pip and conda packages), see the Good Research Code Handbook: Setup.
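Note that `conda env export` records every package with exact version pins, which is precise but can be platform-specific. A hand-curated `environment.yml`, like the sketch below (package list taken from the install command above), is often more portable across machines:

```yaml
name: project-name
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - scipy
  - matplotlib
  - pandas
  - seaborn
  - jupyter
```

Either style works with `conda env create --file environment.yml`; pick one and keep it current.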
## Step 4: Make Your Code Pip-Installable
This step avoids the mess of `sys.path` hacks and makes your modules importable from anywhere in the project. Create a minimal `setup.py` in the project root:

```python
from setuptools import find_packages, setup

setup(
    name='project-name',
    packages=find_packages(),
)
```
Then install your package in editable mode:
```bash
pip install -e .
```
Now you can `import code.my_module` from any script or notebook in the project without worrying about paths. If you change the code, the changes are picked up automatically. See the Good Research Code Handbook: Setup for a more detailed walkthrough.
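As a concrete sketch, suppose `code/` contains a module `stats.py` (the module and function names here are invented for illustration) holding a small analysis helper:

```python
# code/stats.py: a hypothetical helper module living inside your installable
# `code` package. The module and function names are invented for illustration.
from statistics import mean, stdev

def zscore(values):
    """Return a z-scored copy of a sequence, using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```

After `pip install -e .`, any script or notebook in the project can simply write `from code.stats import zscore`, with no `sys.path` manipulation and regardless of its own location in the tree.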
## Step 5: Configure .gitignore and Data Tracking
Not everything belongs in Git. Add the following to your `.gitignore`:

```
# Data (tracked separately, see below)
data/
results/intermediate/

# Python
*.egg-info/
__pycache__/
*.pyc
.ipynb_checkpoints/

# OS files
.DS_Store
Thumbs.db

# Environment
.env
```
For large or sensitive data, Git is not the right tool. You have two main options depending on your situation:
- DataLad (recommended for neuroscience): built on Git and git-annex, it version-controls arbitrarily large files and integrates with BIDS and OpenNeuro. See the DataLad Handbook for a thorough tutorial.
- DVC (Data Version Control): a lightweight alternative that works well with remote storage backends (S3, Google Drive, etc.).
If your data are small enough to share directly (< 50 MB of text-based files, e.g., behavioural CSVs), you can keep them in the Git repository and remove data/ from .gitignore.
## Step 6: Organize Your Data
Follow community standards for data organization wherever possible. For neuroimaging data (MRI, EEG, MEG), use the Brain Imaging Data Structure (BIDS). For behavioural and psychophysics data, adopt a consistent naming convention:
```
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_beh.csv
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_eyetrack.edf
```
The key principles are: use subject and session identifiers consistently, include the task name, separate metadata from data, and never modify raw files. Write a `data/README.md` that documents the naming convention, variable definitions, and any relevant acquisition parameters.
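One payoff of a consistent naming scheme is that file names become machine-readable. The hypothetical helper below (not part of any official BIDS library, just an illustration of the convention) splits a BIDS-style key-value name into its entities:

```python
# A hypothetical helper that splits a BIDS-style key-value file name such as
# "sub-01_ses-01_task-pursuit_beh.csv" into a dictionary of entities.
# Illustration only; real projects may prefer a dedicated BIDS library.
from pathlib import Path

def parse_entities(filename):
    """Return {'sub': '01', 'ses': '01', ..., 'suffix': 'beh'} for a BIDS-like name."""
    stem = Path(filename).stem          # drop the extension
    *pairs, suffix = stem.split("_")    # the last chunk is the suffix (e.g. 'beh')
    entities = dict(pair.split("-", 1) for pair in pairs)
    entities["suffix"] = suffix
    return entities
```

With names like these, selecting all files for one subject or one task is a one-line filter rather than a manual bookkeeping exercise.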
## Step 7: Keep an Electronic Lab Notebook
Use the `docs/labnotebook/` directory for dated Markdown entries. A simple naming convention works well:
```
docs/labnotebook/2026-03-30_pilot-data-collection.md
docs/labnotebook/2026-04-02_initial-analysis.md
```
Each entry should briefly note what you did, what you observed, any decisions you made, and links to relevant scripts or results. These notes are version-controlled along with everything else and provide a timestamped record of the project's evolution.
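A lightweight entry template (the section names are just a suggestion; adapt them freely) might look like:

```markdown
# 2026-03-30: Pilot data collection

## What I did
(brief description of the session or analysis)

## Observations
(anything unexpected, data quality notes)

## Decisions
(parameter choices, exclusions, changes of plan, and why)

## Links
- `scripts/01_preprocess.py`
- `results/intermediate/`
```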
You can also use tools like [Obsidian](https://obsidian.md/) or [Logseq](https://logseq.com/) for richer note-taking and link them to your repository.
## Step 8: Write a Good README
Your `README.md` is the front door to the project. Write it early and update it as the project evolves. It should include:
- **Project title and one-paragraph summary** of the scientific question.
- **How to set up the environment** (`conda env create --file environment.yml`).
- **How to reproduce the results** (which scripts to run, in what order).
- **Directory structure** (copy and paste the tree from Step 2 and annotate it).
- **Data availability** (where the data live if not in the repository).
- **Authors and contact information**.
- **Licence**.
## Step 9: Commit Early, Commit Often
A good rule of thumb is to commit every meaningful unit of work: a new analysis function, a cleaned dataset, a draft of a figure. Each commit should have a short, informative message. Aim for several commits per day when you are actively working.
```bash
git add scripts/01_preprocess.py
git commit -m "Add preprocessing pipeline for eye-tracking data"
git push
```
If you are not comfortable with the Git command line, the [Git panel in VS Code](https://code.visualstudio.com/docs/sourcecontrol/overview) is an excellent GUI alternative.
## Step 10: Prepare for Publication from Day One
The reason we set all of this up at the start is so that sharing is effortless when the paper is ready. At publication time, you should be able to:
1. **Make the GitHub repository public** (or archive it on [Zenodo](https://zenodo.org/) for a citable DOI).
2. **Deposit the data** on a public repository such as [OpenNeuro](https://openneuro.org/) (for BIDS neuroimaging data), [OSF](https://osf.io/), [Figshare](https://figshare.com/), or [Dryad](https://datadryad.org/).
3. **Link everything in the paper**: point readers to the code repository and the data repository, and specify the exact environment needed to reproduce the results.
If you have followed this guide, all three steps should take minutes rather than days.
## Quick-Reference Checklist
When starting a new project, work through the following:
- [ ] Create a GitHub repository with README, licence, and `.gitignore`
- [ ] Clone it locally and set up the directory structure
- [ ] Create and export a conda environment
- [ ] Create `setup.py` and `pip install -e .`
- [ ] Configure `.gitignore` (and DataLad/DVC if needed for large data)
- [ ] Write a `data/README.md` documenting your data conventions
- [ ] Start your lab notebook with a first entry
- [ ] Write a draft `README.md` with setup and reproduction instructions
- [ ] Make your first commit and push
## Further Reading
- Mineault, P. J. (2021). [The Good Research Code Handbook](https://goodresearch.dev/). The essential guide to writing clean, maintainable research code.
- Wilson, G. et al. (2017). [Good enough practices in scientific computing](https://doi.org/10.1371/journal.pcbi.1005510). *PLOS Computational Biology*.
- Gorgolewski, K. J. et al. (2016). [The Brain Imaging Data Structure](https://doi.org/10.1038/sdata.2016.44). *Scientific Data*.
- Halchenko, Y. O. et al. (2021). [DataLad: distributed system for joint management of code, data, and their relationship](https://doi.org/10.21105/joss.03262). *Journal of Open Source Software*.
- The [DataLad Handbook](https://handbook.datalad.org/) for a comprehensive tutorial on data version control.
- The [Software Carpentry](https://software-carpentry.org/lessons/) lessons for foundational skills in the Unix shell, Git, and Python.