New Projects


Latest revision as of 16:20, 9 April 2026

The goal of this guide is to help you set up every new project so that it is well-organized from day one, fully reproducible, and ready to be made public at publication. We follow the philosophy of the [https://goodresearch.dev/ Good Research Code Handbook] by Patrick Mineault, but extend it to cover the full scope of a research project: experimental data, analysis code, lab notes, figures, posters, and papers.

The core principle is simple: '''one project = one paper = one repository'''. Everything related to a project lives together, is version-controlled, and is structured so that anyone (including future you) can understand and reproduce the work.


----
== Prerequisites: Learn the Tools ==

Before you begin, make sure you have a working knowledge of the following. If you don't, work through the linked tutorials first.

* '''Git and GitHub''' are the foundation of everything below. Git tracks changes to your files over time; GitHub hosts your repository online and enables collaboration.
* '''The command line''' is needed for most of the setup steps below. You don't need to be an expert, but you should be comfortable navigating directories and running commands.
* '''Python packaging and environments''' are essential for making your code portable and reproducible (see the [https://docs.conda.io/en/latest/ Conda documentation]).


----
== Step 1: Name Your Project and Create the Repository ==

Pick a short, descriptive name. This name will be used for the folder, the GitHub repository, and the installable Python package, so keep it lowercase with no spaces (use hyphens or underscores if needed). For example: ''pupil-dynamics'', ''pursuit-acceleration'', ''meg-connectivity''.
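As a quick sanity check, the naming rule (lowercase, no spaces, hyphens or underscores as separators) can be written as a short Python predicate. This is an illustrative sketch, not part of the setup itself:

```python
import re

# Lowercase words of letters/digits, joined by single hyphens or
# underscores -- the naming rule described above.
_NAME_RE = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*")

def valid_project_name(name: str) -> bool:
    """Return True if `name` follows the project-naming convention."""
    return _NAME_RE.fullmatch(name) is not None
```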

Go to GitHub and create a new repository:

* Initialize it with a README and a ''.gitignore'' (select the Python template).
* Choose a licence. For open science, we recommend MIT for code-heavy projects or CC-BY 4.0 for data/content-heavy projects.
* Clone the repository to your local machine:

<blockquote><code>
cd ~/projects<br>
git clone https://github.com/YOUR-USERNAME/project-name.git<br>
cd project-name
</code></blockquote>


----
== Step 2: Set Up the Directory Structure ==

Create the following folder structure inside your repository:

<pre>
project-name/
├── code/                # reusable Python modules (your installable package)
│   └── __init__.py
├── data/
│   ├── raw/             # raw, untouched data (never modify these files)
│   └── processed/       # cleaned or transformed data
├── docs/
│   ├── labnotebook/     # electronic lab notes (Markdown files, dated)
│   └── protocols/       # experimental protocols and SOPs
├── results/
│   ├── figures/         # publication-quality figures
│   └── intermediate/    # checkpoints, intermediate outputs
├── scripts/             # analysis scripts and notebooks
├── outputs/
│   ├── papers/          # manuscript drafts (LaTeX or Markdown source)
│   └── posters/         # poster source files
├── tests/               # unit tests for your code
├── environment.yml      # conda environment specification
├── setup.py             # makes your code pip-installable
└── README.md            # project overview and instructions
</pre>

You can create this in one command:

<blockquote><code>
mkdir -p code data/{raw,processed} docs/{labnotebook,protocols} \<br>
results/{figures,intermediate} scripts outputs/{papers,posters} tests<br>
touch code/__init__.py
</code></blockquote>

A few things to note about this structure. The ''code/'' directory holds reusable Python modules that you import (equivalent to ''src/'' in the Good Research Code Handbook). The ''scripts/'' directory holds analysis scripts and Jupyter notebooks that call functions from ''code/''. The ''data/raw/'' directory is sacred: raw data go in, but nothing ever comes back out modified. All transformations produce new files in ''data/processed/''.
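If you would rather script the scaffold than type the shell command, the same tree can be created from Python with only the standard library. A sketch; the folder list mirrors the tree above:

```python
from pathlib import Path

# Folders from the directory tree above; edit to taste.
FOLDERS = [
    "code",
    "data/raw", "data/processed",
    "docs/labnotebook", "docs/protocols",
    "results/figures", "results/intermediate",
    "scripts",
    "outputs/papers", "outputs/posters",
    "tests",
]

def scaffold(root: str) -> None:
    """Create the project directory tree under `root`."""
    root_path = Path(root)
    for folder in FOLDERS:
        (root_path / folder).mkdir(parents=True, exist_ok=True)
    # An empty __init__.py makes code/ an importable package.
    (root_path / "code" / "__init__.py").touch()
```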


----
== Step 3: Set Up a Virtual Environment ==

Every project gets its own conda environment. This ensures that your dependencies are documented and that the project can be reproduced on any machine.

<blockquote><code>
conda create --name project-name python=3.11<br>
conda activate project-name<br>
conda install numpy scipy matplotlib pandas seaborn jupyter
</code></blockquote>

Export the environment specification and commit it:

<blockquote><code>
conda env export > environment.yml<br>
git add environment.yml<br>
git commit -m "Add conda environment specification"
</code></blockquote>

Keep ''environment.yml'' up to date as you add packages. Anyone can recreate your environment with:

<blockquote><code>
conda env create --file environment.yml
</code></blockquote>

For more details on managing conda environments (including mixing pip and conda packages), see the [https://goodresearch.dev/setup Good Research Code Handbook: Setup].


----
== Step 4: Make Your Code Pip-Installable ==

This step avoids the mess of ''sys.path'' hacks and makes your modules importable from anywhere in the project. Create a minimal ''setup.py'' in the project root:

<pre>
from setuptools import find_packages, setup

setup(
    name='project-name',
    packages=find_packages(),
)
</pre>

Then install your package in editable mode:

<blockquote><code>
pip install -e .
</code></blockquote>

Now you can ''import code.my_module'' from any script or notebook in the project without worrying about paths. If you change the code, the changes are picked up automatically. (One caveat: a package named ''code'' shadows Python's built-in ''code'' module, so some people prefer to name the package after the project instead.) See the [https://goodresearch.dev/setup#install-a-project-package Good Research Code Handbook: Setup] for a more detailed walkthrough.


----
== Step 5: Configure ''.gitignore'' and Data Tracking ==

Not everything belongs in Git. Add the following to your ''.gitignore'':

<blockquote><code>
<nowiki>#</nowiki> Data (tracked separately, see below)<br>
data/<br>
results/intermediate/<br>
<br>
<nowiki>#</nowiki> Python<br>
<nowiki>*</nowiki>.egg-info/<br>
__pycache__/<br>
<nowiki>*</nowiki>.pyc<br>
.ipynb_checkpoints/<br>
<br>
<nowiki>#</nowiki> OS files<br>
.DS_Store<br>
Thumbs.db<br>
<br>
<nowiki>#</nowiki> Environment<br>
.env
</code></blockquote>

For large or sensitive data, Git is not the right tool. You have two main options depending on your situation:

* '''DataLad''' (recommended for neuroscience): built on Git and git-annex, it version-controls arbitrarily large files and integrates with BIDS and OpenNeuro. See the [https://handbook.datalad.org/ DataLad Handbook] for a thorough tutorial.
* '''DVC''' (Data Version Control): a lightweight alternative that works well with remote storage backends (S3, Google Drive, etc.).

If your data are small enough to share directly (< 50 MB of text-based files, e.g., behavioural CSVs), you can keep them in the Git repository and remove ''data/'' from ''.gitignore''.
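To check whether your data fall under that threshold, a quick size tally does the job. A hypothetical helper, standard library only:

```python
from pathlib import Path

def data_size_mb(data_dir: str = "data") -> float:
    """Total size of all files under data_dir, in megabytes (10^6 bytes)."""
    files = (p for p in Path(data_dir).rglob("*") if p.is_file())
    return sum(p.stat().st_size for p in files) / 1e6
```

If ''data_size_mb("data")'' comes back well under 50, committing the files directly is the simplest option.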


----
== Step 6: Organize Your Data ==

Follow community standards for data organization wherever possible. For neuroimaging data (MRI, EEG, MEG), use the [https://bids.neuroimaging.io/ Brain Imaging Data Structure (BIDS)]. For behavioural and psychophysics data, adopt a consistent naming convention:

<blockquote><code>
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_beh.csv<br>
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_eyetrack.edf
</code></blockquote>

The key principles are: use subject and session identifiers consistently, include the task name, separate metadata from data, and never modify raw files. Write a ''data/README.md'' that documents the naming convention, variable definitions, and any relevant acquisition parameters.
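Because the convention is machine-readable, analysis scripts can recover the identifiers from a filename instead of hard-coding paths. A sketch; the regex covers only the sub/ses/task pattern shown above:

```python
import re

# BIDS-style key-value entities in a filename, e.g.
# sub-01_ses-01_task-pursuit_beh.csv
FILENAME_RE = re.compile(
    r"sub-(?P<subject>[^_/]+)_ses-(?P<session>[^_/]+)"
    r"_task-(?P<task>[^_/]+)_(?P<suffix>\w+)\.(?P<ext>\w+)$"
)

def parse_filename(name: str) -> dict:
    """Return the entities encoded in a filename or path, or raise ValueError."""
    match = FILENAME_RE.search(name)
    if match is None:
        raise ValueError(f"not a recognised data filename: {name}")
    return match.groupdict()
```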


----
== Step 7: Keep an Electronic Lab Notebook ==

Use the ''docs/labnotebook/'' directory for dated Markdown entries. A simple naming convention works well:

<blockquote><code>
docs/labnotebook/2026-03-30_pilot-data-collection.md<br>
docs/labnotebook/2026-04-02_initial-analysis.md
</code></blockquote>

Each entry should briefly note what you did, what you observed, any decisions you made, and links to relevant scripts or results. These notes are version-controlled along with everything else and provide a timestamped record of the project's evolution.
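Creating entries by hand is fine, but a small helper keeps the naming consistent. This is a hypothetical convenience, with section headings you may want to adapt:

```python
from datetime import date
from pathlib import Path

# Skeleton matching the suggested entry contents; adjust as you like.
TEMPLATE = """# {title}

## What I did

## What I observed

## Decisions

## Links
"""

def new_entry(title: str, notebook_dir: str = "docs/labnotebook") -> Path:
    """Create a dated, empty lab-notebook entry and return its path."""
    slug = title.lower().replace(" ", "-")
    path = Path(notebook_dir) / f"{date.today():%Y-%m-%d}_{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(TEMPLATE.format(title=title), encoding="utf-8")
    return path
```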

You can also use tools like [https://obsidian.md/ Obsidian] or [https://logseq.com/ Logseq] for richer note-taking and link them to your repository.


----
== Step 8: Write a Good README ==

Your ''README.md'' is the front door to the project. Write it early and update it as the project evolves. It should include:

* '''Project title and one-paragraph summary''' of the scientific question.
* '''How to set up the environment''' (''conda env create --file environment.yml'').
* '''How to reproduce the results''' (which scripts to run, in what order).
* '''Directory structure''' (copy and paste the tree from Step 2 and annotate it).
* '''Data availability''' (where the data live if not in the repository).
* '''Authors and contact information'''.
* '''Licence'''.
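If you like templates, a first draft covering those items can be stamped out programmatically. The skeleton below is a hypothetical starting point, not a prescribed format:

```python
# Hypothetical README skeleton covering the items listed above.
README_SKELETON = """\
# {title}

{summary}

## Setup

    conda env create --file environment.yml
    conda activate {env_name}
    pip install -e .

## Reproducing the results

## Directory structure

## Data availability

## Authors and contact

## Licence
"""

def draft_readme(title: str, summary: str, env_name: str) -> str:
    """Return a first-draft README.md as a string."""
    return README_SKELETON.format(title=title, summary=summary, env_name=env_name)
```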

----
== Step 9: Commit Early, Commit Often ==

A good rule of thumb is to commit every meaningful unit of work: a new analysis function, a cleaned dataset, a draft of a figure. Each commit should have a short, informative message. Aim for several commits per day when you are actively working.

<blockquote><code>
git add scripts/01_preprocess.py<br>
git commit -m "Add preprocessing pipeline for eye-tracking data"<br>
git push
</code></blockquote>

If you are not comfortable with the Git command line, the [https://code.visualstudio.com/docs/sourcecontrol/overview Git panel in VS Code] is an excellent GUI alternative.


----
== Step 10: Prepare for Publication from Day One ==

The reason we set all of this up at the start is so that sharing is effortless when the paper is ready. At publication time, you should be able to:

1. '''Make the GitHub repository public''' (or archive it on [https://zenodo.org/ Zenodo] for a citable DOI).
2. '''Deposit the data''' on a public repository such as [https://openneuro.org/ OpenNeuro] (for BIDS neuroimaging data), [https://osf.io/ OSF], [https://figshare.com/ Figshare], or [https://datadryad.org/ Dryad].
3. '''Link everything in the paper''': point readers to the code repository, the data repository, and specify the exact environment needed to reproduce the results.

If you have followed this guide, all three steps should take minutes rather than days.


----
== Quick-Reference Checklist ==

When starting a new project, work through the following:

* [ ] Create a GitHub repository with README, licence, and ''.gitignore''
* [ ] Clone it locally and set up the directory structure
* [ ] Create and export a conda environment
* [ ] Create ''setup.py'' and ''pip install -e .''
* [ ] Configure ''.gitignore'' (and DataLad/DVC if needed for large data)
* [ ] Write a ''data/README.md'' documenting your data conventions
* [ ] Start your lab notebook with a first entry
* [ ] Write a draft ''README.md'' with setup and reproduction instructions
* [ ] Make your first commit and push

----
== Further Reading ==

* Mineault, P. J. (2021). [https://goodresearch.dev/ The Good Research Code Handbook]. The essential guide to writing clean, maintainable research code.
* Wilson, G. et al. (2017). [https://doi.org/10.1371/journal.pcbi.1005510 Good enough practices in scientific computing]. ''PLOS Computational Biology''.
* Gorgolewski, K. J. et al. (2016). [https://doi.org/10.1038/sdata.2016.44 The Brain Imaging Data Structure]. ''Scientific Data''.
* Halchenko, Y. O. et al. (2021). [https://doi.org/10.21105/joss.03262 DataLad: distributed system for joint management of code, data, and their relationship]. ''Journal of Open Source Software''.
* The [https://handbook.datalad.org/ DataLad Handbook] for a comprehensive tutorial on data version control.
* The [https://software-carpentry.org/lessons/ Software Carpentry] lessons for foundational skills in the Unix shell, Git, and Python.