New Projects
Latest revision as of 16:20, 9 April 2026
The goal of this guide is to help you set up every new project so that it is well-organized from day one, fully reproducible, and ready to be made public at publication. We follow the philosophy of the Good Research Code Handbook by Patrick Mineault, but extend it to cover the full scope of a research project: experimental data, analysis code, lab notes, figures, posters, and papers.
The core principle is simple: one project = one paper = one repository. Everything related to a project lives together, is version-controlled, and is structured so that anyone (including future you) can understand and reproduce the work.
Prerequisites: Learn the Tools
Before you begin, make sure you have a working knowledge of the following. If you don't, work through the linked tutorials first.
Git and GitHub are the foundation of everything below. Git tracks changes to your files over time; GitHub hosts your repository online and enables collaboration.
- Software Carpentry: Version Control with Git
- GitHub Skills (interactive, browser-based tutorials)
- Pro Git Book (comprehensive reference)
The command line is needed for most of the setup steps below. You don't need to be an expert, but you should be comfortable navigating directories and running commands.
Python packaging and environments are essential for making your code portable and reproducible.
- The Good Research Code Handbook (read the whole thing, it's short and very good)
- Conda documentation
Step 1: Name Your Project and Create the Repository
Pick a short, descriptive name. This name will be used for the folder, the GitHub repository, and the installable Python package, so keep it lowercase with no spaces (use hyphens or underscores if needed). For example: `pupil-dynamics`, `pursuit-acceleration`, `meg-connectivity`.
Go to GitHub and create a new repository:
- Initialize it with a README and a .gitignore (select the Python template).
- Choose a licence. For open science, we recommend MIT for code-heavy projects or CC-BY 4.0 for data/content-heavy projects.
- Clone the repository to your local machine:
```bash
cd ~/projects
git clone https://github.com/YOUR-USERNAME/project-name.git
cd project-name
```
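As a quick sanity check, the naming rule above (lowercase, no spaces, hyphens or underscores allowed) can be written down as a small helper; the function name below is just an illustration, not part of any tool:

```python
import re

def is_valid_project_name(name):
    """Check the Step 1 naming convention: lowercase letters, digits,
    hyphens, or underscores, starting with a letter."""
    return re.fullmatch(r"[a-z][a-z0-9_-]*", name) is not None
```

For example, `is_valid_project_name("pupil-dynamics")` passes, while `"Pupil Dynamics"` does not.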
Step 2: Set Up the Directory Structure
Create the following folder structure inside your repository:
```
project-name/
├── code/                # reusable Python modules (your installable package)
│   └── __init__.py
├── data/
│   ├── raw/             # raw, untouched data (never modify these files)
│   └── processed/       # cleaned or transformed data
├── docs/
│   ├── labnotebook/     # electronic lab notes (Markdown files, dated)
│   └── protocols/       # experimental protocols and SOPs
├── results/
│   ├── figures/         # publication-quality figures
│   └── intermediate/    # checkpoints, intermediate outputs
├── scripts/             # analysis scripts and notebooks
├── outputs/
│   ├── papers/          # manuscript drafts (LaTeX or Markdown source)
│   └── posters/         # poster source files
├── tests/               # unit tests for your code
├── environment.yml      # conda environment specification
├── setup.py             # makes your code pip-installable
└── README.md            # project overview and instructions
```
You can create this in one command:
```bash
mkdir -p code data/{raw,processed} docs/{labnotebook,protocols} \
    results/{figures,intermediate} scripts outputs/{papers,posters} tests
touch code/__init__.py
```
A few things to note about this structure. The code/ directory holds reusable Python modules that you import (equivalent to src/ in the Good Research Code Handbook). The scripts/ directory holds analysis scripts and Jupyter notebooks that call functions from code/. The data/raw/ directory is sacred: raw data go in, but nothing ever comes back out modified. All transformations produce new files in data/processed/.
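If you prefer to create the skeleton from Python (for example on Windows, where the brace expansion above is not available), a short standard-library script can do the same job; the folder list simply mirrors the tree above:

```python
from pathlib import Path

# Directories from the Step 2 tree
SUBDIRS = [
    "code",
    "data/raw", "data/processed",
    "docs/labnotebook", "docs/protocols",
    "results/figures", "results/intermediate",
    "scripts",
    "outputs/papers", "outputs/posters",
    "tests",
]

def create_skeleton(root="."):
    """Create the project directory structure under `root`."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # An empty __init__.py makes code/ an importable package
    (root / "code" / "__init__.py").touch()

if __name__ == "__main__":
    create_skeleton()
```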
Step 3: Set Up a Virtual Environment
Every project gets its own conda environment. This ensures that your dependencies are documented and that the project can be reproduced on any machine.
```bash
conda create --name project-name python=3.11
conda activate project-name
conda install numpy scipy matplotlib pandas seaborn jupyter
```
Export the environment specification and commit it:
```bash
conda env export > environment.yml
git add environment.yml
git commit -m "Add conda environment specification"
```
Keep environment.yml up to date as you add packages. Anyone can recreate your environment with:
```bash
conda env create --file environment.yml
```
For more details on managing conda environments (including mixing pip and conda packages), see the Good Research Code Handbook: Setup.
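If you want to check which top-level packages your environment.yml pins (for example, to keep the README in sync), a rough standard-library sketch is below. It assumes the simple layout that `conda env export` produces; a real tool should use a YAML parser such as PyYAML instead:

```python
def conda_dependency_names(yml_text):
    """Naively list top-level dependency names from environment.yml text.
    Entries exported by conda look like '  - numpy=1.26.4=py311h...'.
    Nested pip sub-lists and other YAML subtleties are ignored."""
    names, in_deps = [], False
    for line in yml_text.splitlines():
        if line.rstrip() == "dependencies:":
            in_deps = True
        elif in_deps and line.startswith("  - "):
            names.append(line[4:].strip().split("=")[0])
        elif in_deps and line and not line.startswith(" "):
            in_deps = False  # reached the next top-level key
    return names
```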
Step 4: Make Your Code Pip-Installable
This step avoids the mess of sys.path hacks and makes your modules importable from anywhere in the project. Create a minimal setup.py in the project root:
```python
from setuptools import find_packages, setup

setup(
    name='project-name',
    packages=find_packages(),
)
```
Then install your package in editable mode:
```bash
pip install -e .
```
Now you can `import code.my_module` from any script or notebook in the project without worrying about paths. If you change the code, the changes are picked up automatically. See the Good Research Code Handbook: Setup for a more detailed walkthrough.
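To make the split between code/ and scripts/ concrete: a hypothetical module such as code/metrics.py (the file name and function below are invented for illustration) holds reusable logic, and your scripts import it after the editable install:

```python
# code/metrics.py (hypothetical example module)

def mean_abs_velocity(positions, dt):
    """Mean absolute velocity of a 1-D position trace sampled every dt seconds."""
    diffs = [b - a for a, b in zip(positions, positions[1:])]
    return sum(abs(v) for v in diffs) / (len(diffs) * dt)
```

A script in scripts/ would then simply start with `from code.metrics import mean_abs_velocity`, with no path manipulation.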
Step 5: Configure .gitignore and Data Tracking
Not everything belongs in Git. Add the following to your .gitignore:
```
# Data (tracked separately, see below)
data/
results/intermediate/

# Python
*.egg-info/
__pycache__/
*.pyc
.ipynb_checkpoints/

# OS files
.DS_Store
Thumbs.db

# Environment
.env
```
For large or sensitive data, Git is not the right tool. You have two main options depending on your situation:
- DataLad (recommended for neuroscience): built on Git and git-annex, it version-controls arbitrarily large files and integrates with BIDS and OpenNeuro. See the DataLad Handbook for a thorough tutorial.
- DVC (Data Version Control): a lightweight alternative that works well with remote storage backends (S3, Google Drive, etc.).
If your data are small enough to share directly (< 50 MB of text-based files, e.g., behavioural CSVs), you can keep them in the Git repository and remove data/ from .gitignore.
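For a quick sense of which files the patterns above will exclude, here is a deliberately simplified matcher. It handles only bare directory patterns and filename globs; the authoritative answer always comes from `git check-ignore -v <path>`:

```python
from fnmatch import fnmatch

def is_ignored(path, patterns):
    """Very rough .gitignore-style check: patterns ending in '/' match a
    directory anywhere in the path, everything else is matched as a glob
    against the file name. Not a substitute for git's own matching."""
    for pat in patterns:
        if pat.endswith("/"):
            if ("/" + path).find("/" + pat) != -1:
                return True
        elif fnmatch(path.rsplit("/", 1)[-1], pat):
            return True
    return False
```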
Step 6: Organize Your Data
Follow community standards for data organization wherever possible. For neuroimaging data (MRI, EEG, MEG), use the Brain Imaging Data Structure (BIDS). For behavioural and psychophysics data, adopt a consistent naming convention:
```
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_beh.csv
data/raw/sub-01/ses-01/sub-01_ses-01_task-pursuit_eyetrack.edf
```
The key principles are: use subject and session identifiers consistently, include the task name, separate metadata from data, and never modify raw files. Write a data/README.md that documents the naming convention, variable definitions, and any relevant acquisition parameters.
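Because the key-value naming scheme is regular, analysis scripts can parse filenames instead of hard-coding paths. A minimal parser is sketched below (illustrative only; for real BIDS datasets, use a dedicated library such as pybids):

```python
def parse_entities(filename):
    """Split a name like 'sub-01_ses-01_task-pursuit_beh.csv' into its
    key-value entities plus the trailing suffix and extension."""
    stem, _, ext = filename.rpartition(".")
    entities = {"extension": ext}
    for part in stem.split("_"):
        if "-" in part:
            key, _, value = part.partition("-")
            entities[key] = value
        else:
            entities["suffix"] = part  # e.g. 'beh' or 'eyetrack'
    return entities
```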
Step 7: Keep an Electronic Lab Notebook
Use the docs/labnotebook/ directory for dated Markdown entries. A simple naming convention works well:
```
docs/labnotebook/2026-03-30_pilot-data-collection.md
docs/labnotebook/2026-04-02_initial-analysis.md
```
Each entry should briefly note what you did, what you observed, any decisions you made, and links to relevant scripts or results. These notes are version-controlled along with everything else and provide a timestamped record of the project's evolution.
You can also use tools like Obsidian or Logseq for richer note-taking and link them to your repository.
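Creating entries by hand works fine, but a small helper keeps the date prefix and slug consistent; the section headings in the template below are only a suggestion:

```python
from datetime import date
from pathlib import Path

TEMPLATE = "# {title} ({day})\n\n## What I did\n\n## Observations\n\n## Decisions\n"

def new_entry(title, notebook_dir="docs/labnotebook", day=None):
    """Create a dated Markdown lab-notebook entry and return its path."""
    day = day or date.today()
    slug = title.lower().replace(" ", "-")
    path = Path(notebook_dir) / f"{day.isoformat()}_{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # never clobber an existing entry
        path.write_text(TEMPLATE.format(title=title, day=day.isoformat()))
    return path
```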
Step 8: Write a Good README
Your README.md is the front door to the project. Write it early and update it as the project evolves. It should include:
- Project title and one-paragraph summary of the scientific question.
- How to set up the environment (`conda env create --file environment.yml`).
- How to reproduce the results (which scripts to run, in what order).
- Directory structure (copy and paste the tree from Step 2 and annotate it).
- Data availability (where the data live if not in the repository).
- Authors and contact information.
- Licence.
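To start from a consistent skeleton covering the points above, you can generate the first draft programmatically; every placeholder in the template below is meant to be replaced by hand:

```python
README_SKELETON = """\
# {title}

{summary}

## Setup

    conda env create --file environment.yml
    conda activate {env_name}
    pip install -e .

## Reproducing the results

Describe which scripts to run, in what order.

## Directory structure

Paste and annotate the tree from the repository root.

## Data availability

State where the data live if they are not in this repository.

## Authors and contact

## Licence
"""

def draft_readme(title, summary, env_name):
    """Render a first-draft README.md covering the Step 8 checklist."""
    return README_SKELETON.format(title=title, summary=summary, env_name=env_name)
```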
Step 9: Commit Early, Commit Often
A good rule of thumb is to commit every meaningful unit of work: a new analysis function, a cleaned dataset, a draft of a figure. Each commit should have a short, informative message. Aim for several commits per day when you are actively working.
```bash
git add scripts/01_preprocess.py
git commit -m "Add preprocessing pipeline for eye-tracking data"
git push
```
If you are not comfortable with the Git command line, the Git panel in VS Code is an excellent GUI alternative.
Step 10: Prepare for Publication from Day One
The reason we set all of this up at the start is so that sharing is effortless when the paper is ready. At publication time, you should be able to:
1. Make the GitHub repository public (or archive it on Zenodo for a citable DOI).
2. Deposit the data on a public repository such as OpenNeuro (for BIDS neuroimaging data), OSF, Figshare, or Dryad.
3. Link everything in the paper: point readers to the code repository, the data repository, and specify the exact environment needed to reproduce the results.
If you have followed this guide, all three steps should take minutes rather than days.
Quick-Reference Checklist
When starting a new project, work through the following:
- [ ] Create a GitHub repository with README, licence, and `.gitignore`
- [ ] Clone it locally and set up the directory structure
- [ ] Create and export a conda environment
- [ ] Create `setup.py` and run `pip install -e .`
- [ ] Configure `.gitignore` (and DataLad/DVC if needed for large data)
- [ ] Write a `data/README.md` documenting your data conventions
- [ ] Start your lab notebook with a first entry
- [ ] Write a draft `README.md` with setup and reproduction instructions
- [ ] Make your first commit and push
Further Reading
- Mineault, P. J. (2021). The Good Research Code Handbook (https://goodresearch.dev/). The essential guide to writing clean, maintainable research code.
- Wilson, G. et al. (2017). Good enough practices in scientific computing. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1005510
- Gorgolewski, K. J. et al. (2016). The Brain Imaging Data Structure. Scientific Data. https://doi.org/10.1038/sdata.2016.44
- Halchenko, Y. O. et al. (2021). DataLad: distributed system for joint management of code, data, and their relationship. Journal of Open Source Software. https://doi.org/10.21105/joss.03262
- The DataLad Handbook (https://handbook.datalad.org/) for a comprehensive tutorial on data version control.
- The Software Carpentry lessons (https://software-carpentry.org/lessons/) for foundational skills in the Unix shell, Git, and Python.