BAA1028 - Workflow & Data Management
https://damien-dupre.github.io/BAA1028/lecture_3
Note: Your username will become extremely important in the future; firstname-name is usually a good choice
Follow the steps below to create a repository:
In the upper-right corner of any page, open the + drop-down menu and select New repository.
In the Repository name box, type hello-world.
In the Description box, type a short description, e.g. My first repository on GitHub.
Choose a PUBLIC repository visibility. For more information, see About repositories.
Tick ✅ Add a README file.
Click Create repository.
In GitHub, a commit is a saved change to a project’s source code or other files. When you make changes to a file in a GitHub repository, you create a new version of that file.
A commit contains a snapshot of the changes you’ve made to one or more files, along with a message that describes the changes. This message should be descriptive and clear, so that other developers can understand what changes you’ve made and why.
When you created your new repository, you initialized it with a README file. README files are a great place to describe your project in more detail, or add some documentation such as how to install or use your project. The contents of your README file are automatically shown on the front page of your repository.
Follow the steps below to commit a change to the README file.
Below the commit message fields, decide whether to add your commit to the current branch or to a new branch. Select Commit directly to the main branch for now.
Click Commit changes.
⚠️ Warning: For collaborative projects, never commit directly to the main branch
On your repository page in GitHub, click Add file, then Upload files.
Drop or choose all the files needed for your transformed template website in the main box.
Commit your changes.
GitHub Pages is a web hosting service offered by GitHub that allows you to host static websites directly from a GitHub repository. This means you can use GitHub to store and version control your website’s code, and then host it for free using GitHub Pages.
Your website will then be published at a URL based on your GitHub username and repository name (e.g., username.github.io/repository).
Turn on GitHub Pages for your project repository:
Go to Settings and find Pages on the left pane,
In Branch, instead of None select main and click Save,
Click on Actions and wait until “pages build and deployment” finishes,
When it’s done, go to https://username.github.io/repository/nameofyourfile.html.
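The URL pattern above can be written down as a small helper. This is only a sketch: the function name pages_url and the sample username, repository, and file name are hypothetical placeholders, not values from the lecture.

```python
def pages_url(username: str, repository: str, filename: str = "") -> str:
    """Build the address of a project site published with GitHub Pages."""
    # Pattern from the slide: https://username.github.io/repository/nameofyourfile.html
    return f"https://{username}.github.io/{repository}/{filename}"


# Hypothetical account and repository names, for illustration only
print(pages_url("firstname-name", "hello-world", "index.html"))
```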
Why do we care about project management?
The ability to move the project without breaking the code or needing to adapt it
The ability to rerun the entire process from scratch
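In your code, do not use os.chdir('/path/to/your/directory').
Prefer the pyprojroot library, which can be installed with:
# pip
python -m pip install pyprojroot
# conda
conda install -c conda-forge pyprojroot
Then reference files relative to the project root with its here() function.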
What’s wrong with os.chdir('/path/to/your/directory')?
It will only ever work for the user creating the file
It is not portable
Increases likelihood that work from other processes leaks into current work
The pyprojroot library:
If all files are contained in the project folder, reference files with the here() function from the pyprojroot library
Creates paths relative to the project root
Allows several ways to indicate the project root folder
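To make the idea concrete, here is a minimal standard-library sketch of what "finding the project root" means. It is not pyprojroot's implementation: the function find_project_root, the marker names in ROOT_MARKERS, and the file data/survey.csv are assumptions chosen for illustration. With pyprojroot installed, a call such as here("data/survey.csv") plays the role of the last step.

```python
from pathlib import Path

# Hypothetical marker files: a folder containing one of them counts as the root.
ROOT_MARKERS = (".git", "pyproject.toml", ".here")


def find_project_root(start: Path) -> Path:
    """Walk upwards from `start` until a folder containing a marker is found."""
    start = start.resolve()
    for folder in (start, *start.parents):
        if any((folder / marker).exists() for marker in ROOT_MARKERS):
            return folder
    raise FileNotFoundError(f"No project root found above {start}")


if __name__ == "__main__":
    try:
        root = find_project_root(Path.cwd())
        # The same line works from any folder inside the project
        print(root / "data" / "survey.csv")
    except FileNotFoundError as err:
        print(err)
```

Because the path is built from the project root rather than from the current user's folders, the code keeps working when the project is moved or shared.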
A project folder contains all the files necessary for your project, ePortfolio, or any repository in general:
data
results
docs
src or py
scripts/analysis
README.md
LICENCE
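A layout like this can be created by a short script, which also makes it easy to start every new project the same way. The sketch below assumes the folder names listed above (using src and scripts); the function name create_skeleton and the project name my-project are hypothetical.

```python
import tempfile
from pathlib import Path

FOLDERS = ("data", "results", "docs", "src", "scripts")
FILES = ("README.md", "LICENCE")


def create_skeleton(root: Path) -> None:
    """Create the empty folders and top-level files of a new project."""
    for folder in FOLDERS:
        # exist_ok=True makes the script safe to rerun
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)


if __name__ == "__main__":
    # A temporary folder stands in for the place where projects are kept
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "my-project"  # hypothetical project name
        create_skeleton(project)
        print(sorted(p.name for p in project.iterdir()))
```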
Organising files in data/, results/, docs/, and scripts/ requires some thought about how to name files, both for machines and for humans.
If you are using the py/ folder to store Python functions, these might need somewhat different naming conventions than the other folders, as these are functions you can use across the other files.
Here, names should reflect what the functions do rather than where they sit in the project structure.
An important part of project management, code automation, and data analytics in general is to have your files read by a piece of code or software.
Machines are clever, but extremely pedantic.
Be consistent, be meticulous.
Some machines are more clever than others, so name files in a way that the “dumbest” of them can deal with.
Naming - variables and filenames should have meaningful names in snake_case format, preferring all lower case.
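As an illustration of the snake_case convention, the sketch below converts a few loosely written names. The helper to_snake_case and the sample names are hypothetical; it is meant for names without a file extension, since a dot would also be turned into an underscore.

```python
import re


def to_snake_case(name: str) -> str:
    """Lower-case a name and join its words with underscores."""
    # Mark camelCase boundaries: "surveyData" -> "survey Data"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Replace every run of non-alphanumeric characters with one underscore
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()


for raw in ("Final Report v2", "surveyData", "My-File Name"):
    print(raw, "->", to_snake_case(raw))
```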
Machines will first list files starting with numbers (in ascending order), then files in alphabetical order.
But they won’t understand the difference between 1 and 10: sorted as text, 10 comes before 2, so pad numbers with leading zeros (01, 02, 10)
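The sorting problem is easy to reproduce. In this sketch the script names and the helper pad_prefix are made up for illustration; the point is that plain text sorting compares character by character, so zero-padding the number restores the intended order.

```python
names = ["1_import.py", "10_report.py", "2_clean.py"]

# Text sorting compares character by character, so "10" lands before "1_" and "2"
print(sorted(names))


def pad_prefix(name: str, width: int = 2) -> str:
    """Zero-pad the numeric prefix in front of the first underscore."""
    number, rest = name.split("_", 1)
    return f"{int(number):0{width}d}_{rest}"


padded = [pad_prefix(name) for name in names]
# With leading zeros, text order and numeric order agree
print(sorted(padded))
```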
Using dates in file names may also ensure decent organisation, but be consistent. The recommended format is YYYY-MM-DD, because it sorts chronologically.
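A short comparison shows why YYYY-MM-DD is the safer choice. The dates, the label survey, and the helper dated_name are hypothetical; the day-first names are included only as a counter-example.

```python
from datetime import date


def dated_name(day: date, label: str, extension: str) -> str:
    """Prefix a file name with the date in YYYY-MM-DD format."""
    return f"{day.isoformat()}_{label}.{extension}"


days = [date(2024, 11, 3), date(2023, 1, 15), date(2024, 2, 9)]

iso_names = [dated_name(day, "survey", "csv") for day in days]
# Sorting the names as text also sorts them by date
print(sorted(iso_names))

day_first = [day.strftime("%d-%m-%Y") + "_survey.csv" for day in days]
# Day-first names sort by day of the month, not chronologically
print(sorted(day_first))
```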
Consider using different separators for different parts of the file name
This way you can use the file name itself programmatically, if needed
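As a sketch of using a file name programmatically, assume one possible convention: underscores separate the parts (date, project, content, version) and hyphens are used inside a part. The convention, the function parse_file_name, and the date in the example name are assumptions for illustration.

```python
from datetime import date
from pathlib import Path


def parse_file_name(file_name: str) -> dict:
    """Split a name of the form date_project_content_version.ext into its parts."""
    stem = Path(file_name).stem  # drop the extension
    day, project, content, version = stem.split("_")
    return {
        "date": date.fromisoformat(day),
        "project": project,
        "content": content,
        # "v0-1" -> "0.1"
        "version": version.lstrip("v").replace("-", "."),
    }


info = parse_file_name("2021-10-05_pari-f_data_v0-1.dta")
print(info["project"], info["date"].year, info["version"])
```

Because hyphens never separate parts, a project name such as pari-f survives the split intact.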
Optimising file names for computers is great, but ultimately it is us humans who need to choose files to work with. Naming files so that the file name makes the content obvious (or at least gives an idea of it) helps with such interactions.
Images from plots should use png or svg
.png supports transparency and has no quality loss upon re-saving
.svg can rescale to infinity without getting grainy
.jpg best for photos, quality loss on rescale, blurry edges and poor text rendering
Images can also sometimes be saved as PDF, which is a vector format, but transparency support depends on the PDF version and the software used.
TIFF has fallen out of favour due to large file sizes, but is preferable to JPEG for photos.
A path is a string of characters used to uniquely identify a location in a directory structure. It is composed by following the directory tree hierarchy in which components, separated by a delimiting character, represent each directory.
The delimiting character is most commonly the slash (“/”), the backslash character (“\”), or colon (“:”), though some operating systems may use a different delimiter.
Resources can be represented by either absolute or relative paths
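Delimiting characters (/ or \) vary by operating system
MS Windows:
Uses : to specify the drive name (e.g., c:, d:, e:)
Uses the backslash (\) as delimiter
Example: J:\Work\PARI\PARI-F\data\pari-f_data_v0-1.dta
Linux/UNIX:
Does not use drive names (:)
Uses the slash (/) character as delimiter
Example: /E/syncwork/projects/confer/ps2021-10-ws-repro-research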
Software with a Linux/UNIX background, such as R, LaTeX, and Python, behaves differently under MS Windows when it comes to specifying file/folder paths
That is, when specifying a path (e.g., in R or LaTeX) in MS Windows, these programs do not like the backslash character (\), because the backslash is used for “escaping” other characters
Two solutions in MS Windows:
Use / instead of \
Use \\ instead of \
Note: Many programming languages/statistical packages (R, Python, …) can dynamically create a full path that follows the rules of the respective operating system
For example, in Python, os.path.join('e:', 'folder1', 'folder2', 'file') returns 'e:folder1\\folder2\\file' (in Windows)
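The backslash pitfall and both workarounds can be tried on any operating system, because Python ships the Windows path rules as ntpath and the Linux/macOS rules as posixpath (os.path is one or the other depending on the system). This is a sketch; the path C:\new_data is a made-up example.

```python
import ntpath      # Windows path rules, importable on every OS
import posixpath   # Linux/macOS path rules
from pathlib import PureWindowsPath

# Pitfall: in a normal string "\n" is a newline, not a backslash followed by "n"
broken = "C:\new_data"
print("\n" in broken)

# Solution 1: forward slashes, which Windows-aware tools accept
forward = PureWindowsPath("J:/Work/PARI/PARI-F")
# Solution 2: doubled backslashes
doubled = PureWindowsPath("J:\\Work\\PARI\\PARI-F")
print(forward == doubled, str(forward))

# Letting the language build the path with the delimiter of each system
windows_style = ntpath.join("e:", "folder1", "folder2", "file")
unix_style = posixpath.join("folder1", "folder2", "file")
print(windows_style, unix_style)
```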
A handy tool when working on both operating systems: Path Copy Copy, which copies file paths from Windows Explorer’s contextual menu
Mac users can right-click a file and hold the Option key to “Copy as Pathname”
An absolute path specifies a file or directory location from the root directory.
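Examples:
J:\Work\PARI\PARI-F\data\pari-f_data_v0-1.dta
/E/syncwork/projects/confer/ps2021-10-ws-repro-research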
A relative path specifies a location relative to the current directory which is a “fixed location” on your computer
Often, this “fixed location” is the so-called “working directory”
. denotes the current working directory
.. denotes the parent directory, i.e., it points upwards in the folder hierarchy
~ denotes your home directory, e.g. cd ~ will bring you back to it
Examples:
subfolder/file.txt # Inside a subfolder
./file.txt # Current directory
../file.txt # Parent directory
A relative path resolves to a different location depending on where the command is run.
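The sketch below resolves the same relative paths against two different working directories. The directories under /home/ana and the helper resolve are hypothetical; posixpath is used so that the result is the same on every operating system.

```python
import posixpath


def resolve(working_directory: str, relative_path: str) -> str:
    """Join a working directory and a relative path, then tidy up . and .. parts."""
    return posixpath.normpath(posixpath.join(working_directory, relative_path))


# Hypothetical working directories
project = "/home/ana/project"
analysis = "/home/ana/project/analysis"

for relative in ("subfolder/file.txt", "./file.txt", "../file.txt"):
    # The same relative path points to a different file in each case
    print(relative, "->", resolve(project, relative), "|", resolve(analysis, relative))
```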
So, let’s assume the project “PARI-F” is located on drive J: and its full absolute path is J:\Work\PARI\PARI-F
The folder structure of PARI-F is:
.
|-- analysis
|-- data
|-- doc
|-- pari-f.stpr
`-- report
All other file- or folder-related operations are defined relative to this working directory
The huge benefit: when you share your project with a colleague or move it to a new computer, you only have to define the working directory once, and everything else should work flawlessly
How to define a working directory?
import os; os.chdir("full-path-to-working-directory")
How to get information about the current working directory?
import os; os.getcwd()
(cwd = current working directory); see below for an example
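A small sketch of both calls, using a temporary folder as a stand-in for a project folder (the file name notes.txt is made up). It only illustrates what the working directory does; the earlier advice against hard-coded os.chdir calls in project code still applies.

```python
import os
import tempfile

original = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as tmp:  # stand-in for a project folder
    try:
        os.chdir(tmp)  # define the working directory
        same_folder = os.path.samefile(os.getcwd(), tmp)
        # A relative path is now interpreted inside the temporary folder
        with open("notes.txt", "w") as handle:
            handle.write("hello")
        created_inside = os.path.exists(os.path.join(tmp, "notes.txt"))
    finally:
        os.chdir(original)  # restore before the folder is deleted

print(same_folder, created_inside, os.getcwd() == original)
```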
From a terminal/command line, print the current working directory with:
Mac/Linux: pwd
Windows (Git Bash or WSL): pwd
Windows (Command Prompt): cd
Moving between directories:
Special notations:
. (current directory)
.. (parent directory)
Moving into a subdirectory: cd subfolder
Moving up one level: cd ..
Accessing a file in a parent directory: cat ../file.txt
See here for more examples
Huge thanks to the following people who have generated and shared most of the content of this lecture:
Athanasia Monika Mowinckel: Mind your data - Creating Organised Research Projects
Bernd Weiß: Tools and Workflows for Reproducible Research in the Quantitative Social Sciences - Computer Literacy
Thanks for your attention and don’t hesitate to ask if you have any questions!
@damien_dupre
@damien-dupre
https://damien-dupre.github.io
damien.dupre@dcu.ie