BAA1028 - Workflow & Data Management
Available on the module’s loop page and, more importantly, online at:
Website that showcases your coding skills, achievements, experiences, and personal reflections. It should include various media types—documents, images, videos, audio clips, presentations, and web links—that can be tailored for different audiences. Deadline: April 13, 2025
…Just to name a few.
These services offer a platform to design an eportfolio/website and to host the eportfolio/website on their own servers.
Instead, we are going to create our eportfolio/website manually and host it on a specific server.
Overall design, attention to details, and innovation.
Do NOT copy other students’ work
Do NOT copy code online without reference
Do NOT use full code templates
Which file extensions have you found?
Web and Interactive Content
• .html / .htm: HyperText Markup Language files for web content.
• .css: Cascading Style Sheets for web design.
• .js: JavaScript files for interactive elements.
Text and Document Formats
• .doc / .docx: Microsoft Word documents.
• .pdf: Portable Document Format, widely used for static, professional documents.
• .txt: Plain text files.
• .rtf: Rich Text Format, compatible with various word processors.
• .odt: OpenDocument Text, used by open-source tools like LibreOffice.
Presentation Formats
• .ppt / .pptx: Microsoft PowerPoint presentations.
• .odp: OpenDocument Presentation format.
• .key: Apple Keynote presentations.
Spreadsheet Formats
• .xls / .xlsx: Microsoft Excel spreadsheets.
• .csv: Comma-separated values, often used for data sharing.
• .ods: OpenDocument Spreadsheet format.
Image Formats
• .jpg / .jpeg: Compressed image files.
• .png: High-quality image files supporting transparency.
• .gif: Animated or static image files.
• .bmp: Bitmap image files.
• .tiff / .tif: High-resolution image files.
• .webp: Compressed image format for the web.
• .ico: Icon image files, often used for branding or UI mockups.
• .heic: High Efficiency Image Format, used in modern Apple devices.
• .svgz: Compressed Scalable Vector Graphics.
• .raw: Camera raw image files from DSLRs.
Audio Formats
• .mp3: Compressed audio files.
• .wav: High-quality audio files.
• .ogg: Open-source audio file format.
Video Formats
• .mp4: A widely-used video format compatible with most devices.
• .avi: Video format with higher quality, but larger file size.
• .mov: Apple QuickTime video format.
• .wmv: Windows Media Video format.
Other Formats
• .svg: Scalable Vector Graphics, ideal for logos and illustrations.
• .md: Markdown files, often used in coding or minimalist documentation.
• .log: Log files for tracking progress or issues.
• .yml / .yaml: Data serialisation formats, often for configuration files.
• .zip / .tar.gz / .7z: Compressed archive formats for distributing multiple files.
• .dat: Generic data files.
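To answer the question above for a folder of your own, a short script can count the extensions it contains. This is a minimal sketch using only the standard library; the function name count_extensions is made up for this illustration.

```python
from collections import Counter
from pathlib import Path


def count_extensions(folder):
    """Count how often each file extension occurs under `folder` (recursively)."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Lower-case so that .JPG and .jpg are counted together;
            # files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # "." is the current folder; replace it with any folder you want to inspect.
    for extension, n in count_extensions(".").most_common():
        print(extension, n)
```

Note that path.suffix only returns the last extension, so a .tar.gz archive is counted under .gz.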
Keep your projects tidy: one folder per project, and all the project folders in one “project” folder.
Why do we care about project management?
The ability to move the project without breaking the code or having to adapt it
The ability to rerun the entire process from scratch
A path is a string of characters used to uniquely identify a location in a directory structure. It is composed by following the directory tree hierarchy in which components, separated by a delimiting character, represent each directory.
The delimiting character is most commonly the slash (“/”), the backslash character (“\”), or colon (“:”), though some operating systems may use a different delimiter.
Resources can be represented by either absolute or relative paths
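Delimiting characters / or \ vary by operating system
MS Windows: a colon : specifies the drive name (e.g., c:, d:, e:) and the backslash \ separates folders, e.g. J:\Work\PARI\PARI-F\data\pari-f_data_v0-1.dta
Linux/UNIX and macOS: no colon : for drive names; folders are separated by the slash / character, e.g. /E/syncwork/projects/confer/ps2021-10-ws-repro-research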
Software with a Linux/UNIX background, such as R, \(\LaTeX\), and Python, behaves differently under MS Windows when it comes to specifying file/folder paths
That is, when specifying a path (e.g., in R or \(\LaTeX\)) in MS Windows, these programs do not like the backslash character \ (the backslash is used for “escaping” other characters)
Two solutions in MS Windows:
Use / instead of \
Use \\ instead of \
Note: Many programming languages/statistical packages (R, Python, …) can dynamically create a full path that follows the rules of the respective operating system
For example, in Python, os.path.join("e:", "folder1", "folder2", "file") returns 'e:folder1\\folder2\\file' (in Windows)
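As an illustration of building paths dynamically, here is a minimal sketch with Python's pathlib. The "pure" path classes are used so that both flavours can be shown on any operating system; the folder and file names are placeholders, not real locations.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Placeholder path components, joined by the library instead of by hand.
parts = ["folder1", "folder2", "file"]

# Windows flavour: a drive plus backslash-separated components.
windows_path = PureWindowsPath("e:/", *parts)

# POSIX flavour (Linux/macOS): forward slashes from the root.
posix_path = PurePosixPath("/", *parts)

print(windows_path)             # e:\folder1\folder2\file
print(posix_path)               # /folder1/folder2/file
print(windows_path.as_posix())  # e:/folder1/folder2/file
```

In everyday code, pathlib.Path picks the flavour of the operating system it runs on, so the same script produces valid paths on Windows, macOS, and Linux.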
A handy tool when working on both operating systems: Path Copy Copy – Copy file paths from Windows explorer’s contextual menu
Mac users can right-click a file in Finder and hold the Option key to “Copy as Pathname”
An absolute path specifies a file or directory location from the root directory.
Examples:
A relative path specifies a location relative to the current directory which is a “fixed location” on your computer
Often, this “fixed location” is the so-called “working directory”
. denotes the current working directory
.. denotes the parent directory, i.e., it points upwards in the folder hierarchy
~ will bring you back to your home directory, e.g. cd ~
Examples:
subfolder/file.txt # Inside a subfolder
./file.txt # Current directory
../file.txt # Parent directory
A relative path resolves differently depending on where the command is run.
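To see why the same relative path can point to different files, the sketch below resolves ../file.txt from two working directories. The directories are hypothetical, and the helper name resolve is made up for this example; posixpath is used so that the output looks the same on every operating system.

```python
import posixpath


def resolve(working_directory, relative_path):
    """Return the absolute location a relative path points to from a given directory."""
    return posixpath.normpath(posixpath.join(working_directory, relative_path))


# Two hypothetical working directories, same relative path.
for cwd in ["/home/ana/project", "/home/ana/project/analysis"]:
    print(cwd, "->", resolve(cwd, "../file.txt"))
# /home/ana/project -> /home/ana/file.txt
# /home/ana/project/analysis -> /home/ana/project/file.txt
```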
So, let’s assume the project “PARI-F” is located on drive J: and its full absolute path is J:\Work\PARI\PARI-F
The folder structure of PARI-F is:
.
|-- analysis
|-- data
|-- doc
|-- pari-f.stpr
`-- report
All other file- or folder-related operations are defined relative to this working directory
The huge benefit: when you share your project with a colleague or move it to a new computer, you only have to define the working directory once, and everything else should work flawlessly
How to define a working directory?
import os; os.chdir("full-path-to-working-directory")
How to get information about the current working directory?
import os; os.getcwd()
(cwd = current working directory); see below for an example
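A minimal sketch of both calls, run inside a temporary folder so that nothing on your computer is changed. The temporary folder stands in for a project and the data subfolder is a placeholder.

```python
import os
import tempfile

start = os.getcwd()                        # remember where we started

with tempfile.TemporaryDirectory() as project:
    os.mkdir(os.path.join(project, "data"))
    os.chdir(project)                      # define the working directory once
    inside = os.getcwd()                   # now reports the project folder
    data_exists = os.path.isdir("data")    # relative path, resolved against cwd
    os.chdir(start)                        # go back before the folder is removed

print(inside)
print(data_exists)
```

Once the working directory is set, the relative path "data" is enough to reach the subfolder.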
From a terminal/command line:
Mac/Linux:
Windows (Git Bash or WSL):
Windows (Command Prompt):
Moving between directories:
Special notations:
. (current directory)
.. (parent directory)
Moving into a subdirectory:
Moving up one level:
Accessing a file in a parent directory:
See here for more examples
In your code, do not use:
Prefer:
# pip
python -m pip install pyprojroot
# conda
conda install -c conda-forge pyprojroot
Then:
What’s wrong with os.chdir('/path/to/your/directory')?
It will only ever work for the user creating the file
It is not portable
Increases likelihood that work from other processes leaks into current work
The pyprojroot library:
If all files are contained in the project folder, reference files with the here() function from the pyprojroot library
Creates relative paths from the project root
Allows several ways to indicate the project root folder
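The idea behind locating a project root can be sketched with the standard library alone: walk upwards from a starting folder until one contains a marker file or folder. This illustrates the principle and is not pyprojroot's actual implementation; the function name find_project_root and the marker names are assumptions made for this example.

```python
from pathlib import Path

# Assumed marker names; a real project may be identified by other files.
MARKERS = (".git", "pyproject.toml", ".here")


def find_project_root(start, markers=MARKERS):
    """Walk upwards from `start` until a folder containing a marker is found."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if any((folder / marker).exists() for marker in markers):
            return folder
    raise FileNotFoundError(f"No project root found above {start}")


# Typical use (not executed here, because it depends on where the script lives):
#   root = find_project_root(Path.cwd())
#   data_file = root / "data" / "my_data.csv"   # hypothetical file
```

Because every path is built from the detected root, the script keeps working when the project is moved or run from a subfolder.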
Contains all necessary files for your project, eportfolio or any repository in general:
data
results
docs
src or py
scripts/analysis
README.md
LICENCE
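A structure like this can be created in one go. The sketch below is one possible reading of the list above: it assumes a src folder for functions and a scripts folder for analysis code, and the function name create_project is made up for this example.

```python
from pathlib import Path

# Assumed layout, based on the list above.
FOLDERS = ["data", "results", "docs", "src", "scripts"]
FILES = ["README.md", "LICENCE"]


def create_project(root):
    """Create an empty project skeleton under `root` and return its path."""
    root = Path(root)
    for folder in FOLDERS:
        # exist_ok=True makes the function safe to run more than once.
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root
```

Calling create_project("my_project") in your “project” folder creates the folders and two empty files; existing content is left untouched.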
Organising files in data/, results/, docs/, and scripts/ requires some idea of how to name files.
If you are using the py/ folder to store Python functions, these might need somewhat different naming conventions than the other folders, as these are functions you can use across the other files.
Here, naming should be thought of in terms of content rather than structural organisation.
An important part of project management, code automation, and data analytics in general is to have your files read by a piece of code or software.
Machines are clever, but extremely pedantic.
Be consistent, be meticulous.
Some machines are more clever than others, so name files in a way that the “dumbest” of them can deal with.
Naming: variables and filenames should have meaningful names in snake_case format, preferring all lower case.
Machines will first list files starting with numbers (in ascending order), then in alphabetical order.
But they won’t understand the difference between 1 and 10: sorted as text, 10 comes before 2 unless the numbers are zero-padded (01, 02, …, 10).
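The sorting issue is easy to reproduce. In this small sketch the file names are invented; the only difference between the two lists is the zero-padding of the leading number.

```python
# Invented file names: same scripts, with and without zero-padded numbers.
unpadded = ["1_intro.py", "2_clean.py", "10_report.py"]
padded = ["01_intro.py", "02_clean.py", "10_report.py"]

# Text sorting compares character by character, so "10" lands before "1_" and "2".
print(sorted(unpadded))  # ['10_report.py', '1_intro.py', '2_clean.py']

# With zero-padding, text order and numeric order agree.
print(sorted(padded))    # ['01_intro.py', '02_clean.py', '10_report.py']
```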
Using dates in file names can also keep files well organised, but be consistent: the YYYY-MM-DD format is recommended.
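The reason for YYYY-MM-DD is that alphabetical order then matches chronological order. A small sketch with invented file names:

```python
# Invented file names carrying the same three dates in two formats.
names_dmy = ["report_01-12-2024.csv", "report_15-01-2025.csv", "report_02-03-2024.csv"]
names_iso = ["report_2024-12-01.csv", "report_2025-01-15.csv", "report_2024-03-02.csv"]

# DD-MM-YYYY sorts by day first: December 2024 ends up before March 2024.
print(sorted(names_dmy))
# ['report_01-12-2024.csv', 'report_02-03-2024.csv', 'report_15-01-2025.csv']

# YYYY-MM-DD sorts chronologically.
print(sorted(names_iso))
# ['report_2024-03-02.csv', 'report_2024-12-01.csv', 'report_2025-01-15.csv']
```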
Consider using different separators for different parts of the file name
This way you can use the file name itself programmatically, if needed
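As a sketch of using a file name programmatically, assume a hypothetical convention in which an underscore separates the parts of the name (date, project, description, version) and a hyphen separates words within a part. The convention, the function name parse_file_name, and the example file name are all invented for this illustration.

```python
from pathlib import Path


def parse_file_name(file_name):
    """Split a name such as 2025-04-13_portfolio_site-map_v02.png into its parts."""
    stem = Path(file_name).stem  # drop the extension
    # "_" separates the four parts; "-" only appears inside a part.
    date, project, description, version = stem.split("_")
    return {
        "date": date,
        "project": project,
        "description": description.replace("-", " "),
        "version": version,
    }


print(parse_file_name("2025-04-13_portfolio_site-map_v02.png"))
# {'date': '2025-04-13', 'project': 'portfolio', 'description': 'site map', 'version': 'v02'}
```

This only works if every file follows the convention exactly, which is one more reason to be consistent.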
Optimising file names for computers is great, but ultimately it’s us humans who need to choose files to work with. Naming files so that the file name makes the content obvious (or at least gives an idea of it) helps with such interactions.
Images from plots should use png or svg
.png supports transparency and has no quality loss upon re-saving
.svg can rescale to infinity without getting grainy
.jpg best for photos, quality loss on rescale, blurry edges and poor text rendering
Images can also sometimes be saved as PDF, a vector format, but transparency is not handled consistently by all tools that produce or display PDFs.
TIFF has fallen out of favour due to large file sizes, but is preferable to JPEG for photos.
Huge thanks to the following people, who have generated and shared most of the content of this lecture:
Athanasia Monika Mowinckel: Mind your data - Creating Organised Research Projects
Bernd Weiß: Tools and Workflows for Reproducible Research in the Quantitative Social Sciences - Computer Literacy
Thanks for your attention and don’t hesitate to ask if you have any questions!
@damien_dupre
@damien-dupre
https://damien-dupre.github.io
damien.dupre@dcu.ie