I am a data scientist with a background in solar physics and long experience in turning complex data into valuable insights. Having started out with MATLAB, I now use the Python stack to solve problems.
This data science project portfolio contains 14 projects that together cover the entire data science workflow, from acquiring data to reflecting on the outcome. The projects apply Python and SQL to solve problems; the packages and tools used include NumPy, pandas, PostgreSQL, scikit-learn, matplotlib, and Jupyter, among others.
EU industry production — From an online dataset to a visualization of key trends
From the European Union Open Data Portal, I obtain data on the industry production growth in each EU member state and analyze which countries outperform the others.
Here is a link to the summary of the end-to-end project. Below you can find the individual projects for the single data science steps.
Projects: Individual data science steps
The end-to-end project is broken down into smaller projects that each focus on a single step of the entire data science process:
Credit: based on “R for Data Science” by Hadley Wickham and Garrett Grolemund, and “Data Science Portfolio Guide” by Sebastian Gutierrez
Below is a list of these projects, in the order of the data science steps. Clicking on a project name brings you to the project write-up:
Import data from original source or local data store
Import from source
Import data from an original source (data repository, API, etc.)
I obtain monthly EU industry production data from the European Union Open Data Portal through an API.
Store imported data in a data store (e.g., SQL database)
I store EU industry production data in a PostgreSQL database using the SQLAlchemy package.
Extract whole dataset(s) or specific parts from a data store (e.g., SQL database) and load into Python
Making use of SQLAlchemy and SQL queries, I extract EU industry production data for further analysis from the PostgreSQL database where I previously stored it.
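The store-and-extract round trip described above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for PostgreSQL so that the snippet runs without a server, and the table name, column names, and sample rows are hypothetical rather than taken from the project notebooks.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
# With PostgreSQL, only the connection URL passed to create_engine changes.
engine = create_engine("sqlite://")

# Hypothetical sample rows; the real table layout is not given in the text.
sample = pd.DataFrame({
    "country": ["DE", "DE", "PL", "PL"],
    "month": ["2015-01", "2015-02", "2015-01", "2015-02"],
    "production_index": [100.0, 100.4, 100.0, 101.1],
})

# Store: write the DataFrame into a table.
sample.to_sql("industry_production", engine, index=False, if_exists="replace")

# Extract: pull one country's series back with a parameterized SQL query.
query = text(
    "SELECT month, production_index FROM industry_production "
    "WHERE country = :country ORDER BY month"
)
with engine.connect() as conn:
    poland = pd.read_sql(query, conn, params={"country": "PL"})
```

Passing the country as a bound parameter rather than formatting it into the query string keeps the query safe to reuse for other countries.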
Reorganizing, combining and cleaning data so that it is ready for analysis
Reorganize and/or combine dataset(s) in a tidy form for analysis/visualization/modeling
I use pandas DataFrame methods to bring EU industry production data into a tidy format to facilitate further analysis.
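As a rough illustration of what tidying can look like, the sketch below reshapes a wide table (one column per month) into a long one with one observation per row. The wide input layout and the column names are assumptions made for the example; the actual dataset may be organized differently.

```python
import pandas as pd

# Hypothetical wide layout: one row per country, one column per month.
wide = pd.DataFrame({
    "country": ["DE", "PL"],
    "2015-01": [100.0, 100.0],
    "2015-02": [100.4, 101.1],
})

# Tidy form: each row is one (country, month) observation.
tidy = wide.melt(id_vars="country", var_name="month", value_name="production_index")
tidy = tidy.sort_values(["country", "month"]).reset_index(drop=True)
```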
Treat data values: Identify missing data, fix data types, split values etc.
I make the EU industry production index values, which I previously put in a tidy form, ready for analysis by separating the numbers from the attached flag values with pandas methods.
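One way to split numbers from flags is a regular expression with pandas string methods, sketched below. The raw format is an assumption: values are taken to look like "101.3 p" (a number followed by an optional letter flag), with ":" marking a missing value. The project's actual cleaning rules may differ.

```python
import pandas as pd

# Hypothetical raw strings: a number, optionally followed by a letter flag;
# ":" is assumed to mark a missing value.
raw = pd.Series(["100.0", "101.3 p", ":", "99.8 e"], name="raw_value")

# Named groups become the columns "value" and "flag"; rows that do not
# match the pattern (such as ":") come back as missing in both columns.
pattern = r"^\s*(?P<value>[-+]?\d+(?:\.\d+)?)?\s*(?P<flag>[a-z]+)?\s*$"
parts = raw.str.extract(pattern)

# Fix the data type: turn the extracted text into floats.
parts["value"] = pd.to_numeric(parts["value"], errors="coerce")
```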
Make sense of data by iteratively using statistics, visualization, data transformation and modeling
Show trends and statistics of data in compact, human-friendly way
I use statistical and graphical tools to perform exploratory data analysis (EDA) on the EU industry production dataset as a starting point for modeling the time series.
I use pandas’ interface to the matplotlib library to create bar charts that visualize the manufacturing growth dynamics of European countries.
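A bar chart through the pandas plotting interface might look like the sketch below. The growth figures are made-up placeholders, not project results, and the non-interactive "Agg" backend is chosen only so that the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless rendering, no window needed
import pandas as pd

# Placeholder growth values purely for illustration (not project results).
growth = pd.Series({"PL": 0.020, "DE": 0.004, "IT": -0.006}, name="growth")

# pandas delegates to matplotlib and returns the Axes object.
ax = growth.sort_values(ascending=False).plot.bar()
ax.set_xlabel("Country")
ax.set_ylabel("Growth of production index")
ax.figure.tight_layout()
```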
Prepare data for modeling: Feature engineering, value imputation etc.
I divide the EU industry production index time series for each country by the smoothed EU average time series to bring out the countries’ individual development for further modeling.
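The normalization step can be sketched as below. Two points are assumptions: the EU average is computed here as the mean across the sample countries (the dataset may instead provide its own EU aggregate series), and the smoothing is a centered three-month rolling mean, since the text does not say which smoothing method was used. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical index values: rows are months, columns are countries.
months = pd.period_range("2015-01", periods=6, freq="M")
index_values = pd.DataFrame(
    {
        "DE": [100, 101, 102, 103, 104, 105],
        "PL": [100, 102, 104, 106, 108, 110],
    },
    index=months,
    dtype=float,
)

# Assumption: EU average taken as the mean across the sample countries.
eu_average = index_values.mean(axis=1)

# Assumption: smoothing by a centered 3-month rolling mean.
smoothed = eu_average.rolling(window=3, center=True, min_periods=1).mean()

# Divide each country's series by the smoothed average, row by row.
relative = index_values.div(smoothed, axis=0)
```

Values above 1 then indicate months in which a country sits above the smoothed average, and a rising ratio indicates that it is pulling ahead.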
Experiment design: Selecting model, metric and algorithm
I select a linear model with slope and intercept parameters to describe the growth dynamics of the EU industry production index of each country.
Execution, iteration and finalization of modeling algorithm
Using the linear model from Python’s scikit-learn package, I obtain the slopes in the EU industry production time series for each country.
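A minimal sketch of fitting one slope per country with scikit-learn's LinearRegression follows. The input values are hypothetical, and using the month number (0, 1, 2, ...) as the single feature is an assumption about how time enters the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical normalized series: rows are consecutive months.
relative = pd.DataFrame({
    "DE": [1.00, 0.99, 0.98, 0.97],
    "PL": [1.00, 1.02, 1.04, 1.06],
})

# Assumption: time is encoded as the month number, shaped (n_samples, 1).
t = np.arange(len(relative)).reshape(-1, 1)

# Fit slope and intercept per country and keep the slope.
fitted = {}
for country in relative.columns:
    model = LinearRegression().fit(t, relative[country].to_numpy())
    fitted[country] = model.coef_[0]

slopes = pd.Series(fitted, name="slope").sort_values(ascending=False)
```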
Interpret results and draw conclusions
I use the visualization of the EU countries’ manufacturing growth rate with a pandas/matplotlib bar chart to show that the performance mostly depends on geographical position: the East beats the South.
Report project results, suggest next steps, document code
Summarize and discuss project
What was the approach, what worked, what didn’t work, what assumptions were made, what would I do differently, why do the insights matter
I give a high-level description and discussion of the projects in this portfolio that deal with the analysis of the EU industry production dataset, from finding the data to possible next steps.
Possible next projects that build on/complement current project
I discuss the scope of the analysis of the EU industry production dataset and point to possible extensions with additional datasets.
Code and steps that are necessary for replication
I specify the software versions that I used for the analysis of the EU industry production dataset and provide links to the GitHub repository where the Jupyter notebooks are stored.