Jan Langfellner – Data Science Project Portfolio

I am a data scientist with a background in solar physics and long experience in turning complex data into valuable insights. Having started out with Matlab, I now use the Python stack to solve problems.

Portfolio

This data science project portfolio contains 14 projects that together cover the entire data science workflow, from acquiring data to reflecting upon the outcome. The projects apply Python and SQL to solve problems; the packages and tools used include NumPy, pandas, PostgreSQL, scikit-learn, matplotlib, Jupyter, and others.

Projects: End-to-end

EU industry production — From an online dataset to a visualization of key trends

From the European Union Open Data Portal, I obtain data on the industry production growth in each EU member state and analyze which countries outperform the others.

Here is a link to the summary of the end-to-end project. Below you can find the individual projects for the single data science steps.

Projects: Individual data science steps

The end-to-end project is broken down into smaller projects that each focus on a single step of the entire data science process:

Credit: based on “R for Data Science” by Hadley Wickham and Garrett Grolemund, and “Data Science Portfolio Guide” by Sebastian Gutierrez

Below is a list of these projects, in the order of the data science steps. Clicking on a project name brings you to the project write-up:

Import

Import data from original source or local data store

Import from source

Import data from an original source (data repository, API, etc.)

Automated data retrieval from an online repository to harness free data sources

I obtain monthly EU industry production data from the European Union Open Data Portal through an API.

Store

Store imported data in a data store (e.g., SQL database)

Storing a pandas dataframe in a PostgreSQL database

I store EU industry production data in a PostgreSQL database using the SQLAlchemy package.
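The storing step can be sketched as follows. This is a minimal illustration, not the project's code: it uses made-up sample values and Python's built-in SQLite as a stand-in for PostgreSQL, so that it runs without a database server. The table name `industry_production` and the column names are assumptions. With PostgreSQL, one would pass a SQLAlchemy engine to `to_sql` instead of the SQLite connection.

```python
import sqlite3

import pandas as pd

# Made-up sample values, for illustration only
df = pd.DataFrame(
    {
        "country": ["DE", "DE", "FR", "FR"],
        "month": ["2017-01", "2017-02", "2017-01", "2017-02"],
        "production_index": [101.2, 102.0, 99.5, 100.1],
    }
)

# In-memory SQLite database as a stand-in for PostgreSQL (assumption)
conn = sqlite3.connect(":memory:")

# Write the dataframe to a table; replace the table if it already exists
df.to_sql("industry_production", conn, if_exists="replace", index=False)

# Check that all rows arrived in the data store
n_rows = conn.execute("SELECT COUNT(*) FROM industry_production").fetchone()[0]
print(n_rows)
```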

Extract

Extract whole dataset(s) or specific parts from a data store (e.g., SQL database) and load into Python

Using SQL queries to extract data from a PostgreSQL database

Making use of SQLAlchemy and SQL queries, I extract EU industry production data for further analysis from the PostgreSQL database where I previously stored it.
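A rough sketch of such an extraction, under stated assumptions: the table `industry_production`, its columns, and the sample values are made up, and an in-memory SQLite database stands in for PostgreSQL. The `?` placeholder is SQLite's parameter style; other database drivers use different placeholder styles.

```python
import sqlite3

import pandas as pd

# Set up a small stand-in table with made-up values (assumption)
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "country": ["DE", "DE", "FR", "FR"],
        "month": ["2017-01", "2017-02", "2017-01", "2017-02"],
        "production_index": [101.2, 102.0, 99.5, 100.1],
    }
).to_sql("industry_production", conn, index=False)

# Select only the rows and columns needed for the analysis
query = """
    SELECT month, production_index
    FROM industry_production
    WHERE country = ?
    ORDER BY month
"""

# Load the query result directly into a pandas dataframe
df_de = pd.read_sql_query(query, conn, params=("DE",))
print(df_de)
```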

Tidy

Reorganizing, combining and cleaning data so that it is ready for analysis

Reorganize

Reorganize and/or combine dataset(s) in a tidy form for analysis/visualization/modeling

Bringing an EU industry production dataframe into good shape

I use pandas dataframe methods to bring the EU industry production data into a tidy format that facilitates further analysis.
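To illustrate what bringing a dataframe into a tidy format can look like, here is a small sketch. It assumes a wide table with one column per month, which may differ from the layout of the actual dataset; the values and column names are made up.

```python
import pandas as pd

# Wide layout: one row per country, one column per month (assumed layout)
wide = pd.DataFrame(
    {
        "country": ["DE", "FR"],
        "2017-01": [101.2, 99.5],
        "2017-02": [102.0, 100.1],
    }
)

# Tidy layout: one row per (country, month) observation
tidy = wide.melt(id_vars="country", var_name="month", value_name="production_index")

# Convert the month labels to proper datetime values and sort
tidy["month"] = pd.to_datetime(tidy["month"], format="%Y-%m")
tidy = tidy.sort_values(["country", "month"]).reset_index(drop=True)
print(tidy)
```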

Clean

Treat data values: Identify missing data, fix data types, split values etc.

Making the numbers shine: Cleaning EU industry production index values

I prepare the EU industry production index values, which I previously put in a tidy form, for analysis by splitting them into numbers and flag values with pandas methods.
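A possible way to split such values is sketched below. The raw format is an assumption: a number optionally followed by a lowercase letter flag, with `:` marking a missing value. The sample strings are made up.

```python
import pandas as pd

# Raw values as strings: number, optional flag, ":" for missing (assumed format)
raw = pd.Series(["101.2", "99.5 p", ":", "100.1 e"])

# Capture the numeric part and the flag part in separate named groups;
# strings that do not match the pattern (such as ":") yield NaN in both columns
pattern = r"^\s*(?P<value>-?\d+(?:\.\d+)?)?\s*(?P<flag>[a-z]+)?\s*$"
parts = raw.str.extract(pattern)

# Convert the numeric part from string to float
parts["value"] = pd.to_numeric(parts["value"])
print(parts)
```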

Understand

Make sense of data by iteratively using statistics, visualization, data transformation and modeling

Visualize

Show trends and statistics of data in a compact, human-friendly way

Exploring the industry production history with EDA

I use statistical and graphical tools to perform exploratory data analysis (EDA) on the EU industry production dataset as a starting point for modeling the time series.

Different countries’ growth dynamics at a glance with bar charts

I use pandas’ interface to the matplotlib library to create bar charts that visualize the manufacturing growth dynamics of European countries.
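As a rough illustration of pandas' plotting interface, the sketch below draws a bar chart from a series of growth rates. The country codes and numbers are made up and do not reflect the project's results; the non-interactive `Agg` backend is chosen only so that the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed

import pandas as pd

# Made-up growth rates per country, for illustration only
growth = pd.Series({"PL": 4.1, "DE": 1.3, "IT": -0.4})

# pandas hands the plotting over to matplotlib and returns the Axes object
ax = growth.sort_values(ascending=False).plot.bar(color="steelblue")
ax.set_xlabel("Country")
ax.set_ylabel("Growth rate (made-up values)")
ax.figure.tight_layout()
```

The returned `ax` is a regular matplotlib Axes, so the figure can be saved with `ax.figure.savefig(...)` or refined further with matplotlib calls.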

Transform

Prepare data for modeling: Feature engineering, value imputation etc.

Removing common trends from a set of time series to highlight their differences

I divide the EU industry production index time series for each country by the smoothed EU average time series to bring out the countries’ individual development for further modeling.
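The division by a smoothed EU average might look like the following sketch. The smoothing method is an assumption: a centered rolling mean with a window of three months is used here for illustration, and the index values are made up.

```python
import pandas as pd

months = pd.date_range("2017-01-01", periods=6, freq="MS")

# Made-up index values: the EU average and two countries
index = pd.DataFrame(
    {
        "EU": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0],
        "DE": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0],
        "PL": [100.0, 102.0, 104.0, 106.0, 108.0, 110.0],
    },
    index=months,
)

# Smooth the EU average with a centered rolling mean (assumed method and window)
eu_smooth = index["EU"].rolling(window=3, center=True, min_periods=1).mean()

# Divide each country's series by the smoothed EU series, row by row
relative = index[["DE", "PL"]].div(eu_smooth, axis=0)
print(relative)
```

Values above 1 then indicate that a country's index lies above the common EU trend, values below 1 that it lies below.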

Model

Experiment design: Selecting model, metric and algorithm

Reducing complexity – from a time series to a single number: modeling

I select a linear model with slope and intercept parameters to describe the growth dynamics of the EU industry production index of each country.

Coding

Execution, iteration and finalization of modeling algorithm

Reducing complexity – from a time series to a single number: coding

Using the linear model from Python’s scikit-learn package, I obtain the slopes in the EU industry production time series for each country.
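Obtaining one slope per country with scikit-learn's `LinearRegression` can be sketched like this. The time axis, country codes and series are made up, with slopes chosen so that the result is easy to check; this is not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up time axis and series with known slopes (3.0 and -1.0)
t = np.arange(5, dtype=float)
data = pd.DataFrame({"PL": 100.0 + 3.0 * t, "IT": 100.0 - 1.0 * t})

X = t.reshape(-1, 1)  # scikit-learn expects a 2D feature array

# Fit one linear model per country and keep only the slope
slopes = {}
for country in data.columns:
    model = LinearRegression().fit(X, data[country].to_numpy())
    slopes[country] = float(model.coef_[0])

slopes = pd.Series(slopes)
print(slopes)
```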

Interpret

Interpret results and draw conclusions

Spotting trends in the manufacturing growth dynamics: Which region grew the fastest?

I use the visualization of the EU countries’ manufacturing growth rate with a pandas/matplotlib bar chart to show that the performance mostly depends on geographical position: the East beats the South.

Communicate

Report project results, suggest next steps, document code

Report

Summarize and discuss project
What was the approach, what worked, what didn’t work, what assumptions were made, what would I do differently, why do the insights matter

EU industry production: From an online dataset to a visualization of key trends

I give a high-level description and discussion of the projects in this portfolio that deal with the analysis of the EU industry production dataset, from finding the data to possible next steps.

Next Steps

Possible next projects that build on or complement the current project

Where to go from here: A more comprehensive study of economic growth

I discuss the scope of the analysis of the EU industry production dataset and point to possible extensions with additional datasets.

Document

Code and steps that are necessary for replication

Software and Jupyter notebooks for the industry production analysis

I specify the software versions that I used for the analysis of the EU industry production dataset and provide links to the GitHub repository where the Jupyter notebooks are stored.

Contact details

Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/

PhD thesis

Impressum