Reducing complexity – from a time series to a single number: coding

TL;DR

Using the linear model from Python’s scikit-learn package, I obtain the slopes in the EU industry production time series for each country.

Long Description

I prepare the normalized EU industry production index dataset for the fit routine of the scikit-learn linear model by forcing the time stamps into a 2D numpy array and the production indices into one 1D array per country. For each country, I perform the linear regression with the fit method and store both the intercept and slope parameters of the fit lines in 1D numpy arrays. I also obtain the slope by simply subtracting the production index at the beginning of the time series from its value at the end and dividing by the time span. Both slope estimates and the intercept I store together with the country_codes in a new dataframe, simplifying further analysis.

Table of contents

Project Background

After having defined the model to describe the EU industry production index time series in the previous project, I now fit the model with the help of the scikit-learn package.

While the package offers many different generalized linear models, I opt to remain with standard linear regression, given the simplicity of the problem with only one dependent variable.

Data preparation

Inspecting the data

I start by reading in the dataframe that I previously prepared for this project:

import pandas as pd
df = pd.read_pickle('EU_industry_production_dataframe_forregression.pkl')
df.info()
print(df.head())
<class 'pandas.core.frame.DataFrame'>
Float64Index: 212 entries, -10.0013689254 to 7.58110882957
Data columns (total 36 columns):
AT      211 non-null float64
BA      139 non-null float64
BE      211 non-null float64
BG      211 non-null float64
CY      211 non-null float64
CZ      211 non-null float64
DE      211 non-null float64
DK      211 non-null float64
EA19    211 non-null float64
EE      211 non-null float64
EL      211 non-null float64
ES      211 non-null float64
EU28    211 non-null float64
FI      211 non-null float64
FR      211 non-null float64
HR      211 non-null float64
HU      211 non-null float64
IE      211 non-null float64
IT      211 non-null float64
LT      211 non-null float64
LU      211 non-null float64
LV      211 non-null float64
ME      91 non-null float64
MK      211 non-null float64
MT      211 non-null float64
NL      211 non-null float64
NO      211 non-null float64
PL      211 non-null float64
PT      211 non-null float64
RO      211 non-null float64
RS      211 non-null float64
SE      211 non-null float64
SI      211 non-null float64
SK      211 non-null float64
TR      211 non-null float64
UK      211 non-null float64
dtypes: float64(36)
memory usage: 61.3 KB

country_code        AT  BA        BE        BG        CY        CZ        DE  \
time                                                                           
-10.001369    0.763119 NaN  0.659628  0.638720  1.076730  0.609450  0.882291   
-9.916496     0.804837 NaN  0.675563  0.650542  1.094662  0.628649  0.900751   
-9.837098     0.799001 NaN  0.682480  0.650229  1.071577  0.636704  0.900957   
-9.752225     0.806887 NaN  0.703174  0.646132  1.076540  0.650280  0.909562   
-9.670089     0.819436 NaN  0.705168  0.643401  1.054149  0.660902  0.923410   

country_code        DK      EA19        EE    ...           NO        PL  \
time                                          ...                          
-10.001369    1.108091  1.015053  0.577044    ...     0.993101  0.510140   
-9.916496     1.097790  1.027940  0.606756    ...     0.988324  0.521268   
-9.837098     1.114232  1.029963  0.587807    ...     0.989388  0.516022   
-9.752225     1.093134  1.036092  0.602572    ...     0.984236  0.533084   
-9.670089     1.179741  1.043854  0.593988    ...     0.957381  0.542516   

country_code        PT        RO        RS        SE        SI        SK  \
time                                                                       
-10.001369    1.266987  0.752666  0.963830  0.950240  0.822705  0.461008   
-9.916496     1.256255  0.753753  1.083194  0.987281  0.845496  0.473311   
-9.837098     1.190179  0.771952  1.087183  0.995630  0.857262  0.487932   
-9.752225     1.176105  0.781995  1.097283  1.015350  0.854594  0.490562   
-9.670089     1.220918  0.779288  1.089150  1.019148  0.866790  0.456043   

country_code        TR        UK  
time                              
-10.001369    0.614677  1.116454  
-9.916496     0.631776  1.129066  
-9.837098     0.616937  1.125676  
-9.752225     0.643020  1.116988  
-9.670089     0.644431  1.113856  

[5 rows x 36 columns]

Note that Bosnia and Herzegovina ('BA') and Montenegro ('ME') have many missing values, since their time series do not start in 2000, but later.

Dropping NaN values

The fitting routine for the linear model in scikit-learn does not accept NaN values though, so I need to account for that. Because only two countries are affected (which are not even in the EU), I simply drop them:

df.drop(['BA','ME'], axis=1, inplace=True)

In addition, all the other time series have only 211 non-null entries out of 212 values in total. These single NaN entries are the latest data point for each country. I drop them as well and get nice NaN-free time series:

df.drop(df.index[-1], inplace=True)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Float64Index: 211 entries, -10.0013689254 to 7.49623545517
Data columns (total 34 columns):
AT      211 non-null float64
BE      211 non-null float64
BG      211 non-null float64
CY      211 non-null float64
CZ      211 non-null float64
DE      211 non-null float64
DK      211 non-null float64
EA19    211 non-null float64
EE      211 non-null float64
EL      211 non-null float64
ES      211 non-null float64
EU28    211 non-null float64
FI      211 non-null float64
FR      211 non-null float64
HR      211 non-null float64
HU      211 non-null float64
IE      211 non-null float64
IT      211 non-null float64
LT      211 non-null float64
LU      211 non-null float64
LV      211 non-null float64
MK      211 non-null float64
MT      211 non-null float64
NL      211 non-null float64
NO      211 non-null float64
PL      211 non-null float64
PT      211 non-null float64
RO      211 non-null float64
RS      211 non-null float64
SE      211 non-null float64
SI      211 non-null float64
SK      211 non-null float64
TR      211 non-null float64
UK      211 non-null float64
dtypes: float64(34)
memory usage: 57.7 KB
None

Reshaping the dataset for scikit-learn

Next, I import the linear_model class from scikit-learn:

from sklearn import linear_model

Before actually performing the linear regression, I need to bring the dataset in a form that is recognized by scikit-learn. The fitting method takes two parameters: the input X and the output y.

X can be identified with the dependent variables, of which there is just one in this case, namely the time.

y is the outcome, that is the normalized industry production index.

The input X has to be a 2D array and is the same for all countries, so I assign it right away:

X = df.index.values            # Get numpy array of index (time stamps in years)
print(X.shape)

X = X.reshape(-1, 1)           # Transform numpy array from 1D to 2D
print(X.shape)
(211,)
(211, 1)

The output y is different for each country, so I will loop over all countries and perform a separate linear regression for each one.

I can get the country codes as a list by extracting all the column names from the dataframe:

country_list = df.columns.values   # Get list of all country codes
print(country_list)
['AT' 'BE' 'BG' 'CY' 'CZ' 'DE' 'DK' 'EA19' 'EE' 'EL' 'ES' 'EU28' 'FI' 'FR'
 'HR' 'HU' 'IE' 'IT' 'LT' 'LU' 'LV' 'MK' 'MT' 'NL' 'NO' 'PL' 'PT' 'RO' 'RS'
 'SE' 'SI' 'SK' 'TR' 'UK']

Performing the linear regression

Now I loop over all countries.

In each step, I assign the normalized industry production values to y, perform the linear regression, and store the slope and intercept values in numpy arrays:

import numpy as np

slope = np.empty(len(country_list))         # Create empty array for slopes from linear regression
intercept = np.empty(len(country_list))     # Create empty array for intercept from linear regression

# Loop over all countries:
for i, country in enumerate(country_list):
    y = df[country].values                  # Get industry production index values for selected country
    regr = linear_model.LinearRegression()  # Instantiate linear regression object
    regr.fit(X, y)                          # Perform linear regression (fit straight line)
    slope[i] = regr.coef_[0]                # Store slope in array (first and only coefficient)
    intercept[i] = regr.intercept_          # Store intercept in array

Let’s have a look at the parameter values that I obtained from the linear regression:

print(slope)
print(intercept)
[  1.69755062e-02   2.27397515e-02   2.33266960e-02  -2.73036128e-02
   3.01791696e-02   1.10648741e-02  -2.54430214e-03  -2.06856083e-03
   4.36360429e-02  -3.08404991e-02  -2.61402287e-02   3.36143245e-05
  -3.74731167e-03  -1.69121515e-02   8.05852423e-04   2.80566577e-02
   3.92839139e-02  -2.34791885e-02   3.66338076e-02  -6.37034263e-03
   2.77396054e-02   1.52164653e-02  -8.32050455e-03   5.89524088e-03
   6.97164004e-03   4.64402255e-02  -2.25316217e-02   3.60324141e-02
  -6.70048827e-04  -9.21470009e-03   9.93230550e-03   5.91275559e-02
   3.89830067e-02  -7.94040227e-03]

[ 0.97951101  0.93506694  1.00117521  0.90635968  0.98434731  0.98850886
  1.07562782  1.00401122  1.06813739  1.01401165  1.03302358  1.00149184
  0.97096251  1.03319856  0.95866234  0.98838774  1.04417995  1.00206516
  1.02512619  0.99293733  1.06671794  1.07534489  0.9701639   0.96239976
  0.96680512  0.92858528  1.04249824  1.02776252  1.01763429  0.974905
  0.97854261  0.99558271  0.98038008  1.00006333]

The slope values range between plus/minus a few percent per year, which appears sensible (judging from the plots in the previous project).

It also makes sense that the intercept values are close to one, which is the reference value for zero time (the year 2010).

An alternative measure for the slope

As a simple measure to check for the robustness of the slope values, I also calculate the slope using an alternative method, namely the difference between the last and the first production index values of the time series, divided by the length of the time series in years.

To reduce the influence of the short-time fluctuations, I average over the five first and last values instead of just taking one value each:

slope_alt = np.empty(len(country_list))     # Create empty array for slopes from difference end of time series minus beginning

# Loop over all countries:
for i, country in enumerate(country_list):
    y = df[country].values                  # Get industry production index values for selected country
    slope_alt[i] = (y[-5:].mean() - y[:5].mean())/(X[-1,0] - X[0,0])      # Store slope in array

Let’s also print the alternative slope values:

print(slope_alt)
[ 0.01540618  0.0203894   0.02858958 -0.01814415  0.03178366  0.00758084
  0.00022828 -0.00215899  0.04310238 -0.02575114 -0.02014504 -0.00021196
 -0.0016687  -0.01616444  0.00645503  0.0299987   0.04527303 -0.02057385
  0.03709763 -0.00015542  0.03415784  0.00874983 -0.0112554   0.0037613
 -0.00294682  0.04093304 -0.01623621  0.03449866  0.00147644 -0.00471612
  0.01575481  0.05198007  0.03306995 -0.01023881]

Storing the slopes in a new dataframe

Let’s now create a new dataframe containing the two slope estimates and the intercept from the linear regression:

# Put data in dictionary:
slope_dict = {'country_code':country_list, 'slope':slope, 'intercept':intercept, 'slope_alt':slope_alt}

# Construct dataframe from dictionary:
df_slopes = pd.DataFrame(slope_dict)

# Choose country_code column as index:
df_slopes.set_index('country_code', inplace=True)

print(df_slopes.info())
print(df_slopes.head())
<class 'pandas.core.frame.DataFrame'>
Index: 34 entries, AT to UK
Data columns (total 3 columns):
intercept    34 non-null float64
slope        34 non-null float64
slope_alt    34 non-null float64
dtypes: float64(3)
memory usage: 1.1+ KB
None

              intercept     slope  slope_alt
country_code                                
AT             0.979511  0.016976   0.015406
BE             0.935067  0.022740   0.020389
BG             1.001175  0.023327   0.028590
CY             0.906360 -0.027304  -0.018144
CZ             0.984347  0.030179   0.031784

This dataframe contains the slopes for each country. I will visually explore and analyze them in the next project.

Conclusion

I have prepared the normalized EU industry production index dataset as a set of NaN-free numpy arrays that can be used as input for the linear regression method of the scikit-learn package. Looping over all countries, I obtained both slope and intercept of the fit lines and stored them in numpy arrays. As an alternative way of obtaining the slope (that will be useful for assessing the robustness of the slopes), I have also computed the difference of the normalized production index between the end and the beginning of the datasets.

The slopes and intercept I have put together with the country codes in a new dataframe that I will use for visualization and further analysis.

Note that I needn’t throw away the (shorter) time series for Bosnia and Herzegovina as well as Montenegro. This was the simplest measure here and did not cost me much, but other datasets might contain many more missing values. To avoid throwing away data with NaN values, different strategies can be employed, e.g. imputing the mean or following values at single NaN values. In the case of the two mentioned countries in the EU industry production dataset, I could have restricted the linear regression to the shorter (NaN-free) time period to obtain slope and intercept values.

Code

The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.3 kernel and Anaconda 5.0.1.

The Jupyter notebook can be found on Github.

Bio

I am a data scientist with a background in solar physics, with a long experience of turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.

Contact details

Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/

PhD thesis

Impressum

One thought on “Reducing complexity – from a time series to a single number: coding

Leave a Reply