This is the third part of our example showing how open data datasets from the Open Data portal of Vienna can be used to obtain insights about the city. For the analysis we work the open and freely available Jupyter Notebooks software. In this part, we link pet statistics with further open datasets concerning population and income, and analyze correlations within this data.

Open Data Story¶

In the previous tutorial we set out to explore the dog statistics in Vienna and discovered that there is a significant variation in the number of dogs across different districts. In this tutorial we explore additional data sets trying to explain the differences between Vienna districts that may influence the dog statistics. Here is a set of questions we have in mind:

Research Questions¶

Why some districts have more/less dogs than others?
Where should I live if I want to have a dog in Vienna?
Which dogs are in trend in the rich neighborhoods of Vienna?
What is the dog situation in my district (revisited)?

Get the Data¶

Let’s repeat the same step we did in the previous tutorials to load the dog stats dataset.

In [45]:

# Load libraries
import pandas as pd # CSV file processing
import numpy as np # vector and matrix manipulation

# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look up the row file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', skiprows=1, thousands=',', encoding='latin-1')

# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460

# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.columns = ["Breed_1", "Breed_2"]
data = pd.concat([data, separate_breeds], axis=1)

# Correct encoding for special characters in german alphabet
def to_utf(x):
    return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x   
data = data.applymap(to_utf)

# Aggregate
data = data.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data.columns = ["DISTRICT_CODE", "Dog Breed", "Dog Count"]

# Check the top of the table to make sure the dataset is loaded correctly 
data.head()

Out[45]:

	DISTRICT_CODE	Dog Breed	Dog Count
0	90100	Afghanischer Windhund	1
1	90100	Amerikanischer Pit-Bullterrier	1
2	90100	Amerikanischer Staffordshire-Terrier	5
3	90100	Australian Shepherd Dog	3
4	90100	Australian Terrier	1

Get More Data¶

Districts of any city (town) are not homogeneous in terms of area and population. Also in Vienna some districts are bigger in terms of territory, more densely populated or wealthier than other district. These factors shall also have an effect on the dog population within the specified regions.

Now we need to fetch this kind of data (population, area and average income per district in Vienna) in order to check its correlation with our dog stats data. https://www.data.gv.at is the biggest Austrian open data portal. Let’s have a look there!

Population Statistics per District¶

Population (Bevölkerung) seems just right for the general category, filtering on Vienna, CSV data format and community districts (Gemeindebezirke) boils down the search results to a single dataset!

Next, we just repeat the same steps we used previously to obtain the dog stats dataset:

In [46]:

# Load the csv file from the open data portal
# Dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_viebevlkerungseit1869wien/resource/c9e9d0cc-8f71-4521-b598-17d71fab2f59
data_path = 'https://www.wien.gv.at/statistik/ogd/vie_101.csv'
# Look up the row file and specify the dataset format, e.g. delimiters
data_population = pd.read_csv(data_path, delimiter=';', skiprows=1)
# Check the top of the table to make sure the dataset is loaded correctly 
data_population.head()

Out[46]:

	NUTS1	NUTS2	NUTS3	DISTRICT_CODE	SUB_DISTRICT_CODE	POP_TOTAL	POP_MEN	POP_WOMEN	REF_DATE
0	AT1	AT13	AT130	90100	.	68079	.	.	18691231
1	AT1	AT13	AT130	90200	.	68442	.	.	18691231
2	AT1	AT13	AT130	90300	.	88678	.	.	18691231
3	AT1	AT13	AT130	90400	.	55682	.	.	18691231
4	AT1	AT13	AT130	90500	.	54010	.	.	18691231

The dataset title informs us that the dataset contains data dating from 1869 year. We would prefer more fresh population statistics to be able to compare with the dog stats that we have dating from 2016 12 01.

Unfortunately it is not clear from the dataset description which time periods it contains:
„VIE-Bevölkerung in den 23 Wiener Gemeindebezirken nach Geschlecht seit 1869
Zuletzt aktualisiert 31. März 2016
Erstellt 31. März 2016“

Let us examine the REF_DATE column, which seems to contain the reference date here.

In [47]:

# Check unique column values
data_population['REF_DATE'].unique()

Out[47]:

array([18691231, 18801231, 19001231, 19101231, 19230307, 19340322,
       19390517, 19510601, 19610321, 19710512, 19810512, 19910515,
       20010515, 20020101, 20030101, 20040101, 20050101, 20060101,
       20070101, 20080101, 20090101, 20100101, 20110101, 20120101,
       20130101, 20140101, 20150101])

The last measurement is from 2015 (same as on Wikipedia: https://de.wikipedia.org/wiki/Wiener_Gemeindebezirke#cite_note-11). Let’s take it.

In [48]:

# Slice dataset: filter specific rows and columns
population_2015 = data_population[(data_population['REF_DATE'] == 20150101)][['DISTRICT_CODE', 'POP_TOTAL']]
# Sort column
population_2015.sort_values(ascending=False, by='POP_TOTAL').head()

Out[48]:

	DISTRICT_CODE	POP_TOTAL
630	91000	189713
642	92200	172978
641	92100	151844
622	90200	101702
636	91600	100738

Analyse the Data¶

1. Why some districts have more/less dogs than others?¶

Very good we have the population count per district. Let’s compare it with the dog counts. For this we need to aggregate the dog counts by district and join on the common column, which is the DISTRICT_CODE.

In [50]:

# Aggregate counts
district_dogs = data.groupby('DISTRICT_CODE')['Dog Count'].aggregate(np.sum)
# Join Data Frame and Series objects on the key column
dog_vs_people = population_2015.join(district_dogs, on='DISTRICT_CODE')
dog_vs_people.sort_values(ascending=False, by='POP_TOTAL').head()

Out[50]:

	DISTRICT_CODE	POP_TOTAL	Dog Count
630	91000	189713	5333
642	92200	172978	8504
641	92100	151844	7143
622	90200	101702	2545
636	91600	100738	2077

In [51]:

dog_vs_people[['POP_TOTAL', 'Dog Count']].corr()

Out[51]:

	POP_TOTAL	Dog Count
POP_TOTAL	1.000000	0.903501
Dog Count	0.903501	1.000000

As expected, dog population is positively correlated with human population (dogs usually live together with people in Vienna).

Therefore, we need to consider the district population when comparing the number of dogs across districts.

2. Where should I live if I want to have a dog in Vienna?¶

In [52]:

dog_vs_people['Dogs_per_person'] = dog_vs_people['Dog Count'] / dog_vs_people['POP_TOTAL']
# Sort by column
dog_vs_people.sort_values(ascending=True, by='Dogs_per_person').round(3).head()

Out[52]:

	DISTRICT_CODE	POP_TOTAL	Dog Count	Dogs_per_person
635	91500	76320	1352	0.018
625	90500	54246	982	0.018
628	90800	24518	459	0.019
627	90700	31291	619	0.020
624	90400	31691	632	0.020

5th and 15th districts have the smallest proportion of dogs with respect to the number of inhabitants (18 dogs per 1000 people). We do not yet know why but it is not likely that dog-haters would agree to live closer to each other. There must be some other factors, such as necessary conditions required to have a dog, e.g. dog playgrounds, that influence the decision to have or not to have a dog.

In [53]:

# Sort by column
dog_vs_people.sort_values(ascending=False, by='Dogs_per_person').round(3).head()

Out[53]:

	DISTRICT_CODE	POP_TOTAL	Dog Count	Dogs_per_person
642	92200	172978	8504	0.049
641	92100	151844	7143	0.047
643	92300	96775	4136	0.043
639	91900	69999	2745	0.039
633	91300	52115	1990	0.038

In [57]:

# Load library for visualisation
import seaborn as sns
# Command to show plots in notebook
%matplotlib inline
ax = sns.barplot(palette='colorblind', data=dog_vs_people, y=dog_vs_people.DISTRICT_CODE, x=dog_vs_people.Dogs_per_person, orient='h')
ax.set(ylabel='District', xlabel='Dogs_per_person')

Out[57]:

[<matplotlib.text.Text at 0x11428c978>, <matplotlib.text.Text at 0x1142692b0>]

22nd and 21st districts have the biggest dog populations relative to the number of inhabitants in the district: 49 and 47 dogs per 1000 people respectively. Meaning that if you live in one of this districts you are more likely to have a dog.

Do these districts provide more favorable conditions for the dog owners? Or is it the kind of people that prefer to live in these areas also like to have dogs?

Let us explore the average income per district and whether it has an influence on the dog stats.

Get More Data¶

Income Statistics per District¶

We integrate another dataset for our dog statistics analysis, which luckily follows the same logic (data format) as the previous two datasets, since it is from the same publisher (wien.gv.at).

This dataset was relatively easy to find when searching with the keyword ‚income‘ (Einkommen) on the Austrian State Open Data portal https://www.data.gv.at.

In [58]:

# Load the csv file from the open data portal
# Dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_viewirtschaftwienergemeindebezirke1/resource/c1ad52a3-ca5c-44fc-a945-b6459f73ed98
data_path = 'https://www.wien.gv.at/statistik/ogd/vie_502.csv'
# The data format seems to be consistent for all the datasets from this publisher 
data_income = pd.read_csv(data_path, delimiter=';', skiprows=1)
# Check the top of the table to make sure the dataset is loaded correctly 
data_income.head()

Out[58]:

	NUTS1	NUTS2	NUTS3	DISTRICT_CODE	SUB_DISTRICT_CODE	INCOME_TOTAL	INCOME_MEN	INCOME_WOMEN	AVERAGE_INCOME_TOTAL	AVERAGE_INCOME_MEN	AVERAGE_INCOME_WOMEN	AVERAGE_INCOME_FULLTIME_TOTAL	AVERAGE_INCOME_FULLTIME_MEN	AVERAGE_INCOME_FULLTIME_WOMEN	REF_DATE
0	AT1	AT13	AT130	90100	.	292.103.000	194.301.000	97.802.000	25.463	31.961	18.536	34.506	42.806	24.963	20021231
1	AT1	AT13	AT130	90200	.	919.962.000	559.682.000	360.280.000	16.439	18.301	14.282	22.015	24.312	19.067	20021231
2	AT1	AT13	AT130	90300	.	1.011.882.000	608.994.000	402.888.000	18.701	21.444	15.804	24.617	27.968	20.739	20021231
3	AT1	AT13	AT130	90400	.	402.479.000	243.807.000	158.672.000	20.325	23.641	16.876	27.348	31.586	22.539	20021231
4	AT1	AT13	AT130	90500	.	554.361.000	327.128.000	227.233.000	16.258	17.937	14.395	22.059	24.198	19.454	20021231

In [59]:

# Check unique column values
data_income['REF_DATE'].unique()

Out[59]:

array([20021231, 20031231, 20041231, 20051231, 20061231, 20071231,
       20081231, 20091231, 20101231, 20111231, 20121231, 20131231, 20141231])

The last measurement is from 2014. Let’s take it.

In [60]:

# Slice dataset: filter specific rows and columns
income_2014 = data_income[(data_income['REF_DATE'] == 20141231)][['DISTRICT_CODE', 'AVERAGE_INCOME_TOTAL']]
# Sort column
income_2014.sort_values(ascending=False, by='AVERAGE_INCOME_TOTAL').head()

Out[60]:

	DISTRICT_CODE	AVERAGE_INCOME_TOTAL
276	90100	32.493
288	91300	27.604
294	91900	25.803
279	90400	24.199
293	91800	24.164

All right, this is common knowledge 😉 The wealthiest are 1st and 13th districts of Vienna.

In [61]:

income_2014.sort_values(ascending=True, by='AVERAGE_INCOME_TOTAL').head()

Out[61]:

	DISTRICT_CODE	AVERAGE_INCOME_TOTAL
290	91500	16.799
295	92000	17.861
285	91000	18.340
291	91600	18.733
280	90500	18.952

The poorest are the 15th and 20th districts.

In [62]:

# Plot pandas data frame
ax = sns.barplot(data=income_2014,palette='colorblind', y='DISTRICT_CODE', x='AVERAGE_INCOME_TOTAL', orient='h')
ax.set(ylabel='District', xlabel='Average Income (thousands)')

Out[62]:

[<matplotlib.text.Text at 0x114464048>, <matplotlib.text.Text at 0x1144c10f0>]

Now let’s join all our datasets with district stats.

In [63]:

# Merge data frames on the key column
dog_vs_people_vs_income = dog_vs_people.merge(income_2014, on='DISTRICT_CODE')
dog_vs_people_vs_income.sort_values(ascending=False, by='AVERAGE_INCOME_TOTAL').head()

Out[63]:

	DISTRICT_CODE	POP_TOTAL	Dog Count	Dogs_per_person	AVERAGE_INCOME_TOTAL
0	90100	16339	465	0.028460	32.493
12	91300	52115	1990	0.038185	27.604
18	91900	69999	2745	0.039215	25.803
3	90400	31691	632	0.019943	24.199
17	91800	49178	1316	0.026760	24.164

In [64]:

dog_vs_people_vs_income.corr()

Out[64]:

	DISTRICT_CODE	POP_TOTAL	Dog Count	Dogs_per_person	AVERAGE_INCOME_TOTAL
DISTRICT_CODE	1.000000	0.453962	0.563003	0.592619	-0.213271
POP_TOTAL	0.453962	1.000000	0.903501	0.554129	-0.455340
Dog Count	0.563003	0.903501	1.000000	0.813967	-0.178365
Dogs_per_person	0.592619	0.554129	0.813967	1.000000	0.273133
AVERAGE_INCOME_TOTAL	-0.213271	-0.455340	-0.178365	0.273133	1.000000

There is no strong correlation between the average income and the number of dogs. Everyone likes to have a dog. Maybe a different kind of dog though…

Is there a correlation between a dog breed and the income of its owners?

3. Which dogs are in trend in the rich neighborhoods of Vienna?¶

We need to go back to the breed-level district statistics to answer this question.

In [198]:

# Filter out specific rows
filtered_dogs = data.loc[data['Dog Breed']!='Unbekannt']
filtered_dogs.head()

Out[198]:

	DISTRICT_CODE	Dog Breed	Dog Count
0	90100	Afghanischer Windhund	1
1	90100	Amerikanischer Pit-Bullterrier	1
2	90100	Amerikanischer Staffordshire-Terrier	5
3	90100	Australian Shepherd Dog	3
4	90100	Australian Terrier	1

Next, we need the relative counts instead of the absolute counts for the breeds to be able to compare across districts.

In [218]:

district_stats = filtered_dogs.merge(dog_vs_people, on='DISTRICT_CODE')
district_stats['Breed_Percent'] = ((district_stats['Dog Count_x'] / district_stats['Dog Count_y']) * 100).astype(int)
# Drop 0 values
district_stats = district_stats[(district_stats['Breed_Percent']>0)]
district_stats.sort_values(ascending=False, by='Breed_Percent').head()

Out[218]:

	DISTRICT_CODE	Dog Breed	Dog Count_x	POP_TOTAL	Dog Count_y	Dogs_per_person	Breed_Percent
2610	91900	Labrador Retriever	193	69999	2745	0.039215	7
1672	91300	Labrador Retriever	148	52115	1990	0.038185	7
2432	91800	Labrador Retriever	94	49178	1316	0.026760	7
51	90100	Labrador Retriever	35	16339	465	0.028460	7
1295	91100	Chihuahua kurzhaariger Schlag	271	95198	3539	0.037175	7

In [219]:

# Create a “pivot” table based on column values unfolding the original table stats
unfolded_breeds = district_stats.pivot_table(index='DISTRICT_CODE', columns='Dog Breed', values='Breed_Percent').fillna(0).astype(int)
unfolded_breeds.reset_index(level=0, inplace=True)
unfolded_breeds.head()

Out[219]:

Dog Breed	DISTRICT_CODE	Amerikanischer Pit-Bullterrier	Amerikanischer Staffordshire-Terrier	Beagle	Border-Collie	Boxer	Chihuahua kurzhaariger Schlag	Chihuahua langhaariger Schlag	…	Shih Tzu	Siberian Husky	Toy-Pudel	Weimaraner	West Highland White Terrier	Yorkshire Terrier	Zwergpinscher	Zwergspitz
0	90100	0	1	1	1	1	2	0	…	1	0	1	1	1	1	0	0
1	90200	0	2	1	1	0	5	1	…	1	0	0	0	0	3	1	1
2	90300	0	2	0	1	0	5	0	…	0	1	0	0	1	1	1	0
3	90400	0	1	2	0	0	2	1	…	0	0	0	0	1	1	0	0
4	90500	1	3	1	0	0	4	1	…	0	0	0	0	0	2	1	0

5 rows × 40 columns

In [221]:

breeds_vs_income = unfolded_breeds.merge(income_2014, on='DISTRICT_CODE')
c = breeds_vs_income.corr()
s = c.unstack()
s.sort_values(ascending=False)['AVERAGE_INCOME_TOTAL']

Out[221]:

AVERAGE_INCOME_TOTAL                    1.000000
Labrador Retriever                      0.741681
Toy-Pudel                               0.633378
Schnauzer                               0.633378
Parson-Russell-Terrier                  0.633378
Weimaraner                              0.633378
Flat Coated Retriever                   0.633378
Magyar Vizsla                           0.549162
West Highland White Terrier             0.434112
Pudel                                   0.418268
Rauhhaardackel (Zwergdackel)            0.418139
Border-Collie                           0.388188
Boxer                                   0.349132
Englischer Cockerspaniel                0.223980
Beagle                                  0.223901
Golden Retriever                        0.211484
Dackel                                  0.178157
Zwergschnauzer                          0.089984
Drahthaar-Foxterrier                    0.084796
Spitz                                   0.084796
Cavalier King Charles Spaniel           0.082323
Australian Shepherd Dog                 0.043597
Siberian Husky                          0.030186
Collie                                  0.023761
Mops                                   -0.082396
Jack-Russell-Terrier                   -0.117556
Rottweiler                             -0.160320
Shih Tzu                               -0.174853
Havaneser                              -0.204154
DISTRICT_CODE                          -0.213271
Französische Bulldogge                 -0.296468
Deutscher Schäferhund                  -0.302884
Zwergspitz                             -0.333309
Pekinese                               -0.499141
Zwergpinscher                          -0.499712
Malteser                               -0.538629
Yorkshire Terrier                      -0.561689
Chihuahua langhaariger Schlag          -0.591629
Chihuahua kurzhaariger Schlag          -0.688575
Amerikanischer Pit-Bullterrier         -0.696840
Amerikanischer Staffordshire-Terrier   -0.801987
dtype: float64

That is it! Labradors and Toy-Pudels are positively correlated with higher incomes. Staff, Pit-Bull and Chihuahuas with lower.

4. Open Data for Local Communities¶

Let us analyse district statistics on the example of the 2nd district (Leopoldstadt).

You can replace the district code in the variable below to see the stats for your own district.

In [138]:

district = 90200
# Get the slice of the joint dataset according to the search query
dog_vs_people_vs_income.head()
district_dogs = dog_vs_people_vs_income.loc[dog_vs_people_vs_income.DISTRICT_CODE==district]
district_dogs

Out[138]:

	DISTRICT_CODE	POP_TOTAL	Dog Count	Dogs_per_person	AVERAGE_INCOME_TOTAL
1	90200	101702	2545	0.025024	19.887

Let us compare this stats with the averages across all Vienna districts.

In [139]:

dog_vs_people_vs_income.describe()

Out[139]:

	DISTRICT_CODE	POP_TOTAL	Dog Count	Dogs_per_person	AVERAGE_INCOME_TOTAL
count	23.000000	23.000000	23.000000	23.000000	23.000000
mean	91200.000000	78145.086957	2431.695652	0.028148	22.115696
std	678.232998	45913.577588	2111.114019	0.009538	3.571603
min	90100.000000	16339.000000	459.000000	0.017715	16.799000
25%	90650.000000	45030.000000	922.000000	0.020841	19.236000
50%	91200.000000	76320.000000	1990.000000	0.025111	22.505000
75%	91750.000000	95986.500000	2787.000000	0.034427	23.825500
max	92300.000000	189713.000000	8504.000000	0.049162	32.493000

2nd district is among the largest districts in Vienna in terms of population (in the 4th quantile: 95,986 < 101,702 < 189,713).

It is below average with respect to the official income statistics (in the 2nd quantile: 19,236 < 19,887 < 22,505).

There is a single dog per 40 inhabitants (25 dogs/1,000 people) which is the average in Vienna.

Lessons Learned¶

Dogs in Vienna¶

Based on the data available we were able to provide comprehensive answers to the set of research questions proposed in the introduction.

The dogs are distributed propotional to the number of people across the districts.
However, some districts seem to have fewer dogs than expected (5th and 15th) and some more (21th and 22th), which points to other factors that may influence the decision to acquire a dog pet, such as availability of dog playgrounds in the area or ability to sustain the dog, i.e. income of the owner.
Labradors, Toy-Pudels, Schnauzer, Parson-Russell Terriers, Weimaraner, Flat Coated Retrievers, Magyar Vizsla and West Highland White Terriers are the favorites of the upscale districts in Vienna. This may correlate with the dog prices as well.
Leopoldstadt is a big relatively poor district with an adequate proportion of dogs. Statistics on population and average income provides us additional insights into the situation in our neighborhouds, which may be useful for other types of analysis as well.

Open Data Quality¶

1. Descriptions do not provide enough information on what kind of data the dataset contains, e.g. the date span in our example.
2. Formatting is consistent for the same publisher, which is the good news. Especially if it is a big publisher with many datasets, such as governmental agency.

Process¶

1. Find datasets, e.g. CSV files from open data portals
2. Refine: identify column separator, thousands separator, rows to skip, string encoding, etc.
3. Aggregate: group by different attributes, e.g. district or type, and sum up the counts.
4. Join datasets on the common key, e.g. district code identifier.
5. Check correlation between the columns.

Inspired by¶

Detail Analysis of various Hospital factors on Kaggle.
Wheelan, Charles J. Naked Statistics. 2013

Dogs in Vienna. Part 2: Open Data Integration Tutorial

Open Data Story¶

Research Questions¶

Get the Data¶

Get More Data¶

Population Statistics per District¶

Analyse the Data¶

1. Why some districts have more/less dogs than others?¶

2. Where should I live if I want to have a dog in Vienna?¶

Get More Data¶

Income Statistics per District¶

3. Which dogs are in trend in the rich neighborhoods of Vienna?¶

4. Open Data for Local Communities¶

Lessons Learned¶

Dogs in Vienna¶

Open Data Quality¶

Process¶

Inspired by¶

More Vienna Dog Data¶