Dogs in Vienna. Part 0: Loading the Data from the Vienna Open Data Portal

Author: Svitlana Vakulenko

This is a sample script showing how datasets from Vienna's Open Data portal can be analysed and visualized using Jupyter Notebooks. We take the dog statistics for Vienna as a sample use case to demonstrate common approaches to analysing open data. This is the first part of our data story, in which we focus on data loading and pre-processing.

Get the Data

Let’s load a dataset and make sure it is loaded correctly, so that we can use it for our analysis. This is often not as trivial as it may seem and involves a fair amount of data wrangling and debugging in order to find and eliminate possible errors or inconsistencies in the dataset.

This step should not be underestimated, since it determines the quality of the final result of our data analysis.
Remember: “Garbage in, garbage out!”
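Before picking any loader options, it helps to peek at the raw file. A minimal sketch (on a synthetic snippet, not the real portal file) using the standard library's `csv.Sniffer`:

```python
import csv

# Synthetic stand-in for the first lines of the portal file:
# a stray title row followed by a ';'-delimited header and data.
raw = "Hunderassen in Wien\nNUTS1;DISTRICT_CODE;Anzahl\nAT1;90100;1\n"

# Inspect the first few lines by eye ...
first_lines = raw.splitlines()[:3]
print(first_lines[0])  # the title row that pandas will have to skip

# ... and let csv.Sniffer guess the delimiter from a data line
dialect = csv.Sniffer().sniff(first_lines[1])
print(dialect.delimiter)  # ';'
```

Looking at the raw text first is what tells us which delimiter, encoding and skip options `pandas.read_csv` will need below.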

1. Delimiter

In [17]:
# Load the library for CSV file processing
import pandas as pd
# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look at the raw file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1')
data.head()
Out[17]:
Hunderassen in Wien | Breed of Dogs in Vienna Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8
0 NUTS1 NUTS2 NUTS3 DISTRICT_CODE SUB_DISTRICT_CODE Postal_CODE Dog Breed Anzahl Ref_Date
1 AT1 AT13 AT113 90100 . 1010 Afghanischer Windhund 1 20161201
2 AT1 AT13 AT113 90100 . 1010 Amerikanischer Pit-Bullterrier 1 20161201
3 AT1 AT13 AT113 90100 . 1010 Amerikanischer Staffordshire-Terrier 4 20161201
4 AT1 AT13 AT113 90100 . 1010 Amerikanischer Staffordshire-Terrier / Mischling 1 20161201

Oops, the first row of our CSV file contains the title of the table, which we should skip.

2. Skip rows

In [18]:
# Reload dataset
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1', skiprows=1)
# Check the top of the table to make sure the dataset is loaded correctly 
data.head()
Out[18]:
NUTS1 NUTS2 NUTS3 DISTRICT_CODE SUB_DISTRICT_CODE Postal_CODE Dog Breed Anzahl Ref_Date
0 AT1 AT13 AT113 90100 . 1010 Afghanischer Windhund 1 20161201
1 AT1 AT13 AT113 90100 . 1010 Amerikanischer Pit-Bullterrier 1 20161201
2 AT1 AT13 AT113 90100 . 1010 Amerikanischer Staffordshire-Terrier 4 20161201
3 AT1 AT13 AT113 90100 . 1010 Amerikanischer Staffordshire-Terrier / Mischling 1 20161201
4 AT1 AT13 AT113 90100 . 1010 Australian Shepherd Dog 3 20161201

The top rows look just fine. But what about the rest?

3. Dataset statistics

In [3]:
data.shape
Out[3]:
(5740, 9)

Our dataset contains 5740 rows and 9 columns.

In [4]:
# Check the column types to make sure the dataset is loaded correctly
data.dtypes
Out[4]:
NUTS1                object
NUTS2                object
NUTS3                object
DISTRICT_CODE         int64
SUB_DISTRICT_CODE    object
Postal_CODE           int64
Dog Breed            object
Anzahl               object
Ref_Date              int64
dtype: object

The count column (Anzahl) is not recognized as numeric data. We shall fix this!

4. Numeric data

In [20]:
# Check the rows that are not recognized as numeric
data[data.Anzahl.apply(lambda x: not x.isnumeric())]
Out[20]:
NUTS1 NUTS2 NUTS3 DISTRICT_CODE SUB_DISTRICT_CODE Postal_CODE Dog Breed Anzahl Ref_Date
1914 AT1 AT13 AT113 91000 . 1100 Unbekannt 1,51 20161201
2210 AT1 AT13 AT113 91100 . 1110 Unbekannt 1,029 20161201
4879 AT1 AT13 AT113 92100 . 1210 Unbekannt 2,066 20161201
5347 AT1 AT13 AT113 92200 . 1220 Unbekannt 2,46 20161201
5715 AT1 AT13 AT113 92300 . 1230 Unbekannt 1,215 20161201

Hmmm, that looks interesting: 1,51 of a dog. What could that possibly mean?

Maybe the comma is a thousands separator and trailing zeros got accidentally dropped, so that 1,51 actually stands for 1,510?
This does indeed look like a plausible explanation for these numbers. Let’s try to fix this.
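Before reloading, a generic way to flag such values (sketched here on synthetic data, not the step the notebook takes next) is `pd.to_numeric` with `errors='coerce'`, which turns every unparseable entry into NaN:

```python
import pandas as pd

# Synthetic Anzahl column containing two comma-formatted values
s = pd.Series(['1', '4', '1,51', '2,46'], name='Anzahl')

# errors='coerce' maps unparseable strings to NaN, so the bad
# rows can be selected with a simple mask
bad = s[pd.to_numeric(s, errors='coerce').isna()]
print(bad.tolist())  # ['1,51', '2,46']
```

Unlike the `str.isnumeric` check above, this also works when the column contains a mix of strings and already-parsed numbers.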

In [21]:
# Reload dataset
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1', skiprows=1, thousands=',')
# Check the column types to make sure the dataset is loaded correctly
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5740 entries, 0 to 5739
Data columns (total 9 columns):
NUTS1                5740 non-null object
NUTS2                5740 non-null object
NUTS3                5740 non-null object
DISTRICT_CODE        5740 non-null int64
SUB_DISTRICT_CODE    5740 non-null object
Postal_CODE          5740 non-null int64
Dog Breed            5740 non-null object
Anzahl               5740 non-null int64
Ref_Date             5740 non-null int64
dtypes: int64(4), object(5)
memory usage: 403.7+ KB

Looks good: count is recognized as integer (Anzahl int64).
But I want to make sure the numbers were converted correctly. Let’s find these rows.

In [22]:
# Filter rows based on multiple columns
data[(data['Dog Breed']=='Unbekannt') & (data['Postal_CODE']==1100)]
Out[22]:
NUTS1 NUTS2 NUTS3 DISTRICT_CODE SUB_DISTRICT_CODE Postal_CODE Dog Breed Anzahl Ref_Date
1914 AT1 AT13 AT113 91000 . 1100 Unbekannt 151 20161201

Oh no, that did not work like magic: stripping the comma turned 1,51 into 151 instead of 1510. We need to fix these cases manually.

In [23]:
# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460
# Filter rows based on multiple columns
data[(data['Dog Breed']=='Unbekannt') & (data['Postal_CODE']==1100)]
Out[23]:
NUTS1 NUTS2 NUTS3 DISTRICT_CODE SUB_DISTRICT_CODE Postal_CODE Dog Breed Anzahl Ref_Date
1914 AT1 AT13 AT113 91000 . 1100 Unbekannt 1510 20161201

5. Strings

Strings (textual data) can be also tricky to handle. Let’s have a closer look at the column containing dog breeds.

In [24]:
data['Dog Breed'].unique()
Out[24]:
array([u'Afghanischer Windhund', u'Amerikanischer Pit-Bullterrier',
       u'Amerikanischer Staffordshire-Terrier', ...,
       u'Windhund / Amerikanischer Staffordshire-Terrier',
       u'Wolfsspitz / Labrador Retriever', u'Zwergspitz / Dackel'], dtype=object)

It seems that mixed breeds (the parents are of different breeds) are entered using the separator ‘/’. See row 3: Amerikanischer Staffordshire-Terrier / Mischling. Let’s split such strings into separate columns to be able to aggregate this kind of data easily.

In [25]:
# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.head()
Out[25]:
0 1
0 Afghanischer Windhund None
1 Amerikanischer Pit-Bullterrier None
2 Amerikanischer Staffordshire-Terrier None
3 Amerikanischer Staffordshire-Terrier Mischling
4 Australian Shepherd Dog None

OK. Let’s give names to the new columns and add them to our dataset.

In [15]:
separate_breeds.columns = ["Breed_1", "Breed_2"]
data_breeds = pd.concat([data, separate_breeds], axis=1)
# Check the bottom of the table for a change
data_breeds.tail()
Out[15]:
NUTS1 NUTS2 NUTS3 DISTRICT_CODE SUB_DISTRICT_CODE Postal_CODE Dog Breed Anzahl Ref_Date Breed_1 Breed_2
5735 AT1 AT13 AT113 92300 . 1230 ZwergschÃ¤ferhund / Mischling 1 20161201 ZwergschÃ¤ferhund Mischling
5736 AT1 AT13 AT113 92300 . 1230 Zwergschnauzer 7 20161201 Zwergschnauzer None
5737 AT1 AT13 AT113 92300 . 1230 Zwergspitz 14 20161201 Zwergspitz None
5738 AT1 AT13 AT113 92300 . 1230 Zwergspitz / Dackel 1 20161201 Zwergspitz Dackel
5739 AT1 AT13 AT113 92300 . 1230 Zwergspitz / Mischling 9 20161201 Zwergspitz Mischling

Oops, some of the special characters are not recognized correctly; we shall fix that!

In [16]:
# Fix the encoding of special characters in the German alphabet:
# the file is UTF-8, but we decoded it as Latin-1, so we reverse that step
def to_utf(x):
    return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x
data_breeds_corrected = data_breeds.applymap(to_utf)
data_breeds_corrected.tail()
Out[16]:
NUTS1 NUTS2 NUTS3 DISTRICT_CODE SUB_DISTRICT_CODE Postal_CODE Dog Breed Anzahl Ref_Date Breed_1 Breed_2
5735 AT1 AT13 AT113 92300 . 1230 Zwergschäferhund / Mischling 1 20161201 Zwergschäferhund Mischling
5736 AT1 AT13 AT113 92300 . 1230 Zwergschnauzer 7 20161201 Zwergschnauzer None
5737 AT1 AT13 AT113 92300 . 1230 Zwergspitz 14 20161201 Zwergspitz None
5738 AT1 AT13 AT113 92300 . 1230 Zwergspitz / Dackel 1 20161201 Zwergspitz Dackel
5739 AT1 AT13 AT113 92300 . 1230 Zwergspitz / Mischling 9 20161201 Zwergspitz Mischling
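The round-trip inside `to_utf` can be seen on a single string (a minimal sketch): ‘ä’ is the UTF-8 byte pair 0xC3 0xA4, which Latin-1 decodes as the two characters ‘Ã¤’, so re-encoding as Latin-1 and decoding as UTF-8 restores the original character.

```python
# Mojibake produced by decoding UTF-8 bytes as Latin-1 ...
garbled = 'ZwergschÃ¤ferhund'

# ... is repaired by reversing the wrong decoding step
fixed = garbled.encode('latin-1').decode('utf8')
print(fixed)  # Zwergschäferhund
```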

6. Understand your data

In [17]:
# Check descriptive statistics
data_breeds_corrected.describe()
Out[17]:
DISTRICT_CODE Postal_CODE Anzahl Ref_Date
count 5740.000000 5740.000000 5740.000000 5740.0
mean 91355.000000 1135.500000 9.743728 20161201.0
std 668.274742 66.827474 61.095023 0.0
min 90100.000000 1010.000000 1.000000 20161201.0
25% 90900.000000 1090.000000 1.000000 20161201.0
50% 91400.000000 1140.000000 2.000000 20161201.0
75% 92000.000000 1200.000000 5.000000 20161201.0
max 92300.000000 1230.000000 2460.000000 20161201.0

The district codes range from 90100 to 92300 and the postal codes from 1010 to 1230, matching the 23 districts of Vienna (Wiener Bezirke).
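Assuming Vienna postal codes follow the 1&lt;district&gt;0 pattern (1010 = 1st district, …, 1230 = 23rd district), the district number can be recovered with a small helper (a sketch; `district_from_postal` is not part of the dataset or notebook):

```python
# Extract the district number from a Viennese postal code,
# assuming the 1<district>0 convention (1010 -> 1, 1230 -> 23)
def district_from_postal(postal_code):
    return (postal_code % 1000) // 10

print(district_from_postal(1010))  # 1
print(district_from_postal(1230))  # 23
```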

In [18]:
data_breeds_corrected['Ref_Date'].unique()
Out[18]:
array([20161201])

All rows describe the data for a single reference date: 20161201. Since the date format is not explicitly documented, it is not entirely clear whether this is the 1st of December (YYYYMMDD) or the 12th of January (YYYYDDMM).
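If we assume the common YYYYMMDD interpretation, the reference date can be parsed explicitly so that downstream code does not have to guess (a sketch under that assumption):

```python
import pandas as pd

# Parse Ref_Date assuming the YYYYMMDD format
ref = pd.to_datetime('20161201', format='%Y%m%d')
print(ref.date())  # 2016-12-01
```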

Essentially the dataset boils down to the information: District | Dog Breed | Dog Count

To simplify further analysis, we drop the information about mixed breeds and take the first breed as the main dog breed.

In [19]:
# Load the library for vector/matrix manipulations
import numpy as np
data_breed_1 = data_breeds_corrected.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data_breed_1
Out[19]:
DISTRICT_CODE Breed_1 Anzahl
0 90100 Afghanischer Windhund 1
1 90100 Amerikanischer Pit-Bullterrier 1
2 90100 Amerikanischer Staffordshire-Terrier 5
3 90100 Australian Shepherd Dog 3
4 90100 Australian Terrier 1
5 90100 Basset Hound 1
6 90100 Beagle 8
7 90100 Beauceron 1
8 90100 Bergamasker Hirtenhund 1
9 90100 Berner Sennenhund 1
10 90100 Bernhardiner 1
11 90100 Bichon Frisé 1
12 90100 Bolonka Zwetna 3
13 90100 Border-Collie 5
14 90100 Bosnischer Rauhhaariger Laufhund 1
15 90100 Boston Terrier 2
16 90100 Boxer 5
17 90100 Bracke 2
18 90100 Bulldog 1
19 90100 Bullterrier 2
20 90100 Cairn-Terrier 1
21 90100 Cao de Agua Portugues 2
22 90100 Cavalier King Charles Spaniel 3
23 90100 Chihuahua kurzhaariger Schlag 12
24 90100 Chihuahua langhaariger Schlag 1
25 90100 Chow-Chow 2
26 90100 Cockerpoo 2
27 90100 Collie 2
28 90100 Dackel 14
29 90100 Dalmatiner 1
3433 92300 Sheltie 21
3434 92300 Shiba Inu 12
3435 92300 Shih Tzu 33
3436 92300 Siberian Husky 31
3437 92300 Soft-Coated Wheaten Terrier 1
3438 92300 Spanischer Windhund 7
3439 92300 Spinone 1
3440 92300 Spitz 21
3441 92300 Staffordshire-Bullterrier 16
3442 92300 Steirische Rauhhaarige Hochgebirgsbracke 1
3443 92300 Tervueren 1
3444 92300 Thai Ridgeback 1
3445 92300 Tibet-Spaniel 1
3446 92300 Tibet-Terrier 9
3447 92300 Unbekannt 1215
3448 92300 Weimaraner 2
3449 92300 Weißer Schweizer Schäferhund 8
3450 92300 Welsh Corgi Cardigan 1
3451 92300 Welsh Terrier 3
3452 92300 West Highland White Terrier 40
3453 92300 Whippet 11
3454 92300 Windhund 1
3455 92300 Wolfsspitz 4
3456 92300 Yorkshire Terrier 87
3457 92300 Zwergpinscher 60
3458 92300 Zwergpudel 14
3459 92300 Zwergschnauzer 7
3460 92300 Zwergschäferhund 1
3461 92300 Zwergspitz 24
3462 92300 Österreichischer Pinscher 3

3463 rows × 3 columns

Now the data is ready for analysis!

Lessons Learned

Open Data Quality

1. Descriptions are usually absent or incomplete.
You need to look inside the file and spend some time understanding what the dataset contains.
2. Column names in the header can be cryptic, e.g., NUTS3.
3. Correct formatting is not guaranteed either.
You need to figure out the correct way to load the dataset, including delimiters, encoding etc.

Steps

1. Find a dataset, e.g. a CSV file from an open data portal.
2. Refine:
2.1. Process CSV: identify column separator, thousands separator, rows to skip, string encoding, etc.
2.2. Identify numeric data, e.g. sums or counts, that can be compared and aggregated.
2.3. Split strings into several columns, when necessary.
2.4. Correct some of the data values manually, when necessary.
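The refinement steps above can be sketched as one small loading function. This is a sketch against a synthetic CSV snippet, not the live portal URL, and `load_dogs` is a name introduced here for illustration:

```python
import io
import pandas as pd

# Synthetic stand-in for the portal file: a title row to skip,
# ';' as delimiter, a comma as thousands separator, and a
# mixed-breed string joined with ' / '
raw = (
    "Hunderassen in Wien\n"
    "DISTRICT_CODE;Dog Breed;Anzahl\n"
    "90100;Zwergspitz / Dackel;1\n"
    "91000;Unbekannt;1,510\n"
)

def load_dogs(buf):
    # Steps 2.1-2.2: skip the title row and parse the counts as numbers
    df = pd.read_csv(buf, delimiter=';', skiprows=1, thousands=',')
    # Step 2.3: split mixed breeds into separate columns
    breeds = df['Dog Breed'].str.split(' / ', expand=True)
    breeds.columns = ['Breed_1', 'Breed_2']
    return pd.concat([df, breeds], axis=1)

data = load_dogs(io.StringIO(raw))
print(data['Anzahl'].tolist())   # [1, 1510]
print(data['Breed_1'].tolist())  # ['Zwergspitz', 'Unbekannt']
```

Step 2.4, correcting individual values with `data.loc[...]`, stays a manual decision and is deliberately left out of the function.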

Prerequisites

To run this script (Download) on a local machine you need:

  • Python 3.4.
  • pandas
  • numpy

Inspired by

  1. Kaggle project: Detail Analysis of various Hospital factors
  2. Wheelan, Charles J. Naked Statistics. 2013

This code is on GitHub!