Last year we analysed the state of Open Data in Switzerland in the article "The hitchhiker's guide to Swiss Open Government Data". One part of this overview was a visualization of the datasets that can be found on opendata.swiss. We thought it would be interesting to crawl the page again and see what has changed in five months' time. It turns out that quite a bit more has changed than we would have imagined.
Interactive Visualizations
We used our crawler to download and analyze all of the datasets (more on that in the Method section). The visualized results can be found below:
October 2017
The situation on October 30, 2017. Tabular data and geodata clearly represent the majority of the datasets. Yet, when it comes to file size, the image archives are the heavyweights. The entries section is not completely representative because not all datasets could be analyzed, and it is not necessarily clear how to count one "entry". See the Method section for more detailed information.
March 2018
The situation as of March 22, 2018. The Swiss Federal Statistical Office changed its policy and stopped providing direct downloads of its datasets. At the same time, it now provides over 1000 datasets, compared to around 600 last year. Apart from this, most of the growth in the number of datasets has come from previously smaller data providers like the Swiss Federal Office of Energy.
Observations
Exploring the data through the visualizations, we noticed a few changes:
Changed policy of the Federal Statistical Office (FSO)
While the number of datasets provided by the Swiss Federal Statistical Office increased, the office stopped providing direct links to its data. The data can now only be retrieved via the office's websites, with size limitations. This is exactly why the Global Open Data Index and the Open Data Barometer include the questions "Is the data downloadable at once/available as a whole?". It would be interesting to know what reasons the FSO had for this change.
New datasets mostly from formerly small data providers
The bigger data providers have not added as many new datasets in the last five months. The explanation is probably simple: the low-hanging fruit has all been published, and it is harder for them to find, prepare and publish new datasets than it is for the smaller data providers.
A few live-datasets contribute most of the entries
The visualization also shows that a few datasets have a lot (billions) of entries (see the Method section for a definition), while most have only a few thousand (the median is 8511 entries for the 2017 datasets and 3394 for the 2018 datasets).
The log-log plot hints at a power law distribution. This is mostly due to the live-datasets about public transportation from SBB and VBZ. Because this data has grown steadily and considerably over the last five months, it even manages to compensate for the pretty large FSO-datasets, which are missing in the 2018 analysis.
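To make the two statistics above more concrete, here is a minimal sketch of how a median and a log-log slope could be computed from a list of per-dataset entry counts. The function names and the example numbers are made up for illustration and are not taken from our crawler or its results. A roughly constant slope on the rank-versus-size plot is what hints at a power law; it is not a proof of one.

```python
import math
import statistics


def median_entries(counts):
    """Median of per-dataset entry counts."""
    return statistics.median(counts)


def loglog_slope(counts):
    """Least-squares slope of log10(count) against log10(rank).

    Datasets are ranked from largest to smallest. Needs at least two
    positive counts, otherwise the slope is undefined.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical entry counts, NOT the crawled data
example_counts = [2_000_000_000, 40_000_000, 900_000, 12_000, 8_500, 3_400, 700]

if __name__ == "__main__":
    print("median:", median_entries(example_counts))
    print("log-log slope:", round(loglog_slope(example_counts), 2))
```

A few huge datasets barely move the median, which is why it stays in the thousands even though the total is in the billions.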
Wrap-up
It's interesting to see how things have changed in just five months, and there are probably many more interesting insights to be found in the data. Feel free to run the crawler yourself or just check out our precrawled datasets.
Method
Information on individual datasets was retrieved via the opendata.swiss-API.
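As an illustration of this step, the sketch below pages through a dataset catalogue. It assumes that the portal exposes a CKAN-style Action API with a package_search endpoint under the base URL shown; both the base URL and the helper names are assumptions for this example and do not describe our actual crawler.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a CKAN-style Action API lives at this base URL.
API_BASE = "https://opendata.swiss/api/3/action"


def search_url(start, rows=100):
    """Build the URL for one page of package_search results."""
    return f"{API_BASE}/package_search?" + urlencode({"start": start, "rows": rows})


def parse_page(payload):
    """Return (total_count, datasets) from one decoded response."""
    result = payload["result"]
    return result["count"], result["results"]


def fetch_all(rows=100):
    """Collect the metadata of all datasets, one page at a time."""
    datasets, start = [], 0
    while True:
        with urlopen(search_url(start, rows)) as resp:
            total, page = parse_page(json.load(resp))
        datasets.extend(page)
        start += rows
        # Stop on an empty page or once every dataset has been requested
        if not page or start >= total:
            return datasets
```

Calling fetch_all() performs network requests, so it is worth being gentle with the page size and adding a pause between requests when crawling a public portal.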
To determine the precise file size and the number of entries of individual datasets, all the data was downloaded. We excluded all datasets which were not available via the API or which were not linked to directly. This is a first source of error; it concerns between 4% and 6% of the datasets. To determine the total size of one dataset, we added up the sizes of all the files that are part of it. This is the next source of error: some datasets consist of different files representing different data, while other datasets provide the same data multiple times in different formats.
Determining the number of entries is another area which is not clear-cut. We defined an entry differently depending on the type of data:
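The size calculation can be sketched in a few lines. The record layout below (a "bytes" field that is None when a file could not be downloaded) and the dataset names are hypothetical, chosen only to show where the two sources of error come from.

```python
def dataset_size(files):
    """Sum the sizes (in bytes) of all files that could be measured."""
    return sum(f["bytes"] for f in files if f["bytes"] is not None)


def unmeasured_share(datasets):
    """Fraction of datasets for which no file size could be measured.

    A dataset without any files also counts as unmeasured.
    """
    missing = sum(
        1 for files in datasets.values()
        if all(f["bytes"] is None for f in files)
    )
    return missing / len(datasets)


# Hypothetical datasets, NOT the crawled data
example = {
    "population": [  # same table offered twice, so it is counted twice
        {"format": "CSV", "bytes": 1000},
        {"format": "XLS", "bytes": 1500},
    ],
    "aerial-images": [{"format": "TIFF", "bytes": 500_000}],
    "energy": [{"format": "CSV", "bytes": 200}],
    "statistics-portal": [{"format": "HTML", "bytes": None}],  # no direct link
}
```

In this example, "population" is reported as 2500 bytes even though it arguably contains only one table, which is the kind of double counting mentioned above.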
- Tabular data (TSV, CSV, PC-AXIS, XLS): Number of fields/cells in a table.
- Geodata (Shapefiles): Number of entries (points, shapes) × number of attributes per entry
- Image data: Number of images
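The three definitions above can be sketched as follows. This is a simplified illustration under stated assumptions, not the crawler's code: the function names are made up, the tabular counter treats the header row as cells too, and the geodata counter assumes that every feature carries the same attributes as the first one.

```python
import csv
import io


def count_tabular_entries(text, delimiter=","):
    """Number of cells in a delimited table (header row included)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return sum(len(row) for row in reader)


def count_geodata_entries(features):
    """Number of features (points, shapes) times attributes per feature.

    `features` is a list of attribute dicts, one per feature.
    """
    if not features:
        return 0
    return len(features) * len(features[0])


def count_image_entries(image_paths):
    """Number of images in an archive listing."""
    return len(image_paths)


if __name__ == "__main__":
    table = "canton,year,population\nZH,2017,1504346\nBE,2017,1031126\n"
    print(count_tabular_entries(table))
```

Because a table cell, a shape attribute and an image are very different things, totals across data types should be read as an order of magnitude rather than as an exact figure.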
For some datasets, we could not calculate this value because we couldn't detect the format automatically or because the data was not directly accessible from opendata.swiss. In 2017 we managed to calculate a total for 90% of the datasets. This value dropped to only 60% in 2018. The main reason is that the Swiss Federal Statistical Office (one of the biggest data providers) stopped providing direct links to its data.
The visitor counts can't be accessed via the API and were provided by the Swiss Federal Archives. The visitor count for a dataset was calculated by summing the visits to all of its subsites.