This is a short tale of how I took another city’s data to expand the information we can provide for people and apps to consume. It also serves as a warning about some of the unexpected side effects of using automatically generated data that has not been checked by a human.
Preparing for new flavours of data
The original Aberdeen data within MatchTheCity was initially scraped or manually entered. The second phase saw the use of official feeds provided by a third party that are also used within the venues’ websites.
The challenge of adding Edinburgh data was an exercise in seeing how generic, flexible and reusable I had made the MatchTheCity server. It would be an experiment in migrating the data from one format into the format I required. It would also help to identify shortcomings in the design so they could be corrected.
Prior to the EdinburghApps event I upgraded MatchTheCity to include a region for each venue. This allowed a single MatchTheCity server instance to host data for multiple cities.
Finding the new data
Before I could actually add Edinburgh data I first had to find it. Using the Edinburgh CKAN server allowed me to explore their unfamiliar data sets. Searching for the keyword ‘sports’, I found a very useful set of venues.
I wrote a simple importer in Ruby to take the CSV file and place the data into MatchTheCity’s Venues table. Once imported into MatchTheCity I was quickly able to see them on the map of my proof of concept Android app.
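As a rough illustration of that import step, here is a minimal sketch. It is written in Python rather than the Ruby of the actual importer, and the column names, table schema and sample rows are all assumptions, since the post does not show the real CSV layout or the MatchTheCity schema. The `region` column reflects the per-venue region mentioned earlier.

```python
import csv
import io
import sqlite3

# Hypothetical CSV layout and rows; the real Edinburgh file's columns are not shown in this post.
SAMPLE_CSV = """name,address,latitude,longitude
Example Sports Centre,1 Example Street,55.95,-3.19
Example Pool,2 Example Road,55.94,-3.2
"""


def import_venues(conn, csv_file, region):
    """Insert each CSV row into an assumed venues table, tagged with its region."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS venues (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               address TEXT,
               latitude REAL,
               longitude REAL,
               region TEXT NOT NULL
           )"""
    )
    rows = [
        (row["name"], row["address"], float(row["latitude"]), float(row["longitude"]), region)
        for row in csv.DictReader(csv_file)
    ]
    # Bound parameters keep names containing quotes or apostrophes safe.
    conn.executemany(
        "INSERT INTO venues (name, address, latitude, longitude, region) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
imported = import_venues(conn, io.StringIO(SAMPLE_CSV), region="Edinburgh")
```

Passing the region as an argument means the same function could load another city's file without changes, assuming that file uses the same columns.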
This new data did highlight a bug in the Android app that the Aberdeen data had never triggered. One of the venues was St Margret’s, and the apostrophe in the name was causing problems in the SQL query used to check whether the venue already exists. Some digging around StackOverflow pointed me towards a fix.
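The post does not say which fix was applied, but a common remedy for this class of bug is to bind the venue name as a query parameter instead of pasting it into the SQL string. The sketch below shows the idea using Python's `sqlite3` module; the table and function names are hypothetical. Android's `SQLiteDatabase` supports the same `?` placeholder style through its selection arguments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venues (name TEXT)")  # hypothetical minimal table


def venue_exists(conn, name):
    """Check for a venue by name using a bound parameter, not string concatenation."""
    row = conn.execute(
        "SELECT 1 FROM venues WHERE name = ? LIMIT 1", (name,)
    ).fetchone()
    return row is not None


# In a hand-built query the apostrophe would end the '...' string literal early.
venue_name = "St Margret's"
conn.execute("INSERT INTO venues (name) VALUES (?)", (venue_name,))
found = venue_exists(conn, venue_name)
```

Because the driver passes the value separately from the SQL text, no manual escaping of the apostrophe is needed.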
Data wrong turn
On closer inspection it turned out that the Sports and Recreation venue data I had used was not the actual data I required. More exploring of the CKAN server revealed my targets were under the heading of Edinburgh Leisure data rather than Sports and Recreational Facilities.
This new data consisted of 3 CSV files:
- Leisure Sites – list of venues that map to MatchTheCity venues
- Leisure Activities – list of activity names and reference IDs that map to MatchTheCity activities
- Leisure Classes – list of actual classes taking place at the sites that map to MatchTheCity opportunities
The initial list of venues had included geolocation co-ordinates. However, the second list did not, and its venue names also differed from those in the first. As the classes referenced the second venues list, I decided to remove the first list from the MatchTheCity server.
In order to get the geolocation co-ordinates I called out to the Open Street Map Nominatim service to geocode them from the addresses. This was not 100% successful as two of the venues failed to resolve.
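A lookup of this kind might look something like the sketch below, which uses Nominatim's public search endpoint with `format=json`. The function names and the User-Agent string are assumptions made for illustration, not the code actually used. Returning `None` for an empty result models the venues that failed to resolve.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(address):
    """Build a free-text search URL that asks for at most one JSON result."""
    query = urllib.parse.urlencode({"q": address, "format": "json", "limit": 1})
    return f"{NOMINATIM_SEARCH}?{query}"


def parse_first_result(body):
    """Return (lat, lon) as floats from a JSON response, or None if nothing matched."""
    results = json.loads(body)
    if not results:
        return None
    # Nominatim returns the co-ordinates as strings.
    return float(results[0]["lat"]), float(results[0]["lon"])


def geocode(address, user_agent="example-importer/0.1"):  # hypothetical User-Agent
    """Look up one address; callers must handle None for unresolved venues."""
    request = urllib.request.Request(
        build_search_url(address), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_first_result(response.read().decode("utf-8"))
```

The public Nominatim service asks clients to identify themselves and to limit requests to about one per second, so a bulk import should pause between calls and record which addresses came back empty for manual follow-up.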
After the classes had been imported there were about 18,500 new events in the MatchTheCity server. Compared to the ~650 in the Aberdeen dataset this was a massive number of events, although more were to be expected given the size of Edinburgh compared to Aberdeen.
However, when I started to browse the new events I discovered clues as to why there were so many new entries. Amongst the familiar class names such as archery, body pump, etc., some novel exercise class names were appearing. Floor Walking was an activity on offer at several venues. Now these were either classes for people wishing to follow in the footsteps of Captain Peacock or some internal activity going on. Classes such as Cleaning, Administration, Break and Not Booked confirmed that the data was just a straight dump of the entire calendar rather than a list of public classes.
In order to make the data relevant to the consumers, at some point I’m going to filter out these events and reimport. In the long term the hope is that the data will be cleaned prior to it being published on the Edinburgh open data server.
Although maybe filtering this out is a premature reaction based on my needs rather than those of others. For instance, maybe someone can make use of the Not Booked events in order to allow individuals to make use of these empty spaces for other purposes?
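One way to keep both options open is to filter at import time with a switch for the Not Booked entries. The sketch below is only an illustration: the exclusion list contains just the names mentioned above, and the `activity` field name and sample records are assumptions about how the class rows might be represented.

```python
# Internal calendar entries named in this post; a real list would likely be longer.
INTERNAL_ACTIVITIES = {"floor walking", "cleaning", "administration", "break", "not booked"}


def should_keep(activity_name, keep_not_booked=False):
    """Decide whether a calendar entry should be imported as an event."""
    normalised = activity_name.strip().lower()
    if keep_not_booked and normalised == "not booked":
        return True  # optionally retain empty slots for other uses
    return normalised not in INTERNAL_ACTIVITIES


def filter_classes(classes, keep_not_booked=False):
    """Return only the entries worth importing, preserving their order."""
    return [c for c in classes if should_keep(c["activity"], keep_not_booked)]


# Hypothetical sample rows
sample = [
    {"activity": "Archery", "site": "Example Centre"},
    {"activity": "Floor Walking", "site": "Example Centre"},
    {"activity": "Not Booked", "site": "Example Pool"},
    {"activity": "Body Pump", "site": "Example Pool"},
]

public_only = filter_classes(sample)
with_empty_slots = filter_classes(sample, keep_not_booked=True)
```

Matching on a normalised name is fragile if the source spelling changes, so filtering on the activity reference IDs would probably be more robust if they turn out to be stable.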
I encountered a similar issue when working with a database of community groups from Aberdeen City Council: the data had originally been intended for something else and contained irrelevant information that needed to be cleaned.
Forced into action
The web front end to MatchTheCity was just for convenience rather than a finished product. It worked well for the amount of Aberdeen data but the increase from 650 to 19,200 events made it less than user friendly. Worse, when I pushed the updates to Heroku everything ground to a halt because the app went over the memory limits. This forced me to address some things that had been on the to-do list for a while and finally add pagination to the web front end. This fixed the memory issue and, unexpectedly for a tech project, everything was working just in time for the end-of-event presentations.
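The post does not say how the pagination was implemented, but the underlying idea can be sketched with a plain LIMIT/OFFSET query so that only one page of rows is ever held in memory. The table layout, function name and page size below are assumptions for illustration.

```python
import sqlite3


def fetch_page(conn, page, per_page=50):
    """Return one page of events (pages are 1-indexed), loading only per_page rows."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    offset = (page - 1) * per_page
    return conn.execute(
        "SELECT id, name FROM events ORDER BY id LIMIT ? OFFSET ?",
        (per_page, offset),
    ).fetchall()


# Hypothetical events table with 120 rows for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO events (name) VALUES (?)",
    [(f"Event {i}",) for i in range(1, 121)],
)

first_page = fetch_page(conn, 1)
last_page = fetch_page(conn, 3)
```

A stable `ORDER BY` matters here: without it the database is free to return rows in a different order on each request, so pages could overlap or skip events.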
Food for thought
Not only was EdinburghApps a good chance to make some technical progress with MatchTheCity, it was also a great opportunity to discuss the theoretical value of it all. It gave me a chance to share thoughts and experiences on this sports activities project with people working on similar projects that can make use of the MatchTheCity data feeds.
Some of the reasons for having a single server for all cities instead of a server per city were identified as follows:
- lets people living in one region easily use facilities in neighbouring regions
- lets people on holiday plan before they go on a familiar UI
- lets people plan for away sports matches, e.g. challenging tennis matches around the county
And so MatchTheCity continues to evolve, but as they say, that’s another story for next time.