Link-VGI: LINKing and analyzing Volunteered Geographic Information (VGI) across different platforms - Report of open floor discussion

The Link-VGI Workshop open floor discussion was moderated by the three principal workshop organisers: Peter Mooney, Alexander Zipf and Hartwig Hochmair. All participants in the workshop were invited to take part in the discussions. The discussion time was limited to 45 minutes. This is a short summary of the discussion as reported by Peter Mooney, who acted as rapporteur for this session.

Current status of our work in Link-VGI and the challenges we face

Now that we are attempting to link different sources or streams of data together, there is a need to identify the same object across multiple datasets. This is a difficult problem because the same object is often represented differently (thematically, geometrically, semantically, etc.) in the different datasets. As we add more datasets to this linkage, the problem grows more difficult. Automated ways to perform this object detection and linkage will be necessary as the size of the datasets grows. There are some ideas and concepts about how this could be achieved, such as using semantic signatures of different objects or understanding some spatial or temporal context of an object in order to identify it. While semantics are important for this problem, they are not the only important aspect, and we will have to consider several aspects of the object's context during the identification process. To meet this research need, perhaps we must think more innovatively. An interesting example was described: when geospatial data began to be produced in larger volumes and at higher frequencies three or four decades ago, the solution was to add geometries to relational databases. This is a solution we all now take for granted.
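
The idea of combining more than one aspect of an object's context can be illustrated with a minimal sketch. It is not a method presented at the workshop. It assumes two hypothetical point datasets whose records carry a name, a latitude and a longitude, and it scores candidate pairs by mixing spatial proximity with simple string similarity of the names. The function names, the equal weighting, the 100 m cut-off and the 0.7 threshold are all illustrative assumptions, and string similarity is only a crude stand-in for a semantic signature.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; a crude proxy for semantics."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(obj_a, obj_b, max_dist_m=100.0):
    """Combine spatial and name evidence; the 50/50 weighting is an assumption."""
    dist = haversine_m(obj_a["lat"], obj_a["lon"], obj_b["lat"], obj_b["lon"])
    if dist > max_dist_m:
        return 0.0  # too far apart to be considered the same object
    spatial = 1.0 - dist / max_dist_m
    return 0.5 * spatial + 0.5 * name_similarity(obj_a["name"], obj_b["name"])


def link_objects(dataset_a, dataset_b, threshold=0.7):
    """Return (index_a, index_b, score) for the best candidate above the threshold."""
    links = []
    for i, a in enumerate(dataset_a):
        scored = [(match_score(a, b), j) for j, b in enumerate(dataset_b)]
        if not scored:
            continue
        best_score, best_j = max(scored)
        if best_score >= threshold:
            links.append((i, best_j, round(best_score, 3)))
    return links


# Hypothetical records standing in for two VGI platforms.
platform_a = [{"name": "Corner Cafe", "lat": 53.38000, "lon": -6.59000}]
platform_b = [
    {"name": "The Corner Cafe", "lat": 53.38005, "lon": -6.59003},
    {"name": "Town Library", "lat": 53.39000, "lon": -6.60000},
]
print(link_objects(platform_a, platform_b))
```

The brute-force comparison of every pair is only workable for small samples. At the dataset sizes discussed above, a spatial index would be needed to limit the candidates, and temporal or thematic context could be added as further terms in the score.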

Current research work and the applications developed so far appear to indicate that linking volunteered geographic information across different platforms does yield very interesting results and new insights. However, do we as a research community know what additional questions we could answer using the Link-VGI approach? Are there more complex questions which we could seek to answer? There is a lot of work associated with linking these VGI datasets together before we can even begin to investigate the answers to our research questions. So we should strive to ensure that the questions we are asking of the data are worthwhile given the level of effort required to reach the point of answering them and extracting actionable knowledge from linked VGI. The application of data mining and machine learning algorithms must also improve in order for those approaches to yield better results.

Ethics and privacy in Link-VGI

Informally, in Link-VGI we consider any source of user-generated content (UGC) which has a spatial component as VGI. But we must consider the conceptual arrangement of this information. We have different forms of VGI: at the simplest division we have active VGI and involuntary or passive VGI. This leads to the question of whether social media data and information can really be considered volunteered information. In the case of passive VGI we find ourselves confronted with a range of ethics and privacy issues when linking these sources of data together. No real ground rules have been established so far. Consider the following example: suppose a citizen is using a social media/network application and this data is accessible by the research community through an API or some open data arrangement. What are the ethics and privacy issues arising when this citizen's data is used and linked with other forms of VGI? Does the citizen understand that this can happen? Does this citizen know this type of arrangement is even possible? This is an opportune time for the research community to consider these ethical and data privacy issues.
The exact nature of the VGI used, and the use-case it is applied to, may help to determine which legal, ethical and privacy issues are most prominent. These issues are yet to attract a concentration of research effort. Authors such as Scassa have voiced concerns about information on individual citizens being transferred and presented within a geographical context, arguing that the resulting profile information could be both "highly revelatory and involuntary" (Scassa, 2012) and that this can raise important ethical issues that need to be addressed. It is anticipated that VGI will increasingly be harvested from sources as diverse as social media and wearable devices. While this could potentially yield vast amounts of useful VGI, it brings with it a wide range of privacy, legal and ethical concerns.

Considering these issues around the use of citizen-generated data, the discussion group believed that research work in Link-VGI must begin to reach real end users. Very often our work remains within the research environment, where the research community itself becomes the user of the outputs of this work. But surely this is not who we are devoting all this time and these resources to? Real end users are citizens in cities, towns and rural areas, national and regional governments and planning authorities, emergency services, etc. But to confidently reach out to these end users, our methods and applications must scale both geographically and temporally. Otherwise our work will not bridge the gap between our research labs and application, usage and adoption in real-world scenarios.

Content Analysis in Link-VGI

As discussed and explored in the workshop, Link-VGI sees us consider many different forms of VGI. In particular we consider VGI from social networking and social media such as Facebook, Flickr, Foursquare, Instagram, Yelp, etc. Much of this content is text. The goalposts have shifted away from the traditional quantitative and structured geospatial data we as the GIS community are very familiar and comfortable with. Lexical analysis of the text in these data streams must complement the traditional geospatial analysis we can perform. Two major Computer Science research areas must be considered, namely text content analysis and image recognition (for social media such as Flickr or Instagram). Both are very well established research areas within Computer Science, but the geoinformatics or geocomputation community may lack skills in them. The workshop group therefore asked whether these two areas are best tackled by us within the geoinformatics/geocomputation community, or whether we should work to expand and collaborate with experts in Natural Language Processing and Computer Vision.

The more data we collect, the more heterogeneous our data becomes. We have to work very hard to link these datasets, and we will also have to work very hard to analyse them and extract knowledge from them. The discussion group agreed that around 70% of the work in Link-VGI is often expended on joining the data. This leaves us with only 30% of our time and resources for actually answering questions with our linked data. Given this scenario, it is imperative that we seek collaborations with other researchers who are expert in areas such as natural language processing and computer vision.

The problem of bias

As discussed above, to actually link VGI datasets we often perform many "technical tweaks" to make the linkage happen or to make it actually work. Suppose we take VGI datasets A, B and C and join them. The linked dataset (A + B + C) is a "new" dataset. But we must realise that this "new" dataset has collected all of the bias and the problems of its individual components. The linkage of these three datasets has created a new dataset with very undefined characteristics. It is even more heterogeneous than its individual parts. The workshop discussion group felt that this is a major obstacle to the reuse or application of these types of linked datasets in VGI. Some of the participants felt that at present such a scenario means that we will find it difficult to go beyond very practical examples of using these new linked datasets. For example, combining datasets about car journeys and GPS tracks may only be able to tell us where and when traffic jams are happening rather than giving us new insights into social mobility or the efficiency of the transportation network. At this point in time, could we consider that building an ecological model of a region or country is feasible from linking VGI datasets together? Of course, with the discussion of bias come the orthogonal discussions of uncertainty, accuracy, etc. All of these aspects provide major challenges for us and will continue to do so.

A "new" form of geographic data

As everyone who has worked with location-based data knows, location is a very precious attribute in any dataset. As we have said, the availability of APIs and Open Data services has made it possible for us to combine and link datasets and data sources which otherwise would be very difficult (if not impossible) to link. This is an exciting time. Is geospatial data growing? Most definitely yes! If linking, fusing and integrating these heterogeneous datasets is to continue, do we need to think differently about how we store, manage and structure this new form of geographic data? If one thinks about this carefully, this is what Linked Data is trying to achieve: linking datasets and data objects through attributes and relationships. While examples of linked geospatial data are still relatively few, the inclusion of geospatial data could be considered a longer-term goal of the semantic web. Linked Data provides a framework to link data together. But what will define the geospatial linkage? Will this be done automatically or manually?

People are the most important factor in the generation of this new form of geographic data. Who are the people who are contributing VGI (active and passive) at the moment? The discussion group considered the engagement with users of social media or creators of VGI. How do we engage people to "create" social media or VGI data about particular geographic features, themes, environmental situations, etc? This brings us back to what the appropriate rewards are for contributing and the motivations to contribute. Other factors to consider are for example the influence that the data collection protocol or framework has on the amount and quality of data provided.