The People’s Web Meets NLP: Collaboratively Constructed Language Resources

Edited Volume – Call for Contributions “The People’s Web Meets NLP: Collaboratively Constructed Language Resources”

Springer webpage:

Springer book series: “Theory and Applications of Natural Language Processing”, E. Hovy, M. Johnson and G. Hirst (eds.)


  • Prof. Dr. Iryna Gurevych
  • Dr. Jungi Kim

Table of Contents

Part I Approaches to Collaboratively Constructed Language Resources

1. Using Games to Create Language Resources: Successes and Limitations of the Approach

Jon Chamberlain, Karën Fort, Udo Kruschwitz, Mathieu Lafourcade and Massimo Poesio

2. Senso Comune: A Collaborative Knowledge Resource for Italian

Alessandro Oltramari, Guido Vetere, Isabella Chiari, Elisabetta Jezek, Fabio Massimo Zanzotto, Malvina Nissim, and Aldo Gangemi

3. Building Multilingual Language Resources in Web Localisation: A Crowdsourcing Approach

Asanka Wasala, Reinhard Schäler, Jim Buckley, Ruvan Weerasinghe and Chris Exton

4. Reciprocal Enrichment Between Basque Wikipedia and Machine Translation

Iñaki Alegria, Unai Cabezon, Unai Fernandez de Betoño, Gorka Labaka, Aingeru Mayor, Kepa Sarasola and Arkaitz Zubiaga

Part II Mining Knowledge From and Using Collaboratively Constructed Language Resources

5. A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia

Oliver Ferschke, Johannes Daxenberger and Iryna Gurevych

6. ConceptNet 5: A Large Semantic Network for Relational Knowledge

Robert Speer and Catherine Havasi

7. An Overview of BabelNet and its API for Multilingual Language Processing

Roberto Navigli and Simone Paolo Ponzetto

8. Hierarchical Organization of Collaboratively Constructed Content

Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua

9. Word Sense Disambiguation using Wikipedia

Bharath Dandala, Rada Mihalcea, and Razvan Bunescu

Part III Interconnecting and Managing Collaboratively Constructed Language Resources

10. An Open Linguistic Infrastructure for Annotated Corpora

Nancy Ide

11. Towards Web-Scale Collaborative Knowledge Extraction

Sebastian Hellmann, Sören Auer

12. Building a Linked Open Data Cloud of Linguistic Resources: Motivations and Developments

Christian Chiarcos, Steven Moran, Pablo N. Mendes, Sebastian Nordhoff, Richard Littauer

13. Community Efforts around the ISOcat Data Category Registry

Sue Ellen Wright, Menzo Windhouwer, Ineke Schuurman, Marc Kemps-Snijders


It’s a pleasure to write the Foreword for the book on Collaboratively Constructed Language Resources.

I believe that the trend of collaborative construction of Language Resources (LRs) represents both a “natural” evolution of computerised resource building (I’ll try to give few historical hints) and a “critical” evolution for the future of the field of language resources.

Some historical hints

Where does collaborative resource construction position itself in the language resource field?

I’ll just give a glimpse here at some historical antecedents of the current collaborative methodology, without mentioning the obvious ones, like Wikipedia or Wiktionary.

19th century lexicographic enterprise

Collaborative construction of language resources, and even crowdsourcing, are not recent inventions.

George P. Marsh employed it as early as 1859, on behalf of the Philological Society of London, for “the preparation of a complete lexicon or thesaurus of the English language”, the New English Dictionary (now known as the Oxford English Dictionary).

Acting as Secretary in America, he decided to “adopt this method of bringing the subject to the notice of persons in this country who may be disposed to contribute to the accomplishment of the object, by reading English books and noting words …”. Moreover: “ … the labors of the English contributors are wholly gratuitous”.

Given that not much material was collected after this appeal, a similar appeal was re-launched twenty years later, in 1879, by the dictionary’s editor James Murray, when “volunteer readers were recruited to contribute words and illustrative quotations”:

“… the Committee want help from readers in Great Britain, America, and the British Colonies, to finish the volunteer work so enthusiastically commenced twenty years ago …”,

and “A thousand readers are wanted, and confidently asked for, to complete the work as far as possible within the next three years, so that the preparation of the Dictionary may proceed upon full and complete materials.”

We can’t deny that this is a clear example of collaborative construction of a language resource! It could even be defined as an early example of crowdsourcing.

More recent examples: some EC resource projects of the 20th century

Other – more recent – examples can be found in the policy adopted in projects funded by the European Commission in the ‘90s, where many language resources had to be collaboratively built inside a consortium of many partners. Also because of this “enforced” collaboration, some features and trends with clear connections to the current notion of “collaborative building” emerged in the first half of the ‘90s:

  • The need to build a core set of LRs, designed in a harmonised way, for all the EU languages
  • The need to base LR building on commonly accepted standards
  • The need to make the LRs that are created available to the community at large, i.e. the need for a distribution policy (at that time we introduced the notion of distributing resources, not yet sharing them!).

By the way, these requirements are strictly implied by and related to the emerging notion in the ‘90s of the “infrastructural role” of LRs.

I just mention two types of collaborative resource building in EC projects, representing two partially different building models.

One method could be represented by the EuroWordNet projects: each partner was building the WordNet for her/his language, all modelled on – and linked to – the original Princeton WordNet, and altogether constituting a homogeneous and interrelated set of lexicons.

Another method is represented by projects like PAROLE and SIMPLE, for the construction and acquisition of harmonised resources. They were, to my knowledge, the first attempt at developing together medium-size coverage lexicons for so many languages (12 European languages), with a harmonised common model, and with encoding of structured semantic types and syntactic (subcategorisation) and semantic frames on a large scale. Reaching a common agreed model grounded on sound theoretical approaches within a very large consortium, and for so many languages, was in itself a challenging task. The availability of these uniformly structured lexical and textual resources, based on agreed models and standards, in so many EU languages, offered the benefits of a standardised base, creating an infrastructure of harmonised LRs throughout Europe.

What was interesting was that these projects positioned themselves inside the strategic policy – supported by the EC – of providing a core set of language resources for the EU languages based on the principle of “subsidiarity”. According to the subsidiarity concept, the process started at the EU level continued at the national level, extending the core sets of resources to real size in the framework of a number of National Projects.

This achievement was of major importance in a multilingual space like Europe, where all the difficulties connected with the task of LR building are multiplied by the language factor. The various language resource projects also marked the beginning of the interest in standardisation in Europe. It was seen as a waste of money, effort and time that every new project was redoing from scratch the same type of (fragments of) LRs, without reusing what was already available, while LRs produced by the projects were usually forgotten and left unused. From here, the notion of “reusability” arose. As a remedy, a clear demand for interoperability standards and for common terms of reference emerged.

Reusability and integration of language resources

Other requirements with connections to collaborative construction of LRs are the possibility to reuse and integrate different language resources.

LRs (i.e. data) started to be understood as critical for making steps forward in NLP as early as the ‘80s, marking a sort of revolution with respect to times and approaches in which they had even been dismissed as an uninteresting burden. The 1986 Grosseto (Tuscany) Workshop “On automating the lexicon” was the event marking this inversion of tendency, and the starting point of the process which gradually brought the major actors of the NLP sector to pay more and more attention to so-called “reusable” language resources.

In 1998, in a keynote talk at the 1st LREC in Granada, I could state that “Integration of different types of LRs, approaches, techniques and tools must be enforced” as a compelling requirement for our field: “The integration aspect is becoming – fortunately – a key aspect for the field to grow. This is in fact a sign of maturity: today various types of data, techniques, and components are available and waiting to be integrated with not too great an effort. I believe that this integration task is an essential step towards ameliorating the situation, both in view of new applicative goals and also in view of new research dimensions. The integration of many existing components gives in fact more than the sum of the parts, because their combination adds a different quality.”

Among the combinations to be explored I mentioned: interaction between lexicon and corpus, integration of different types of lexicons, of various components in a chain (what we call today workflows), of Written and Spoken LRs towards multimedia and multimodal LRs, and also integration of symbolic and statistical approaches. I observed that “a single group simply does not have the means, or the interest, to carry them out. … everything is tied together, which makes our overall task so interesting – and difficult. What we must have is the ability to combine the overall view with its decomposition into manageable pieces. No one perspective – the global and the sectorial – is really fruitful if taken in isolation. A strategic and visionary policy has to be debated, designed and adopted for the next few years, if we hope to be successful.”

Collaborative construction of LRs is linked to and is an evolution of both the reusability notion and the integration requirement.

Language Technology as a data intensive field: the data-driven approach

LRs were not conceived as an end in themselves, but as an essential component to develop robust systems and applications. They were the obvious prerequisite and the critical factor in the emergence and the consolidation of the data-driven approach in human language technology. Today we recognise that Language Technology is a data-intensive field and that major breakthroughs have stemmed from a better use of more and more Language Resources.

From Murray’s appeal, through corpus-based lexicography, back to collaborative work!

In the ‘90s, computer-aided corpus-based lexicography became the “normal” lexicographic practice for the identification and selection of documentation – through text-processing methods, frequency lists, pattern spotting, context analysis, and so on. No need to ask for 10,000 contributors!

Data-driven methods and automatic acquisition of linguistic information started in the late ‘80s with the ACQUILEX project, which aimed at acquiring lexical information from so-called machine-readable dictionaries. The needs of “language industry” applications made it compelling to rely on the actual usage of languages, as attested in large corpora, for acquiring linguistic information, instead of relying on human introspection as the source of linguistic information and testing linguistic hypotheses with small amounts of data. This meant developing statistical techniques, machine learning, text mining, and so on.

All this was, and is, very successful; but these techniques rely, on one side, on bigger and bigger collections of data (LRs), possibly annotated in many ways and often with human intervention, and, on the other side, they are never 100% correct, thus again requiring human intervention. Therefore, if more and bigger (processed) LRs are needed, and if statistical techniques reach a certain limit, new ways to cope with this need for “Big Data” must be found and explored. Natural ways of coping with the big data paradigm and the need to accumulate extremely large (linguistic) knowledge bases are:

  • collaborative building of resources on one side, and
  • putting human intelligence back in the loop on the other side, recognising that some tasks are better performed by humans: crowdsourcing as a form of global human-based computation.

Collaborative building vs. crowdsourcing can be paralleled to the difference between the involvement and contribution of colleagues (as in the EC projects above) vs. the involvement of the layman/everyone (as in Murray’s appeal). Even if both can be said to rely on collective intelligence, or on the “wisdom of the crowd”, they clearly represent quite different approaches and methodologies and require different organisations.

Language Resources and the Collaborative framework: to achieve the status of a mature science

The traditional LR production process is too costly. A new paradigm is pushing towards open, distributed language infrastructures based on sharing LRs, services and tools. Joining forces and working together on big experiments that bring together thousands of researchers has been my dream for many years; I think it is the only way for our field to achieve the status of a mature science.

It is urgent to create a framework enabling effective cooperation of many groups on common tasks, adopting the paradigm of accumulation of knowledge that has been so successful in more mature disciplines, such as biology, astronomy and physics. This requires enabling the development of web-based environments for collaborative annotation and enhancement of LRs, but also the design of a new generation of multilingual LRs, based on open content interoperability standards. The rationale behind the need for open LR repositories is that the accumulation of massive amounts of (high-quality) multi-dimensional data about many languages is the key to fostering advancement in our knowledge about language and its mechanisms. We must finally be coherent and take concrete actions leading to the coordinated gathering – in a shared effort – of as much (processed/annotated) language data as we are collectively able to produce. This initiative compares to astronomers’ and astrophysicists’ accumulation of huge amounts of observational data for a better understanding of the universe.

Consistently with the vision of an open distributed space of sharable knowledge available on the web for processing, the “multilingual Semantic Web” may help determine the shape of the LRs of the future and may be crucial to the success of an infrastructure – critically based on interoperability – aimed at enabling and improving the sharing and collaborative building of LRs for better accessibility to multilingual content. This will better serve the needs of language applications, enabling us to build on each other’s achievements, integrate results, and make them accessible to various systems, thus coping with the need for more and more ‘knowledge intensive’ large-size LRs for effective multilingual content processing. This is the only way to make a giant leap forward.

Relations with other dimensions relevant to the LR field

In the “FLaReNet Final Blueprint”, the actions recommended for a strategy for the future of the LR field are organised around nine dimensions: a) Infrastructure, b) Documentation, c) Development, d) Interoperability, e) Coverage, Quality and Adequacy, f) Availability, Sharing and Distribution, g) Sustainability, h) Recognition, i) International Cooperation. Taken together, as a coherent system, these directions contribute to a sustainable LR ecosystem.

Let’s not forget that the same requirements apply whatever the method of LR building: collaboratively built resources undergo the same rules/recommendations. An implication of collaboration is that interoperability acquires even more value. The same is true for sustainability, for data infrastructure enabling international collaboration, and also for notions such as authority and trust. Moreover, when collaborative building is explicitly performed, there is the need to better define all the small steps inside an overall methodology. These recommendations could be taken as a framework in which to insert our future work strategy also in the collaborative paradigm.

Let’s organise our future!

One of the challenges for the collaborative model to succeed will be to ensure that the community at large is engaged! This can also be seen as an effort to push towards a culture of “service to the community”, where everyone has to contribute. This “cultural change” is not a minor issue. This requirement was, for example, at the basis of the LRE Map idea, a collaborative bottom-up means of collecting metadata on LRs from conference authors, contributing to the promotion of a large movement towards an accurate and massive bottom-up documentation of LRs (the Map has gathered metadata for about 4000 LRs from many conferences).

My final remark is that, as with any new development, it is important on one side to leave space for the free rise of new ideas and methods inside the collaborative paradigm, but it is also important to start organising its future. There must be a bold vision, and an international group able to push for it (with both researchers and policy makers involved) and to organise some grand challenge that, via a distribution of efforts and by exploiting the sharing trend, involves the collaboration of a consistent portion of our community. Could we envision a large “Language Library” as the beginning of a big Genome project for languages, where the community collectively deposits/creates increasingly rich and multi-layered LRs, enabling a deeper understanding of the complex relations between different annotation layers/language phenomena?

Pisa, Italy Nicoletta Calzolari


In recent years, researchers from a variety of computer science fields, including computer vision, language processing and distributed computing, have begun to investigate how collaborative approaches to the construction of information resources can improve the state of the art. Collaboratively constructed language resources (CCLRs) have been recognized as a topic in their own right in the fields of Natural Language Processing (NLP) and Computational Linguistics (CL). In this area, the application of collective intelligence has yielded CCLRs such as Wikipedia, Wiktionary, and other language resources constructed through crowdsourcing approaches, such as Games with a Purpose and Mechanical Turk.

The emergence of CCLRs has generated new challenges for the research field. Collaborative construction approaches yield new, previously unknown levels of coverage, while also bringing along new research issues related to the quality and the consistency of representations across domains and languages. Rather than coming from a small group of experts, the data for knowledge construction is prepared by volunteers from multiple sources, experts and non-experts with all gradations in between, in a crowdsourcing manner. The resulting data can be employed to address questions that were previously not feasible due to the lack of the respective large-scale resources for many languages, such as lexical-semantic knowledge bases or linguistically annotated corpora, including differences between languages and domains, or certain seldom-occurring phenomena.

The research on CCLRs has focused on studying the nature of resources, extracting valuable knowledge from them, and developing algorithms to apply the extracted knowledge in various NLP tasks. Because the CCLRs themselves present interesting characteristics that distinguish them from conventional language resources, it is important to study and understand their nature. The knowledge extracted from CCLRs can substitute for or supplement customarily utilized resources such as WordNet or linguistically annotated corpora in different NLP tasks. Other important research directions include interconnecting and managing CCLRs and utilizing NLP techniques to enhance the collaboration processes while constructing the resources.

CCLRs contribute to NLP and CL research in many different ways, as demonstrated by the diversity and significance of the topics and resources addressed in the chapters of this volume. They promote the improvement of the respective methodologies, software, and resources to achieve a deeper understanding of language, at larger scale and in more depth. As the topic of CCLRs has matured as a research area, it has been consolidated in a series of workshops at the major CL and artificial intelligence conferences (the People’s Web Meets NLP workshop series at ACL-IJCNLP 2009, COLING 2010, and ACL 2012) and in a special issue of the Language Resources and Evaluation journal. Besides, the community has produced a number of widely used tools and resources, including word sense alignments between WordNet, Wikipedia, and Wiktionary; folksonomy and named entity ontologies; multiword terms; ontological resources; annotated corpora; and Wikipedia and Wiktionary APIs (JWPL, wikixmlj, JWKTL).

Purpose of This Book

The present volume provides an overview of the research involving CCLRs and their applications in NLP. It draws upon the current great interest in collective intelligence for information processing in general. Several meetings have taken place at the leading conferences in the field, and corresponding conference tracks, e.g. “NLP for Web, Wikipedia, Social Media”, have been established. The editors of this volume thus recognized the need to summarize the achieved results in a contributed book to advance and focus further research effort. In this regard, the subject of the book “The People’s Web Meets NLP: Collaboratively Constructed Language Resources” is very timely. No monograph, textbook, or contributed book yet comprehensively covers the state of the art on CCLRs in a single volume. Thus, we very much hope that this book will become a major point of reference for researchers, students and practitioners in this field.

Book Organization

The chapters in the present volume cover the three main aspects of CCLRs, namely construction approaches to CCLRs, mining knowledge from and using CCLRs in NLP, and interconnecting and managing CCLRs.

Part 1: Approaches to Collaboratively Constructed Language Resources

Collaboratively constructed resources have different forms and are created by means of different approaches, such as collaborative writing tools, human computation platforms, games with a purpose, or collecting user feedback on the Web.

Some of them are constructed by applying Social Web tools, such as wikis, to existing forms of knowledge production. For example, Wikipedia was created through the use of wikis to construct an electronic encyclopedia. In a similar way, Wiktionary was created through the use of wikis to construct a user-generated dictionary. Major research questions in this area include how to utilize a Social Web tool to arrive at a useful resource, how to motivate users to contribute, how to extract the knowledge, and how to cope with quality issues, varied coverage, and the incompleteness of the resulting resources.

Further CCLRs result from the purposeful use of human computation platforms on the Web, such as Amazon Mechanical Turk, where expert-like or highly subjective tasks are performed by a large number of non-expert workers paid for their work. Typically, a complex task is modeled as a set of simpler tasks solved by means of a web-based interface. In other settings, platforms for collaborative annotation by unpaid peers may be used to construct language resources collaboratively. Major research questions in this context are, for example, how to model a complex task in such a way that it can feasibly be solved by non-experts, how to prevent spam, and how to handle monetary, quality, and labor management issues.

The third approach to the construction of CCLRs by means of crowdsourcing is to model data management tasks, such as data collection or data validation, as a game. The players of such a game contribute their knowledge collectively, either for fun or for learning purposes. These works address research questions such as how to convert the task into a game, how to motivate players to participate continuously, and how to manage the quality of the resulting data.

Part 2: Mining Knowledge From and Using Collaboratively Constructed Language Resources

Much effort has been put into utilizing CCLRs in various NLP tasks and demonstrating their effectiveness.

The present volume includes a number of examples of research in this area, specifically the construction of semantic networks, word sense disambiguation, computational analysis of writing, and sentiment analysis.

The first approach to mining knowledge from CCLRs is to construct or improve semantic networks.

There exist manually constructed semantic resources such as WordNet and FrameNet. Resources constructed through collective intelligence, such as Wikipedia, Wiktionary, and Open Mind Common Sense, can provide rich, real-world knowledge at large scale that may be missing in manually constructed resources. In addition, combining resources that are complementary in coverage and granularity can yield a higher-quality resource.

The second approach to utilizing CCLRs is mining the vast amount of user-generated content on the Web to create specific corpora which can be used as resources in computational intelligence tasks. Much of this data implicitly carries semantic annotations by users, as the corpora typically revolve around a certain domain of discourse and therefore represent its inherent knowledge structure. NLP applications exemplified in this book include the computational analysis of writing using the Wikipedia revision history, organizing and analyzing consumer reviews, and word sense disambiguation using Wikipedia articles as concepts.

The applications of CCLRs in NLP are certainly not limited to the example topics explained in this book; one can find a large number of research works with similar goals and approaches in the literature.

Part 3: Interconnecting and Managing Collaboratively Constructed Language Resources

Readily available technology and resources such as Amazon Mechanical Turk and Wikipedia have lowered the barriers to collaborative resource construction and its enhancement. They have also led to a large number of sporadic efforts creating resources in different domains and with different coverage and purposes. This often results in resources that are disparate, poorly documented and supported, and of unknown reliability. Such resources therefore run the risk of not being used extensively by the community, and can disappear very quickly.

The research question is then how to make linguistic resources, expert-built and collaboratively constructed alike, more sustainable, such that the resources are more usable and accessible, and more easily maintained, managed, and improved.

In this part of the book, a number of ongoing community efforts to link and maintain multiple linguistic resources are presented. The resources considered range from lexical resources to annotated corpora. The chapters of the volume also introduce special interest groups, frameworks, and ISO standards for linking and maintaining such resources.

Target Audience

The book is intended for advanced undergraduate and graduate students, as well as professionals and scholars interested in various aspects of research on CCLRs.


We thank all program committee members who generously invested their expertise and time in providing constructive reviews. This book would not have been possible without their support, especially considering the tight schedule and multiple review rounds. We also thank Nicoletta Calzolari for her insightful and inspiring foreword.

Iryna Gurevych and Jungi Kim