Customer Interaction Data of German Emails and Online Requests

General

This dataset was created within the EU-funded project EXCITEMENT (EXploring Customer Interactions through Textual EntailMENT) to evaluate the task of automatically categorizing German customer requests. The dataset consists of a set of emails and online requests sent to the support center of a multimedia software company. The requests describe issues that customers reported with the company's products. Each email was manually assigned to one or more matching categories. Each category represents an issue reported by one or several customers.

Anonymization

To eliminate all information that would allow tracing back to the software company or to individual people, all original requests were modified by:
  1. Transforming the product domain of the real company to a different product domain of an imaginary company called WAREHOUSE. WAREHOUSE provides management software for online auction sales. This domain transformation required the modification (and, in some cases, the deletion) of all references to the original product, such as product names, software functions, or system logs.
  2. Anonymizing personal data, e.g., names or addresses of customers and employees.

Content

The dataset contains two files:
  1. omq_public_emails.xml: This file contains the list of emails. Each email consists of the email text along with some meta-information. In each email text, one or more relevant text parts (i.e., the parts containing the main issues described by the customer) are marked, together with the ID of the category to which the respective issue has been assigned. The file contains 627 emails, in which 638 relevant texts are marked.
  2. omq_public_categories.xml: This file lists all categories assigned to the emails in the email dataset. As the categories are not always mutually exclusive, similar categories are combined into category groups. Each category consists of an ID and a text description of the category issue. The dataset contains 41 categories arranged in 20 groups.
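
Since the element and attribute names used in the two XML files are not described here, a first look at the files can be taken without assuming any schema. The following minimal Python sketch counts how often each element tag occurs in an XML file; the function name tag_counts is hypothetical and not part of the dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(source):
    """Count how often each element tag occurs in an XML document.

    `source` is a file name or a binary file object. No assumption is
    made about the schema: every element is counted under its own tag.
    """
    counts = Counter()
    # The "start" event fires once per element, when its opening tag is read.
    for _event, elem in ET.iterparse(source, events=("start",)):
        counts[elem.tag] += 1
    return counts


# Example call (assumes the file is in the current directory):
#   print(tag_counts("omq_public_emails.xml").most_common())
```

Comparing the resulting counts with the figures given above (627 emails, 638 relevant texts, 41 categories, 20 groups) may help to identify which tags correspond to emails, relevant text parts, categories, and groups.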

Download Dataset

The email dataset is available for German and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You can download the dataset from here.

An analysis of the dataset is described in this paper. If you use the resource, please cite the paper as follows:

@inproceedings{eichler_etal_2014,
  author = {Eichler, Kathrin and Gabryszak, Aleksandra and Neumann, G{\"u}nter},
  title = {An analysis of textual inference in German customer emails},
  booktitle = {Proceedings of the Third Joint Conference on Lexical and Computational Semantics},
  series = {{*SEM-14}},
  year = {2014},
  location = {Dublin, Ireland},
}

RTE-style dataset

The RTE-style dataset contains text-hypothesis (TH) pairs created semi-automatically from the above customer interaction dataset. For building the dataset, all categories were first manually grouped into sets of semantically similar categories. For each customer interaction, we then created a set of TH pairs: one positive entailment pair with T corresponding to the relevant text part of the interaction and H corresponding to the description of the associated category, and a set of negative entailment pairs with T corresponding to the relevant text part of the interaction and H corresponding to the description of a non-matching category. As non-matching categories, we considered all categories that were neither directly associated with the interaction nor considered semantically similar to the associated category.

When splitting the dataset into a training and a test part, we made sure that the distribution of categories is similar in both parts of the dataset, and that for each category there is at least one positive TH pair in either part of the dataset.

In the unbalanced version of the dataset, the average number of negative pairs per positive pair is 38. In the balanced version of the dataset, there exists exactly one negative TH pair for each positive TH pair.
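
The pair construction described above can be illustrated with a short Python sketch. It is not the code used to build the dataset: the function names, the data layout (an interaction as a list of (relevant text, category ID) tuples, plus two dictionaries for category descriptions and group membership), and the example data are all hypothetical. In particular, the text does not say how the single negative pair in the balanced version was chosen; random sampling is assumed here.

```python
import random


def build_th_pairs(interaction, descriptions, group_of):
    """Create (T, H, label) pairs for one customer interaction.

    interaction:  list of (relevant_text, category_id) tuples
    descriptions: dict mapping category_id -> category description
    group_of:     dict mapping category_id -> group of similar categories
    """
    # Groups of all categories associated with this interaction; every
    # category in one of these groups counts as matching or similar.
    blocked_groups = {group_of[cat] for _text, cat in interaction}
    pairs = []
    for text, cat in interaction:
        # Positive pair: relevant text vs. description of its category.
        pairs.append((text, descriptions[cat], True))
        # Negative pairs: descriptions of all non-matching categories.
        for other in sorted(descriptions):
            if group_of[other] not in blocked_groups:
                pairs.append((text, descriptions[other], False))
    return pairs


def balance(pairs, seed=0):
    """Keep each positive pair plus one negative pair for the same text.

    Assumption: the negative pair is drawn at random. A positive pair
    without any negative candidate is kept without a counterpart.
    """
    rng = random.Random(seed)
    negatives = {}
    for text, hyp, label in pairs:
        if not label:
            negatives.setdefault(text, []).append((text, hyp, label))
    balanced = []
    for pair in pairs:
        if pair[2]:
            balanced.append(pair)
            candidates = negatives.get(pair[0], [])
            if candidates:
                balanced.append(rng.choice(candidates))
    return balanced


# Hypothetical example data, not taken from the dataset.
descriptions = {"c1": "Login fails", "c2": "Password reset fails",
                "c3": "Export crashes"}
group_of = {"c1": "g1", "c2": "g1", "c3": "g2"}
interaction = [("I cannot log in.", "c1")]
pairs = build_th_pairs(interaction, descriptions, group_of)
```

In this example, category c2 yields no negative pair because it shares a group with the associated category c1, so the only negative hypothesis is the description of c3.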

The RTE-style dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You can download the dataset from here.

Other

The customer data underlying this dataset was collected between 01.01.2011 and 01.01.2012.

The dataset is the result of a joint effort of: