
Application of NEPS data in teaching

Currently, the NEPS data are only available as Scientific Use Files (SUF). To use the data, every data user has to sign an extensive data use agreement that restricts who may use the data, for which purpose, and for how long (for further information about the data use agreements, see here).
To make the data accessible to a larger group of interested persons without including each of them in a data use agreement, LIfBi provides so-called teaching or campus data. These are fully anonymized extracts from the Scientific Use Files, meaning that identification of individual persons, households, or institutions is impossible.
In the current pilot stage, the Research Data Center provides these fully anonymized data to data users for teaching purposes under the following conditions:

  1. The teacher must have signed a valid data use agreement.
  2. All course participants have to sign an attendance list.
  3. Additionally, the participants must be made aware of the confidentiality of the data and handle the data with care. In particular, disclosing the data to third parties (persons who are not participants of the course) is not allowed.
  4. In consultation with the teacher, the Research Data Center generates a fully anonymized dataset, which is constructed as follows:
    1. The starting point of all modifications is the corresponding Download-SUF, which is already the most strongly anonymized SUF version.
    2. From this SUF, a partial dataset containing only a small number of variables (10-15) is created.
    3. The IDs are replaced such that linkage to the initial SUF is not possible.
    4. A subsample is drawn from the SUF (steps 2-4 are illustrated in the first sketch after this list).
    5. k-anonymity protection (more precisely, 2-anonymity) is applied to all categorical variables (see [1]). This means that in the generated dataset every combination of variable values occurs for at least k individuals, so that no combination can serve as a unique identifier for one specific person (every person has at least k-1 statistical twins). To satisfy k-anonymity, a combination of aggregating categories and removing characteristic values is used (see the second sketch after this list).
    6. Metric variables are perturbed, i.e. masked with additive noise drawn from an N(0, r*s) distribution (see the third sketch after this list).
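
As an illustration of steps 2 to 4, the following minimal sketch in Python (pandas/NumPy) selects a handful of variables, replaces the IDs, and draws a random subsample. The file name, column names, sampling fraction, and seed are purely hypothetical and not part of the actual NEPS procedure.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=42)

    # Hypothetical stand-in for a Download-SUF; file and column names are illustrative only.
    suf = pd.read_csv("download_suf.csv")

    # Step 2: keep only a small set of variables (10-15 in practice; 4 shown here).
    keep = ["ID_t", "gender", "school_type", "reading_score"]
    campus = suf[keep].copy()

    # Step 3: replace the original IDs with new random IDs so that no record
    # can be linked back to the initial SUF.
    campus["ID_t"] = rng.permutation(np.arange(1, len(campus) + 1))

    # Step 4: draw a random subsample (the fraction chosen here is arbitrary).
    campus = campus.sample(frac=0.3, random_state=42).reset_index(drop=True)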
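The 2-anonymity condition of step 5 can be pictured with the following sketch, which simply suppresses records whose combination of categorical values is too rare; the actual campus files additionally rely on aggregating categories, and the variable names below are hypothetical.

    import pandas as pd

    def enforce_k_anonymity(df, quasi_identifiers, k=2):
        """Keep only records whose combination of values on the quasi-identifiers
        occurs at least k times, so every remaining person has at least k-1
        statistical twins. (Plain suppression is shown for brevity; in practice
        rare categories would also be aggregated into coarser ones.)"""
        sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
        return df[sizes >= k].copy()

    # Toy example with hypothetical variables:
    toy = pd.DataFrame({
        "gender":      ["f", "f", "m", "m", "m"],
        "school_type": ["Gymnasium", "Gymnasium", "Hauptschule", "Hauptschule", "Realschule"],
    })
    print(enforce_k_anonymity(toy, ["gender", "school_type"], k=2))
    # The single (m, Realschule) record is removed because its value combination is unique.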
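Step 6 can be sketched as follows; reading N(0, r*s) as a normal distribution whose standard deviation is the variable's standard deviation s scaled by a factor r is an assumption, as is the choice of r.

    import numpy as np
    import pandas as pd

    def add_noise(series, r=0.1, rng=None):
        """Mask a metric variable with additive noise drawn from a normal
        distribution with mean 0 and standard deviation r * s, where s is the
        variable's standard deviation (this reading of N(0, r*s) is an assumption)."""
        if rng is None:
            rng = np.random.default_rng()
        s = series.std()
        return series + rng.normal(loc=0.0, scale=r * s, size=len(series))

    # Hypothetical metric variable:
    scores = pd.Series([480.0, 512.5, 430.0, 550.0, 495.0], name="reading_score")
    print(add_noise(scores, r=0.1, rng=np.random.default_rng(1)))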
The advantages of this approach are that the data are fully anonymized and the conditions of use are binding yet practicable, which ensures a maximum of data protection. The disadvantage, however, is that the amount of data can be relatively small: every generated dataset is only usable for a specific purpose (defined by the chosen variables). Furthermore, data collected over several episodes (spell data) cannot be anonymized as described above and therefore cannot be provided. Moreover, it is not recommended to draw causal conclusions, since such strict anonymization does not allow any statements about internal validity.
Please contact the Research Data Center via fdz@lifbi.de if you are interested in campus files.

[1] Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570.