B.2 Privacy-preserving Data Analytics in OSNs (Aidmar)

- Aidmar Wainakh -

The number of people using online social networks (OSNs) continues to grow year after year. One reason is that OSNs offer their main functionality free of charge. However, the dominant OSNs, such as Facebook, operate in a centralized mode, which means that all of the users' data is in the hands of the service providers (SPs). The SPs use this data to generate revenue.

The collection of such vast amounts of personal data by SPs raises serious concerns about individuals' privacy for several reasons. First, SPs sell users' data to third parties (e.g., advertisers). Second, SPs comply with the laws of the countries in which they operate, so they may provide data to intelligence or government agencies under specific circumstances. Third, SPs' ability to protect the data is questionable (e.g., the data of about 50 million Facebook users was harvested by Cambridge Analytica, as became public in 2018).

Within subproject B.2 of RTG 2050, we focus on enhancing privacy in OSNs by giving users the ability to control their own data. To this end, we propose the concept of a hybrid OSN (HOSN), in which users keep using the centralized OSNs but can specify which data to give to SPs and at what level of accuracy. In this way, users can preserve their privacy to the level they desire. However, realizing this concept requires taking the financial sustainability of the SPs into account. In other words, the data provided by the users should be sufficient to keep the SPs' business model running. We therefore address the following three research problems.

a) Anonymous communication: To share control over the data between an SP and its users, we need techniques for distributed anonymous communication. These techniques should be interwoven with the features and functionality of current OSNs to form an HOSN. The anonymous communication system has to enable efficient communication, which is a challenge in distributed networks without shareable knowledge. Moreover, the system has to remain efficient even when confronted with dynamic user behavior, such as churn.

b) Users' privacy awareness: To make use of the ability to communicate anonymously, a user has to know her actual privacy state. Therefore, we need to develop measures that quantify users' privacy based on their communication network.

c) Privacy-preserving data access: An SP needs a functioning business model to provide its services; therefore, we need to develop a system that hands data to the SP without harming the users' privacy. This might be done in three different ways. First, users can provide SPs with an anonymized version of their data, communicating with each other to aggregate and then anonymize it. Second, users can build collaborative models from their data and use these models to generate synthetic data. Third, users can train machine learning models on their data and provide these models to SPs instead of the raw data. The third option is motivated by the fact that SPs rely heavily on machine learning techniques for data analysis.
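
To make the third option more concrete, the following minimal sketch shows one way it could look. It is an illustration under assumptions that are not part of the subproject description: each user fits a tiny linear model on her own data, and only the fitted parameters are averaged into a shared model for the SP. The toy data, the linear model, the parameter averaging, and all function names (make_user_data, train_local, average_models, run_round) are hypothetical.

```python
import random


def make_user_data(seed, n=30):
    """Create toy private data for one user (hypothetical: y = 2x + 1 plus small noise)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)
        data.append((x, y))
    return data


def train_local(data, epochs=2000, lr=0.5):
    """Fit y = w * x + b on one user's data with plain gradient descent.

    This runs on the user's side; the raw (x, y) pairs are never returned.
    """
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradients of the (halved) mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def average_models(models, sizes):
    """Combine local (w, b) pairs into one model, weighted by local data size."""
    total = sum(sizes)
    w = sum(m[0] * s for m, s in zip(models, sizes)) / total
    b = sum(m[1] * s for m, s in zip(models, sizes)) / total
    return w, b


def run_round(user_datasets):
    """One round: every user trains locally, then only parameters are shared."""
    models = [train_local(d) for d in user_datasets]
    sizes = [len(d) for d in user_datasets]
    return average_models(models, sizes)


if __name__ == "__main__":
    users = [make_user_data(seed) for seed in range(3)]
    w, b = run_round(users)
    print(f"shared model: y = {w:.2f} * x + {b:.2f}")
```

In this sketch, only the (w, b) pairs leave each user's side. Whether shared model parameters can still leak information about the underlying raw data is a separate question that the sketch does not address.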

Tandem partner: A.1, A.3