Overview of the project

In this research project, we aim to address the following research question: “What are the primary reasons that text descriptions of mobile apps fail to refer to the use of privacy-sensitive resources?”

To answer the research question, we developed a framework called ACODE (Analyzing COde and DEscription), which combines static code analysis and text analysis. We developed lightweight techniques so that we can handle hundreds of thousands of distinct text descriptions. Our text analysis technique does not require manually labeled descriptions; hence, it enables us to conduct a large-scale measurement study without expensive labeling tasks.
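
The combination of code analysis and text analysis described above can be illustrated with a minimal Python sketch. It is not the ACODE implementation: the `App` record, the `KEYWORDS` table, and the function names are hypothetical, and the keyword lists are made up for illustration. The sketch assumes that static code analysis has already produced the set of permissions an app actually uses, and it flags each used permission that the description never mentions.

```python
import re
from dataclasses import dataclass, field

# Hypothetical per-permission keywords; the real keyword lists are part of
# the ACODE dataset and are not reproduced here.
KEYWORDS = {
    "ACCESS_FINE_LOCATION": {"gps", "location", "map"},
    "CAMERA": {"camera", "photo", "scan"},
}


@dataclass
class App:
    app_id: str
    description: str
    # Assumed to be the output of a separate static code analysis step.
    used_permissions: set = field(default_factory=set)


def mentions(description: str, permission: str) -> bool:
    """Return True if the description contains any keyword for the permission."""
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    return bool(tokens & KEYWORDS.get(permission, set()))


def find_inconsistencies(app: App) -> list:
    """List used permissions (with known keywords) the description does not mention."""
    return sorted(
        p for p in app.used_permissions
        if p in KEYWORDS and not mentions(app.description, p)
    )


qr_app = App(
    app_id="com.example.qr",  # hypothetical package name
    description="Scan QR codes with your camera.",
    used_permissions={"CAMERA", "ACCESS_FINE_LOCATION"},
)
flagged = find_inconsistencies(qr_app)
```

In this example the description covers the camera but not location, so only the location permission would be flagged as an inconsistency. The simple regular-expression tokenizer handles English only; Chinese descriptions would need a word segmenter.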

Our analysis of 200,000 apps and multilingual text descriptions collected from official and third-party Android marketplaces revealed four primary factors that are associated with the inconsistencies between text descriptions and the use of privacy-sensitive resources:

- the existence of app building services/frameworks that tend to add API permissions/code unnecessarily,
- the existence of prolific developers who publish many applications that request permissions and include code unnecessarily,
- the existence of secondary functions that tend to go unmentioned, and
- the existence of third-party libraries that access privacy-sensitive resources.

We believe that these findings will be useful for improving users’ awareness of privacy on mobile software distribution platforms.

Dataset

We share the ACODE dataset with the research community. The dataset consists of text descriptions of 200K Android apps. The descriptions are written in English and Chinese; the English descriptions were collected from the official Google Play store and the Chinese descriptions from a third-party market. The dataset also includes manually labeled text descriptions, where a label indicates whether a text description refers to the use of a given permission. We used the labeled data to verify the performance of our keyword-based text classifier. In addition to the raw text descriptions, we share the extracted keywords, which can be used to classify text descriptions. Using these keywords, other researchers should be able to reproduce our results. The dataset contains the following:

- Text descriptions of Android apps
  - 100,000 text descriptions in English (JSON format)
  - 100,000 text descriptions in Chinese (JSON format)

- Labeled Android text descriptions
  - 3,000 labeled text descriptions in English (1,000 x 3 permissions, JSON format)
  - 3,000 labeled text descriptions in Chinese (1,000 x 3 permissions, JSON format)

- Extracted keywords that can be used to classify text descriptions
  - Top 10 words for each of 11 permissions in English
  - Top 10 words for each of 11 permissions in Chinese

- Extracted domain-specific stop words
  - 100 stop words in English
  - 100 stop words in Chinese
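
As a rough illustration of how the keywords, stop words, and labeled descriptions could be used together, the following Python sketch evaluates a keyword-based classifier against labeled records. The JSON field names (`text`, `label`), the sample records, and the keyword and stop-word lists are all assumptions made for this example; they do not reflect the actual schema or contents of the dataset.

```python
import json
import re

# Hypothetical stand-ins for the domain-specific stop words and the
# top keywords of one permission (here: camera).
STOP_WORDS = {"app", "free", "download"}
CAMERA_KEYWORDS = {"camera", "photo", "scan"}


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def predict(text, keywords):
    """Predict that a description refers to the permission if any keyword occurs."""
    return any(token in keywords for token in tokenize(text))


def evaluate(records, keywords):
    """Compare predictions with labels and return precision, recall, and accuracy."""
    tp = fp = fn = tn = 0
    for record in records:
        predicted = predict(record["text"], keywords)
        if predicted and record["label"]:
            tp += 1
        elif predicted and not record["label"]:
            fp += 1
        elif not predicted and record["label"]:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Made-up labeled records in an assumed JSON layout.
SAMPLE_JSON = """[
  {"text": "Take a photo and share it with friends.", "label": true},
  {"text": "Free download! The best puzzle game.", "label": false},
  {"text": "Read barcodes instantly.", "label": true},
  {"text": "Scan your documents to PDF.", "label": true}
]"""

records = json.loads(SAMPLE_JSON)
metrics = evaluate(records, CAMERA_KEYWORDS)
```

On these four made-up records the classifier misses the barcode description because none of the assumed keywords appears in it, which shows how recall depends on keyword coverage. The tokenizer handles English only; the Chinese descriptions would require a word segmenter in place of the regular expression.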

If you are interested in accessing the ACODE dataset, please send us an email from your university or company email account.

Email: acode@nsl.cs.waseda.ac.jp

In your email, please include your name, affiliation, and your purpose for using the dataset. We use the information for verification. If you are a student, please indicate the name of your supervisor and her/his affiliation.
If your papers or articles use our dataset, please cite our SOUPS 2015 paper below.

Publication

T. Watanabe, M. Akiyama, T. Sakai, H. Washizaki, and T. Mori, “Understanding the Inconsistency between Behaviors and Descriptions of Mobile Apps,” IEICE Transactions on Information and Systems, Vol. E101-D, No. 11, pp. 2584–2599, November 2018.

Acknowledgements

A part of this work was supported by JSPS Grant-in-Aid for Scientific Research B, Grant Number JP16H02832.