In community-based software development, developers frequently rely on live-chatting to discuss emergent bugs/errors they encounter in daily development tasks. However, it remains a challenging task to accurately record such knowledge due to the noisy nature of interleaved dialogs in live chat data. In this paper, we first formulate the task of identifying and synthesizing bug reports from community live chats, and propose a novel approach, named BugListener, to address the challenges. Specifically, BugListener automates three sub-tasks: 1) disentangle the dialogs from massive chat logs by using a feed-forward neural network; 2) identify the bug-report dialogs among the separated dialogs by modeling each original dialog as a graph-structured dialog and leveraging a graph neural network to learn the contextual information; 3) synthesize the bug reports by utilizing the TextCNN model and a transfer learning network to classify the sentences into three groups: observed behaviors (OB), expected behaviors (EB), and steps to reproduce the bug (SR). BugListener is evaluated on six open source projects. The results show that, for bug report identification, BugListener achieves an average F1 of 74.21%, improving on the best baseline by 10.37%; for the bug report synthesis task, BugListener classifies the OB, EB, and SR sentences with F1 scores of 67.37%, 87.14%, and 65.03%, improving on the best baselines by 7.21%, 7.38%, and 5.30%, respectively. A human evaluation also confirms the effectiveness of BugListener in generating relevant and accurate bug reports. These results demonstrate the significant potential of applying BugListener in community-based software development for promoting bug discovery and quality improvement.

Communication surrounding the development of an open source project largely occurs outside the software repository itself. Historically, large communities often used a collection of mailing lists to discuss the different aspects of their projects. Multimodal tool use, with software development and communication happening on different channels, complicates the study of open source projects as a sociotechnical system. Here, we combine and standardize mailing lists of the Python community, resulting in 954,287 messages from 1995 to the present. We share all scraping and cleaning code to facilitate reproduction of this work, as well as smaller datasets for the Golang (122,721 messages), Angular (20,041 messages) and Node.js (12,514 messages) communities. To showcase the usefulness of these data, we focus on the CPython repository and merge the technical layer (which GitHub account works on what file and with whom) with the social layer (messages from unique email addresses) by identifying 33% of GitHub contributors in the mailing list data. We then explore correlations between the valence of social messaging and the structure of the collaboration network. We discuss how these data provide a laboratory to test theories from standard organizational science in large open source projects.

Software developers often use instant messaging platforms to communicate with each other and with other stakeholders. Among these platforms, Gitter has emerged as a popular choice, and the messages it contains can reveal important information to researchers studying open source software systems. Uncovering what developers are communicating about through Gitter is an essential first step towards successfully understanding and leveraging this information. In this paper, we first describe the largest manually labeled and curated dataset of Gitter developer messages, named GitterCom, obtained by manually analyzing and labeling 10,000 Gitter messages in 10 software projects. We then present a qualitative study to understand the extent to which the categories identified in previous work by Lin et al. (2016), found on Slack through surveys, are applicable to developer messages exchanged on Gitter. Further, in an effort to automate the labeling process, we investigate the accuracy of 9 traditional machine learning and deep learning algorithms in predicting the intent of Gitter messages. We found that Decision Trees and Random Forest performed the best, achieving an accuracy of 88%, which is very promising for this multi-class classification task. Finally, we discuss the potential directions for future research enabled by labeled Gitter datasets such as GitterCom.

Collaborative live chats are gaining popularity as a development communication tool. In community live chatting, developers are likely to post issues they encountered (e.g., setup issues and compile issues), and other developers respond with possible solutions.
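
The BugListener results above are reported as F1 scores. As a reminder of what that metric measures, the following is a minimal sketch of precision, recall, and F1 for a binary "bug-report dialog or not" decision. The function name `precision_recall_f1` and the label lists are hypothetical and purely illustrative; they are not data or code from any of the papers summarized here.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: actually positive but predicted negative.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: 1 = dialog is a bug report, 0 = it is not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, which makes it a common choice when positive examples (here, bug-report dialogs) are relatively rare.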