Empir Software Eng
DOI 10.1007/s10664-015-9393-5
An in-depth study of the promises and perils of mining
GitHub
Eirini Kalliamvakou 1 · Georgios Gousios 2 · Kelly Blincoe 1 · Leif Singer 1 ·
Daniel M. German 1 · Daniela Damian 1

1 University of Victoria, Victoria, BC, Canada
2 Radboud University Nijmegen, Nijmegen, The Netherlands

Communicated by: Sung Kim and Martin Pinzger

© Springer Science+Business Media New York 2015
Abstract With over 10 million git repositories, GitHub is becoming one of the most
important sources of software artifacts on the Internet. Researchers mine the information
stored in GitHub’s event logs to understand how its users employ the site to collaborate
on software, but so far there have been no studies describing the quality and properties of
the available GitHub data. We document the results of an empirical study aimed at under-
standing the characteristics of the repositories and users in GitHub; we see how users take
advantage of GitHub’s main features and how their activity is tracked on GitHub and related
datasets to point out misalignment between the real and mined data. Our results indicate
that while GitHub is a rich source of data on software development, mining GitHub for
research purposes should take various potential perils into consideration. For example, we
show that the majority of the projects are personal and inactive, and that almost 40 % of
all pull requests do not appear as merged even though they were. Also, approximately half
of GitHub’s registered users do not have public activity, while the activity of GitHub users
in repositories is not always easy to pinpoint. We use the identified perils to assess whether they pose validity threats: we review selected papers from the MSR 2014 Mining Challenge and examine the potential impact on their results. We provide a set of recommendations for
software engineering researchers on how to approach the data in GitHub.
Keywords Mining software repositories · git · GitHub · Code reviews
1 Introduction
GitHub is a popular collaborative code hosting site built on top of the git version control
system. As of January 2014, it hosts over 10.6 million repositories.[1] It includes a variety of
features that encourage teamwork and continued discussion over the life of a project. GitHub
uses a “fork & pull” model where developers create their own copies of a repository and
submit requests when they want the project maintainer to pull their changes into the main
branch, thus providing an environment in which people can easily conduct code reviews.
Every repository can optionally use GitHub’s issue tracking system to report and discuss
bugs and other concerns. GitHub also contains integrated social features: users are able to
subscribe to information by “watching” projects and “following” other users, resulting in a
constant stream of updates about people and projects of interest. The system supports user
profiles that provide a summary of a person’s recent activity within the site, such as their
commits, the projects they forked or the issues they reported.
Promise I: GitHub is a rich source of data for software engineering research

Software engineering researchers have been drawn to GitHub due to its popularity, as well as its integrated social features and the metadata that can be accessed through its API.
To date, there has been a variety of research on GitHub and its community. Qualitative studies (Begel et al. 2013; Dabbish et al. 2012; Marlow et al. 2013; McDonald and Goggins 2013; Pham et al. 2013; Gousios et al. 2015) have focused on how developers use GitHub’s social features to form impressions of and draw conclusions about developer and project activity to assess success, performance, and possible collaboration opportunities. Quantitative studies (Gousios et al. 2014; Takhteyev and Hilts 2010; Thung et al. 2013; Tsay et al. 2012) have attempted to systematically archive GitHub’s publicly available data and use it to investigate development practices and network structure in the GitHub environment.
We conducted an exploratory online survey to assess why developers use GitHub and
how it supports them in working with others, as part of our research on collaboration on
GitHub (Kalliamvakou et al. 2014a). While analyzing the survey data, we noticed that
GitHub repositories were also used for purposes other than strictly software development:
many respondents were using repositories to archive data, to host personal projects with-
out any plans to collaborate on their work, or for activities outside of software engineering.
This signaled that there may be significant unseen perils in using GitHub data “as-is” for
software engineering research. The variety of repository contents and activity, as well as
developers’ intentions, can alter research conclusions if care is not taken to first establish
that the data fits the research purpose.
[1] https://github.com/features
Previous research has identified the potential for misinterpretation when mining data
from SourceForge (Howison and Crowston 2004). Furthermore, Bird et al. (2009b)
described both the promises and perils associated with exploiting the information stored in
git, a decentralized version control system. Following this line of research, we formulated
the following research question:
RQ: What are the promises and perils of mining GitHub for software engineering
research?
This study highlights potential threats to validity for research that relies on GitHub as
the main source of data about software development. We use insights gained
from a survey conducted with 240 GitHub users to identify potential perils, and we provide
evidence of these perils based on quantitative analysis of the GHTorrent dataset as well as a manual inspection of 434 GitHub repositories. We outline some analysis risks to avoid and
provide recommendations on how researchers can best use the data available from GitHub.
To demonstrate the usefulness of these perils, we analyze four papers from the MSR’14
Mining Challenge (Baysal and Gousios 2014). We describe how the perils appear in the
dataset used in this challenge and how they may pose validity threats to the results of these papers.
This paper is an extended version of Kalliamvakou et al. (2014b), published at the Inter-
national Working Conference on Mining Software Repositories, MSR 2014. The extension
falls into two categories. First, we have identified four additional perils: two are related
to the information about GitHub users and two are related to the information that GitHub
makes available. Second, we have assessed the perils’ potential impact as threats to validity
on selected studies.
2 Background & Related Work
Since many of the projects hosted on GitHub are public, anyone with an Internet connection
can view the activity within those projects, including information about issues, pull requests,
commits, comments and subscriptions. The large amount of public data on GitHub and its
availability via an API make it possible for researchers to easily mine project data. Various
tools and datasets have been created to assist researchers with this task.
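To illustrate the kind of data the API exposes, the following Python sketch (ours, not part of the cited tools) lists recent public events for a single repository; it assumes the standard GitHub REST API v3 events endpoint and the third-party requests library, and it ignores pagination and rate limiting.

import requests

def recent_events(owner, repo, token=None):
    """List (type, actor, timestamp) for a repository's recent public events."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:  # unauthenticated requests are heavily rate-limited
        headers["Authorization"] = "token " + token
    url = f"https://api.github.com/repos/{owner}/{repo}/events"
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    # Each event carries a type (PushEvent, PullRequestEvent, ...), an actor and a
    # creation timestamp; this is the raw material archives such as GHTorrent record.
    return [(e["type"], e["actor"]["login"], e["created_at"]) for e in response.json()]

if __name__ == "__main__":
    for event in recent_events("rails", "rails"):
        print(event)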
2.1 Background
Due to the abundance and availability of data, code hosting services such as GitHub have
piqued the interest of many software engineering researchers. The public availability of
data from many projects simplifies the data collection and processing issues that are often
encountered in research. However, there are still practical difficulties that can potentially
alter conclusions drawn from the data.
Another popular code hosting site, SourceForge, peaked in popularity prior to GitHub’s
wide-spread adoption (Finley 2011). Howison and Crowston (2004) noted that projects
hosted on SourceForge were often abandoned and that their data was often contaminated
with data imported from previous systems. They also found that information was often
missing due to project data being hosted outside of the SourceForge space. Similarly,
Weiss (2005) concluded that not all SourceForge data is to be considered perfect: names
of categories often change in SourceForge and projects are constantly initiated and then go
inactive. By comparing his data to that of FLOSSMole,[2] Weiss highlighted that informa-
tion about inactive and inaccessible projects was missing altogether. Rainer and Gale (2005)
conducted an in-depth analysis of the quality of SourceForge data. They noted that only
1 % of SourceForge projects were actually active as indicated by their metrics. The authors
suggested caution when using SourceForge data and advised that the research community
should perform an evaluation of the quality of data taken from portals such as SourceForge.
In this paper, we present study findings highlighting potential risks for researchers to keep
in mind when drawing conclusions from GitHub data.
Other recent studies have identified biases in bug-fix datasets that can compromise the
validity and generalizability of studies using the datasets. Researchers often rely on links
between bugs and commits made in commit logs, but linked bugs represent only a fraction of
the entire population of fixed bugs. Bird et al. (2009a) found that this set of bugs is a biased sample of the entire population. Bachmann et al. (2010) found that the set of bugs in a bug tracking system itself may be biased since not all bugs are reported through those systems. Nguyen et al. (2010) discovered that similar biases exist even in commercial projects that employ strict guidelines and processes. However, Rahman et al. (2013) showed that a large
sample size can counter the effects of bias. In our work, we show that bias exists across
large GitHub datasets and provide recommendations on how to avoid such biases.
Bird et al. (2009b) described the problems that mining git poses for software engi-
neering research. Their work demonstrated that the differences between centralized version
control systems (such as subversion) and git created certain challenges for those using
git repositories for research.
2.2 Related Work
The introduction of social features in code hosting sites has drawn much attention from
researchers. Several qualitative studies have interviewed GitHub users to better under-
stand how these social features are being used (Begel et al. 2013; Dabbish et al. 2012; Marlow et al. 2013). Their findings indicate that GitHub users form impressions of and draw
conclusions about the activities and potential of developers and projects. Users then inter-
nalize those conclusions to decide whom and what to keep track of, or where to contribute
next. The transparency brought about by these social features also appears to allow teams to
maintain awareness of their members’ activity and use this towards organizing their work.
Pham et al. (2013) investigated whether the higher visibility of developer actions enabled by
GitHub’s social features has an influence on developers’ testing behaviors. Through inter-
views and an online survey, they highlighted the challenges of promoting a desirable testing
culture among contributors and suggested strategies for doing so.
Tsay et al. (2012) quantitatively studied the impact of GitHub’s social features on project
success on 5,000 projects. McDonald and Goggins (2013) interviewed GitHub users to iden-
tify how they measured success on their projects. Their study found that project members
see GitHub’s social features as the driver behind increased contribution.
Research on GitHub has extended beyond its social features. Thung et al. (2013) built social networks of developers involved with 100,000 GitHub projects to demonstrate the social structure of the GitHub ecosystem. Takhteyev and Hilts (2010) looked at the geo-
graphic locations of GitHub developers by examining self-reported location information
[2] A collection of open source software data, formerly known as OssMole.
available within GitHub profiles. Gousios et al. (2014) examined how pull requests work
on GitHub. They found that the pull request model offers fast turnaround, increased oppor-
tunities for community engagement, and decreased time to incorporate contributions. They
showed that a relatively small number of factors affect both the decision to merge a pull
request and the time to process it. They also qualitatively examined the reasons for pull
request rejection and found that technical reasons are a small minority. They also demon-
strated that many pull requests that appear to be unmerged in GitHub were actually merged.
This paper extends their work.
Several research projects have provided easier access to the data available through the GitHub API. The GHTorrent (Gousios and Spinellis 2012) project provides a mirror of the GitHub API data, which it obtains by monitoring and recording GitHub events as they occur and applying recursive dependency-based retrieval of the related resources. When run in standalone mode, GHTorrent can also retrieve the history of individual repositories. Gousios
and Zaidman (2014a) have combined this dataset with their research in Gousios et al. (2014)
to provide a dataset of pull requests for GitHub projects.
The GitHub archive (Grigorik 2012) provides a dataset of the history of events in
GitHub. It also obtains its data by monitoring the GitHub timeline. However, as the GitHub
archive started data collection in 2011, it is an incomplete mirror—GHTorrent, in compar-
ison, has retrieved the complete history of GitHub. Moreover, one can use tools such as
Gitminer (Wagstrom et al. 2013) to extract the history of events for specific repositories.
Gitminer crawls the GitHub API for any desired project and produces a graph dataset.
Tsay et al. (2014) mined the data they needed for their research using GitHub’s API. They
described that in order to reach meaningful conclusions, they had to filter out the majority
of projects in GitHub because many were inactive, had very few contributors or did not use
GitHub’s issue tracking system.
Researchers have taken advantage of the large amount of data available from GitHub
and tools like the GHTorrent, the GitHub archive and Gitminer to perform studies across a
large number of projects. Studies have investigated testing patterns (Kochhar et al. 2013),
programming languages (Bissyande et al. 2013), issue reporting (Bissyande et al. 2013),
project success (Tsay et al. 2012), and more.
3 Study Design
This paper describes an analysis that was motivated by our previous study of the GitHub
environment (Kalliamvakou et al. 2014a). The goal of that study was to examine how
GitHub is used for collaboration through surveys and interviews. We selected survey par-
ticipants from GitHub’s public event stream in May 2013, choosing recently active users
with public email addresses. Our survey was exploratory with open-ended questions asking
about reasons for using GitHub, how GitHub supports collaboration, managing dependen-
cies and tracking activity, as well as GitHub’s effect on the development process. We sent
our survey to 1,000 GitHub users and received 240 responses (24 % response rate). We
received several unexpected responses regarding the purpose of using GitHub. For example,
respondents noted they used GitHub for purposes other than code hosting or collaborative
development, such as for data storage, personal projects and class projects.
In an ongoing project, we found that choosing which GitHub-based collaborative soft-
ware engineering projects to study was not a trivial task. Frequently, projects were empty,
had very few files or had been inactive for a long time. It was also common to find
repositories where the only contributor was its owner.
These cases motivated our further analysis of the GitHub repository contents and of col-
laboration within GitHub, as discussed in Sections 4.3 and 4.4. We then quantitatively and
qualitatively analyzed the GitHub data to identify and measure the extent and frequency
of perils. For the purposes of this paper, we define a “peril” as a characteristic of the data
that can be retrieved from GitHub that can potentially threaten the validity of software
engineering research that uses such data.
Our process was divided as follows:
1. Quantitative analysis of project metadata. We used the GHTorrent (Gousios and Spinellis 2012) Jan 2014[3] dataset for our study. GHTorrent is a comprehensive collection of GitHub repositories, their users, and their events (including commits, issues and pull requests), as described in Section 1. We also used the MSR’14 Mining Challenge Dataset (Baysal and Gousios 2014) and cloned many repositories in GitHub in order to compare the GHTorrent data with current repositories.
2. Manual analysis of a 434-project sample. To supplement our quantitative analysis, we
also performed an in-depth manual analysis on a random sample of 434 projects from
the 3 million projects that exist in the GHTorrent dataset (cf. Peril I in Section 4 for our
definition of a project). This sample size provides a confidence level of 95 % with a
±5 % confidence interval.
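As a quick sanity check of the sampling claim above, the following sketch computes the margin of error of a proportion estimated from a simple random sample of 434 projects drawn from roughly 3 million, at 95 % confidence and worst-case p = 0.5; it is an illustration and not part of the original study tooling.

import math

def margin_of_error(sample, population, z=1.96, p=0.5):
    # Standard proportion-based margin of error with a finite-population correction,
    # which barely matters at a population of this size.
    fpc = (population - sample) / (population - 1)
    return z * math.sqrt(p * (1 - p) / sample * fpc)

print(round(margin_of_error(434, 3_000_000) * 100, 1))  # about 4.7, i.e., within +/-5 %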
4 Results
Using a mixed methods research approach, we identified thirteen perils that pose potential
threats to validity for studies involving software projects hosted in GitHub (Table 1 summa-
rizes them). In this section, we describe and provide supporting evidence for each peril, and
include recommendations on how to avoid them.
4.1 Repositories are Part of Projects
Peril I: A repository is not necessarily a project
The typical pull request development model (as used by GitHub) is a newer method for
collaborating in distributed software development (Gousios and Zaidman 2014b). With this
model, the project’s main repository is not writable by potential contributors. Instead, the
contributors fork (clone) the repository and make their changes independent of each other.
When a set of changes is ready to be submitted to the main repository, they create a pull
request which specifies a local branch to be merged with a branch in the main repository.
A member of the project’s core team (a committer of the destination repository) is then
responsible for inspecting the changes and pulling them into the project’s master branch.
If changes are considered unsatisfactory (e.g., as a result of a code review), more changes
may be requested. In this case, contributors need to update their local branches with the new
commits.
[3] http://ghtorrent.org/downloads.html
Table 1 Summary of the perils discovered in our study

Project Related
I    A repository is not necessarily a project. A project is typically part of a network of repositories: at least one of them will be designated as central, where code is expected to flow to and where the latest version of the code is to be found.
II   Most projects have low activity. Most projects have very few commits.
III  Most projects are inactive. Most projects do not have recent activity (only 13 % of projects have been active in the last month).
IV   Many projects are not software development. A large portion of projects are not used for software development activities.
V    Most projects are personal. More than two thirds of projects (71.6 % of repositories) have only one committer: its owner.
VI   Many active projects do not use GitHub exclusively. Many active projects do not conduct all their software development activities in GitHub.
VII  Few projects use pull requests. Only a fraction of projects use pull requests, and of those that use them, their use is very skewed.

Pull Requests Related
VIII Merges only track successful code. If the commits in a pull request are reworked (in response to comments), GitHub records only the commits that are the result of the peer review, not the original commits.
IX   Many merged pull requests appear as non-merged. Only pull requests merged via the “Merge” button are marked as merged. But pull requests can also be merged via other methods, such as using git outside GitHub; in those cases, the pull request will not appear as merged.

User Related
X    Not all activity is due to registered users. The activity in GitHub repositories is sometimes due to non-users; in some cases, the activity of a user is not properly associated with her account.
XI   Only the user’s public activity is visible. Approximately half of GitHub’s registered users do not work in public repositories.

GitHub Related
XII  GitHub’s API does not expose all data. The GitHub API exposes either a subset of events or entities, or a subset of the information regarding the event or the entity.
XIII GitHub is continuously evolving. GitHub continues to evolve and it has changed some features and provided new ones. Similarly, the projects evolve and are capable of changing their own history.
Due to this popular development model, repositories can be divided into two types: base
repositories (ones that are not forks) and forked repositories. The activity in forked repos-
itories is recorded independently from their associated base repositories. Until a commit
is pulled into another repository, this commit appears only in the history of the repository that received it.[4] Therefore, measuring the activity of a base repository independently of its forks ignores the activity in the forks that has not yet been merged, even though it is part of a single project.
For example, the Ruby on Rails project[5] has 8,327 forks (8,275 forks were made directly
from its base repository, with the remainder being forks of forks). Of the 50k commits in
the Rails repository, GHTorrent reports only 34k commits as having occurred in the Rails
base repository (rails/rails), and the remaining 16k originating in its forks. However, 11k
commits have been made in forks but have not been propagated to the base repository.
To properly account for all the activity of a software development team, in the rest of this
paper we aggregate all the activity of the base repository and its forks. Thus we use the term
project to refer to a base repository and its forks, and continue to use the term repository
to denote a GitHub repository (either a base repository or a fork).
Many, 3.0M (44 %), of the 6.8M public repositories in GitHub are base repositories.
Thus, these base repositories represent 3.0M different projects (only 0.6M of them have
been forked at least once). For the base repositories with at least one fork, their number of
forks is highly skewed: 80 % have one fork only and 94 % have at most 3 forks. However,
there are some repositories that are heavily forked: 4,111 base repositories have been forked
at least 100 times. The most forked repo is octocat/Spoon-Knife (22,865 forks), a GitHub-administered repository for users to test how forking works.
It is important to highlight that many forks can operate independently from the rest of
the project. For example, a fork could be used to develop customizations that are never
intended to be contributed back into the main project. However, it is difficult to determine if
a repository that has yet to contribute to the project will or will not contribute in the future.
Peril avoidance strategy To analyze a project hosted on GitHub, consider the activity in
both the base repository and all associated forked repositories.
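A hedged sketch of how this aggregation could be implemented against a local GHTorrent MySQL dump follows; the table and column names (projects.forked_from, project_commits) reflect our reading of the GHTorrent schema and should be checked against the dump actually used.

import MySQLdb  # assumes the mysqlclient package and an imported GHTorrent dump

def project_commit_count(conn, base_project_id):
    """Count distinct commits of a project: the base repository plus all transitive forks."""
    cur = conn.cursor()
    repo_ids, frontier = {base_project_id}, [base_project_id]
    while frontier:  # walk forks of forks until no new repositories are found
        cur.execute("SELECT id FROM projects WHERE forked_from IN (%s)"
                    % ",".join("%s" for _ in frontier), frontier)
        new = [row[0] for row in cur.fetchall() if row[0] not in repo_ids]
        repo_ids.update(new)
        frontier = new
    # Count each commit once, no matter how many repositories it appears in.
    cur.execute("SELECT COUNT(DISTINCT commit_id) FROM project_commits WHERE project_id IN (%s)"
                % ",".join("%s" for _ in repo_ids), list(repo_ids))
    return cur.fetchone()[0]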
[4] GHTorrent associates a commit with the repository where it first sees it (table commits) and also links it to all repositories this commit has appeared in (table repo_commits).
[5] http://rubyonrails.org; GitHub repository located at https://github.com/rails/rails.
4.2 On the Activity of Projects
Commits are a strong reflection of the activity in GitHub—in all of GitHub, there are more
than 20 times more commits than pull requests or issues. Thus, we can measure the activity
of a project using two different proxies: by its number of commits and by the period in
which its commits are made.
Peril II: Most projects have low activity
We counted the number of commits per project—that is, the union of all the commits in
all the repositories of a given project. Figure 1 shows the cumulative distribution (which is
very skewed) with a median number of commits of only 6 and a maximum of 427,650 (these
calculations do not include projects with zero commits: 398,244 projects, 13.3 %, had no
commits).
There is a large number of projects with little activity, and the most active projects
account for the majority of commits in GitHub. This is shown in the Lorenz curve in Fig. 2
that depicts the inequality of commits across the population of projects. The most active
2.5 % of projects account for the same number of commits as the remaining 97.5 % of projects.
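The concentration statistic behind Fig. 2 can be reproduced in a few lines; the sketch below uses illustrative commit counts rather than the GHTorrent data.

import numpy as np

def share_of_projects_for_half_the_commits(commit_counts):
    counts = np.sort(np.asarray(commit_counts))[::-1]      # most active projects first
    cumulative = np.cumsum(counts) / counts.sum()           # running share of all commits
    projects_needed = np.searchsorted(cumulative, 0.5) + 1  # projects covering 50 % of commits
    return projects_needed / len(counts)

# Toy example: one very active project dominates a long tail of small ones.
print(share_of_projects_for_half_the_commits([1000, 30, 20, 10, 5, 5, 3, 2, 1, 1]))  # 0.1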
Peril Avoidance Strategy Consider the number of recent commits on a project to select
projects with an appropriate activity level. Avoid claims of generalization if your study
considers only very active projects, as these are only a small set of those hosted on GitHub.
Peril III: Most projects are inactive
Since many projects have few commits, it is likely that many will also be inactive.
Fig. 1 Cumulative ratio of projects with a given number of commits (includes only projects with at least one commit). Most projects have very few commits. The median number of commits per project is 6 and 90 % of projects have less than 50 commits

Fig. 2 Lorenz curve showing that a small number of projects account for most of the commits

Figure 3 shows the cumulative ratio of projects that have had activity during the last n months. For instance, in the last 6 months (since July 9, 2013), only 54 % of the projects were active. However, many projects were created during this period (34 % of all projects
in GitHub). Of the 1,958,769 projects that were created before July 9, 2013, only 430,852
(22 %) had at least one commit in the last 6 months.
Project activity can also be measured by comparing the date its first repository was cre-
ated in GitHub with the date of its last commit (as shown in Fig. 4). In this regard, the
median number of days a project is active is 9.9 days. 32 % of projects were active for 1 day,
suggesting that they are being used either for testing or for archival purposes. Only 38 %
were active for more than 1 month. However, many active projects continue to be active:
25 % of projects have at least 100 days of activity.
Fig. 3 Cumulative ratio of active projects during the last n months since Jan 9, 2014. The red line is the proportion of projects created during the last n months. Approximately 54 % of projects have been active in the last six months. Only 12.5 % of projects were active in the last month and 4 % of them were created during that period
Fig. 4 Cumulative ratio of projects that had activity the last n days since their creation. The median number of days is 9.9, with 25 % of projects at 100 or more days; only 32 % had activity less than 1 day after being created
Peril Avoidance Strategy To identify active projects, consider the number of recent
commits and pull requests.
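One way to apply this strategy to a candidate repository is sketched below, using the commits endpoint of the GitHub REST API and its since parameter; the cutoff date is only an example, and forks would still have to be checked separately (see Peril I).

import requests

def is_recently_active(owner, repo, since="2013-07-09T00:00:00Z", token=None):
    """Return True if the repository has at least one commit after the cutoff date."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    response = requests.get(url, headers=headers,
                            params={"since": since, "per_page": 1}, timeout=30)
    response.raise_for_status()
    return len(response.json()) > 0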
4.3 On the Contents of Projects
Peril IV: Many projects are not software development
Our survey responses indicated that GitHub is used for various purposes besides soft-
ware development. 34 of our 240 respondents (14 %) said they use GitHub repositories for
experimentation, hosting their Websites, and for academic/class projects. About 10 % of
respondents use GitHub specifically for storage.
A repository’s purpose cannot be reliably and automatically identified from the project
metadata. We used the 434 randomly selected repositories to determine if GitHub reposito-
ries are used for software development or other purposes; this sample provides a confidence
level of 95 % with a ±5 % confidence interval. We reviewed the description of and files
associated with each repository and assigned an appropriate label to mark its contents, e.g.,
“software library” or “class project” using standard qualitative coding techniques (Corbin
and Strauss 2008). Open coding was used to identify labels for each repository. The open
coding was performed by two individuals who each coded half of the repositories. After
the open coding, the two coders agreed upon a set of labels and used axial coding to aggre-
gate the labels to create exclusive categories of use. We defined the purpose of repositories
as “Software development” if their contents were files that are used to build tools of any
sort. This type of use included repositories of libraries, plugins, gems, frameworks, add-ons, etc. “Experimental” was the class of repositories containing examples, demos, samples, test code and tutorial examples. Websites and blogs were classified under “Web”, and class and research projects under “Academic”. The “Storage” category included repositories that contained configuration files (including “.” files) or other documents and files for personal use, such as presentation slides, resumes and such. Repositories that gave an error (404 “This is not the repository you are looking for.”) were marked as “No longer accessible”. Repositories containing only a license file, a gitignore file, a README file, or no files at all were placed in the category “Empty”. Table 2 shows our categories and the distribution of the 434 repositories.

Table 2 Number of repositories per type of use for the manual inspection. These categories are mutually exclusive

Category of use          Number of repositories
Software development     275 (63.4 %)
Experimental             53 (12.2 %)
Storage                  36 (8.3 %)
Academic                 31 (7.1 %)
Web                      25 (5.8 %)
No longer accessible     11 (2.5 %)
Empty                    3 (0.7 %)
In particular, the “Web” category has become an important use of GitHub. GitHub allows
its users to host Websites on its servers for free.[6] Repositories using this service typically
include github.io or github.com in their name. There are 73,745 projects with such names,
indicating the popularity of this free service.
Peril Avoidance Strategy When trying to identify which software development projects
to analyze, do not rely just on the types of files within the repositories, but also review
descriptions and README files to ensure the projects fit the research needs.
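A first-pass screen along these lines is sketched below: it fetches a repository’s description and README through the GitHub REST API and flags wording that suggests a non-development purpose. The keyword list is ours and purely illustrative, and a manual review such as the one described above is still needed for the final decision.

import base64
import requests

NON_DEV_HINTS = ("dotfiles", "my blog", "course", "homework", "slides",
                 "resume", "test repo", "personal website")

def non_development_hints(owner, repo, token=None):
    """Return the hint words found in a repository's description and README."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    base = f"https://api.github.com/repos/{owner}/{repo}"
    description = requests.get(base, headers=headers, timeout=30).json().get("description") or ""
    readme = ""
    readme_resp = requests.get(base + "/readme", headers=headers, timeout=30)
    if readme_resp.status_code == 200:  # README content is returned base64-encoded
        readme = base64.b64decode(readme_resp.json()["content"]).decode("utf-8", "ignore")
    text = (description + " " + readme).lower()
    return [hint for hint in NON_DEV_HINTS if hint in text]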
4.4 On the Users Involved with Projects
Peril V: Most projects are personal
In our survey, respondents were asked if they used GitHub primarily for collaboration
with others or for personal use. 90 out of 240 respondents (38 %) answered that they used
GitHub mainly for their own projects and not with the intention of collaborating with oth-
ers. This response was a motivating factor to look into how much collaboration and social
interaction is taking place in GitHub projects.
git commits record both the author (who wrote the patch) and the committer (who
committed the patch to the repository). The committer is the person who has write access to
the repository. In GitHub, only 2.9 % of commits have an author who is not the committer.
We can evaluate if a project is personal by counting the number of different committers in
all the repositories of the project.
The number of committers per project is very skewed: 67 % of projects have only 1 committer, 87 % have 2 or fewer, and 93 % have 3 or fewer. As expected, repositories have fewer committers than projects: 72 % have 1 committer, 91 % have 2 or fewer, and 95 % have 3 or fewer. The proportions are the same for numbers of authors. The number of committers in our manual sample is similar: 65 % had only one committer, 83 % two or fewer, and 90 % three or fewer.
[6] See http://pages.github.com/ for details.
Even though GitHub is targeted towards social coding, these results indicate that most
hosted projects are used by one person only. It is very likely that a large proportion of
projects with only one committer are for experimental or storage purposes.
Peril Avoidance Strategy To avoid personal projects, consider the number of committers.
4.5 On the Use of Non-GitHub Infrastructure
Peril VI: Many active projects do not use GitHub exclusively
It is difficult to identify whether the data in GitHub represents most (if not all) of the
visible activity of a development project. In other words, do projects in GitHub use other
forms of collaboration?
The survey responses indicated that project discussions take place outside GitHub. As
one of the respondents put it:
“Any serious project would have to have some separate infrastructure - mailing lists, forums, irc channels and their archives, build farms, etc. [...] Thus, while GitHub and all other project hosts are used for collaboration, they are not and cannot be a complete solution.”
This motivated us to investigate whether repositories use GitHub to host project code and
other content, but perform development and collaboration activities elsewhere.
This can be evaluated in several ways. One way is to determine if all the committers and
authors are users in GitHub. If a commit is made by someone who is not a GitHub user,
then GitHub records an email address as its committer rather than a GitHub user (see Peril
X Not all activity is due to registered users). In GitHub, 23 % of committers or authors of a
commit are not GitHub users. The likely reason for this result is that some git operations
from non-users have been merged outside GitHub and it is exacerbated by mirrors set up to
track activity in repositories outside GitHub.
Mirrors are replicas of the code hosted in another repository. In some cases, a mirror
project clearly indicates that GitHub is not to be used for submission of code. For example,
the project postgres-xc/postgres-xc states in its description “Mirror of the official Postgres-XC GIT repository. Note that this is just a *mirror* - we don’t accept pull requests on github....” Nonetheless, this project has 14 different forks.
We identified many repositories that are mirrors; GitHub officially maintains 91 mirrors of many popular projects.[7] Typically, the description of a repository states if it is a mirror. For example, the description of repository abishekk92/voipmonitor reads “A mirror of the SVN repo at https://voipmonitor.svn.sourceforge.net/...”. Descriptions can also indicate whether the mirror is automatic and note its frequency of update (e.g., “Mirror of official clang git repository located at http://llvm.org/git/clang. Updated hourly.”).
The case-insensitive regular expression mirror of .*repo|git mirror of finds 1,739 projects (12,709 repositories) as mirrors of repositories outside GitHub. The median number of commits is 52. Some of these repositories had a lot of activity: 78 had more than 1,000 commits (1.4 % of all repos with at least 1,000 commits). We examined 100 of these repositories and found that all of them were external mirrors. We identified many mirrors of SourceForge and Bitbucket repositories (Bitbucket is a competing git repository hosting service); these results are summarized in Table 3.

[7] https://github.com/mirrors

Table 3 Repositories hosted on GitHub labeled as mirrors. GitHub hosts mirrors from many sources, including SourceForge and Bitbucket. The bottom section shows subsets of the top section. Regular expressions are case insensitive

Set                       Used regular expression          Projects   Repos
Mirror of                 mirror of .*repo|git mirror of   1,851      12,709
Subsets
  Located on SourceForge  sourceforge|sf_net               117        511
  Located on Bitbucket    bitbucket                        91         249
  From Subversion repos   \W(svn|subversion)\W             622        4,966
  From Mercurial repos    \W(mercurial|hg)\W               113        590
  From CVS repos          \Wcvs\W                          55         212
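The regular expressions of Table 3 can be applied directly to repository descriptions; the sketch below shows one way to do so. Descriptions are assumed to come from GHTorrent or the GitHub API, and the example is the voipmonitor mirror mentioned above.

import re

MIRROR_RE = re.compile(r"mirror of .*repo|git mirror of", re.IGNORECASE)
SOURCE_HINTS = {
    "sourceforge": re.compile(r"sourceforge|sf_net", re.IGNORECASE),
    "bitbucket": re.compile(r"bitbucket", re.IGNORECASE),
    "subversion": re.compile(r"\W(svn|subversion)\W", re.IGNORECASE),
    "mercurial": re.compile(r"\W(mercurial|hg)\W", re.IGNORECASE),
    "cvs": re.compile(r"\Wcvs\W", re.IGNORECASE),
}

def classify_mirror(description):
    """Return the likely sources of a mirror, or None if the description is not a mirror."""
    if not description or not MIRROR_RE.search(description):
        return None
    return [name for name, pattern in SOURCE_HINTS.items()
            if pattern.search(description)] or ["unknown source"]

print(classify_mirror("A mirror of the SVN repo at https://voipmonitor.svn.sourceforge.net/"))
# ['sourceforge', 'subversion']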
The implication of these results is that part of the development of a project happens in GitHub, but not necessarily all of it.
The identification of development work occurring within mirrors hosted on GitHub
implies that some members of a project are using GitHub for one of two purposes. One pur-
pose is to develop their work and later submit it to the external repository. For example, the
project Linux-Samsung located at kgene/linux-samsung (which, according to GitHub, has
no forks and is not a fork itself) regularly contributes commits to the Linux kernel (we have
observed 123 commits in Linus Torvalds’ repository that originated here[8]). The second pur-
pose is to develop customizations of the original project for a different purpose, independent
of the original development team. In this category, we find multiple repositories that contain
variants of the kernel, such as 2.6.35 Kernel for Samsung Galaxy S series Phones or Kernel
2.6.35.7 modified for Dropad A8T and similar.
Some of the identified mirrors are, interestingly, from repositories that use other version
control systems, such as Mercurial, Subversion or CVS. This implies that, in some cases,
contributors prefer git over these other version control systems to do their daily work, but
this needs further research to be confirmed. Similarly, many projects use their own defect
tracking systems to handle issues. For example, Mozilla’s Gaia (mozilla-b2g/gaia), one of
the most active projects in GitHub, has disabled issue tracking in GitHub and expects users
to file issues through bugzilla.mozilla.org.
We sent an additional questionnaire to 100 GitHub users, which asked the user whether
they used GitHub’s tools or an external toolset for specific tasks, such as opening and merg-
ing pull requests, tracking issues, or for communication. The survey was sent via email, and
we received 27 responses (27 % response rate). Even though 52 % said they use GitHub to
open pull requests and 60 % said they use the site to accept and merge code changes, only
24 % said they use GitHub for code reviews. 32 % said they use an external tool for reviews.
This further confirms that, for many projects, not all software development activities occur within GitHub itself.
[8] We currently track all sources of commits in the Linux kernel: hydraladder.turingmachine.org
Peril Avoidance Strategy Avoid projects that have a high number of committers who
are not registered GitHub users and projects with descriptions that explicitly state they are
mirrors.
4.6 On Pull Requests
Promise II: GitHub provides a valuable source of data for the study of code reviews in
the form of pull requests and the commits they reference
The “Fork & Pull” development model was made popular on GitHub, but pull requests
are not unique to GitHub. In fact, git includes the git-request-pull utility which
provides the same functionality at the command line. GitHub and other code hosting sites
improved this process significantly by integrating code reviews, discussions and issues, thus
effectively lowering the entry barrier for casual contributions. Forking and pull requests
create a new development model where changes are pushed to the project maintainers and
go through code review by the community before being integrated.
Peril VII: Few projects use pull requests
The use of pull requests is not very widespread across all GitHub projects. Pull requests
are only useful between developers, and therefore, are non-existent in personal projects
(67 % of projects, see Peril V Most projects are personal). Of the 2.6 million GitHub projects
that represent actual collaborative projects (at least 2 committers), only 268,853 (10 %) used
the pull request model at least once; it is likely that the remaining 2.4M projects are using a
shared repository model exclusively (with no incoming pull requests) where all developers
are granted commit access. Moreover, the distribution of pull requests among projects is
highly skewed, as can be seen in Fig. 5. The median number of pull requests per project is 2 (44.7 % of projects have only 1 and 95 % have 25 or fewer).
Some projects receive a large number of pull requests. For example, in 2013 alone, the
Gaia phone application framework and the Homebrew package manager each received more
than 5,000 pull requests. In fact, a significant number of projects (1700) received more
than 100 pull requests in 2013. These projects can create a sample big enough to deliver
statistically significant results for many research questions.
Peril Avoidance Strategy When researching the code review process on GitHub, consider
the number of pull requests before selecting a project. Personal projects will rarely contain
pull requests.
4.6.1 Pull Requests as a Code Review Mechanism
Each GitHub pull request contains a branch (local or in another repository) from which a
core team member should pull commits. GitHub automatically discovers the commits to be
merged and attaches them to the pull request. By default, pull requests are submitted to the
destination repository for review. Any GitHub user can participate in the review. There are
two types of review comments:
– Discussion: Comments on the overall contents of the pull request. Interested parties engage in technical discussion regarding the suitability of the pull request as a whole.
– Code Review: Comments on specific sections of the code. The reviewer makes notes on the commit diff, usually of a technical nature to pinpoint potential improvements.

Fig. 5 Lorenz curve for the number of pull requests per project (left) and the corresponding histogram (right). The top 1.6 % of projects use 50 % of the total pull requests. These plots only include projects with at least one pull request
As a result of the review, pull requests can be updated with new commits or the pull
request can be rejected—either as redundant, uninteresting or duplicate. The exact reason a
pull request is rejected is not recorded, but can often be inferred from the comments.
With an update, the contributor creates new commits in the forked repository and, after
the changes are pushed to the branch to be merged, GitHub automatically updates the com-
mits in the pull request. The code review can then be repeated on the refreshed commits. In
our dataset of 434 projects, 17 % of the pull requests received an update after a comment
(discussion or code review). Care must be applied when interpreting this result as many
comments, especially in the discussion section, are merely expressions of gratitude for the
contributor’s work rather than a proper code review.
Pull request discussions are usually brief: 80 % of the pull requests have less than
3 comments (both code review and discussion). Moreover, the number of participants in
the code review ranges between 0 and 19, with 80 % of the pull requests having less
than 2 participants. The number of commits examined per peer review is less than 4 in
80 % of the pull requests. The numbers are comparable with other work on code review
(Rigby et al. 2008; Rigby and Bird 2013; Bacchelli and Bird 2013), which suggests that the
peer review process may have more fundamental underpinnings yet to be explored. There-
fore, GitHub data may be a very good source of quantitative data for peer review due to
homogenization across various project repositories (provided the following shortcomings
are taken into consideration).
In many cases, a pull request’s code review is implicit and therefore not observable. Many
pull requests that were merged received no comments (46 % in our 434-project sample). It
is probably safe to assume that the developer that performed the merge did inspect the pull
request before merging it. Thus, a code review occurred, but there is no information about
it except the fact that the code was merged (it is unlikely that a project will have a policy to
accept all pull requests without review).
Peril VIII: Merges only track successful code
It is also possible that the set of commits that were reviewed may not be readily
observable—and further processing could be required to recover them. Commonly, projects
require a commit squash (merging all different commits into a single one) before the set of
commits is merged with the main repository. While GitHub does record the intermediate
commits, it does not report them through its API as part of the pull request. Moreover, the
original commits are deleted if the source repository is deleted. This means that at the time
of analysis, the researcher can only observe the latest commit, which is the outcome of the
code review process.
Peril Avoidance Strategy To analyze the full set of commits involved in a code review,
do not rely on the commits reported by GitHub.
Peril IX: Many merged pull requests appear as non-merged
After a successful code review with a positive outcome, the pull request can be merged.
The versatility of git and GitHub enables at least three merging strategies:
– Using the Merge button within GitHub.
– Using git, by merging the main repository branch and the pull request branch. A variation of this merge strategy is cherry-picking, where only specific commits from the pull request branch are merged into the main branch.
– By creating a textual patch between the pull request and main repository branches and applying it to the master branch. This is also known as commit squashing.
Depending on the selected merge strategy, the amount of history (commit order) and
authorship information preserved will vary. Specifically, merging through either git or
GitHub preserves full historical information—except in the case of cherry-picking where
only authorship is preserved. A patch-based merge does not maintain authorship or history.
Further, GitHub can only detect and report merges happening through its pull request
merge facilities. Therefore, if a project’s policy is to only merge using git, all pull
requests will be recorded as unmerged in GitHub. In practice, however, most projects use a
combination of GitHub and git merge strategies.
GitHub provides a way to streamline the closure of pull requests and issues via the con-
tents of the log of a commit. For example, if a commit log contains the string Fixes #321
and 321 is a pull request or an issue, then this pull request or issue is closed. Fixes is one
of nine keywords that can be used.[9] For example, the project homebrew/homebrew has had 13,164 pull requests opened, 12,966 closed, but only 129 merged. However, its logs show that 6,947 pull requests (48 % of total) and 2,013 issues (19 %) have been closed from commit logs. This shows that, at least in some projects, one cannot rely on GitHub’s Merged attribute of a pull request.

[9] For the entire list visit https://help.github.com/articles/closing-issues-via-commit-messages.
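Recovering such closures from commit logs can be approximated with a pattern match over the closing keywords GitHub documents (close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved); the sketch below is our simplification and does not handle cross-repository references.

import re

CLOSES_RE = re.compile(
    r"\b(?:clos(?:e[sd]?)|fix(?:e[sd])?|resolv(?:e[sd]?))\s*:?\s+#(\d+)",
    re.IGNORECASE)

def closed_by_message(commit_message):
    """Return the issue/pull request numbers a commit message closes via keywords."""
    return [int(number) for number in CLOSES_RE.findall(commit_message)]

print(closed_by_message("Fixes #321 and closes #42; see also #7"))  # [321, 42]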
Pull requests merged outside GitHub can be identified through a set of heuristics based
on conventions advocated by GitHub. The most important are presented below (for a full description and evaluation of these heuristics, see Gousios et al. (2014)).

H1: At least one of the commits in the pull request appears in the target project’s master branch.
H2: A commit closes the pull request using its log (e.g., if the log of the commit includes one of the closing keywords, see above) and that commit appears in the project’s master branch. This means that the pull request commits were squashed onto one commit and this commit was merged.
H3: One of the last three (in order of appearance) discussion comments contains a commit unique identifier; this commit appears in the project’s master branch and the corresponding comment can be matched by the following regular expression: (?:merg|appl|pull|push|integrat)(?:ing|i?ed)
H4: The latest comment prior to closing the pull request matches the regular expression noted above.
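A hedged sketch of how these heuristics could be automated for a single pull request follows; it operates on data the researcher has already collected (the pull request’s commit SHAs, the SHAs reachable from the target master branch, a possible closing commit, and the discussion comments in chronological order) and is a simplification of the process evaluated in Gousios et al. (2014), not their implementation.

import re

MERGE_PHRASE_RE = re.compile(r"(?:merg|appl|pull|push|integrat)(?:ing|i?ed)", re.IGNORECASE)
SHA_RE = re.compile(r"\b[0-9a-f]{7,40}\b")  # full or abbreviated commit identifiers

def likely_merged(pr_commit_shas, master_shas, comments, closing_commit_sha=None):
    """Return the first heuristic (H1-H4) that marks the pull request as merged, else None."""
    master = set(master_shas)
    if any(sha in master for sha in pr_commit_shas):
        return "H1"                       # a pull request commit reached master
    if closing_commit_sha and closing_commit_sha in master:
        return "H2"                       # a squashed commit closed the pull request via its log
    for comment in comments[-3:]:         # H3: one of the last three discussion comments
        referenced = SHA_RE.findall(comment.lower())
        if MERGE_PHRASE_RE.search(comment) and any(
                full.startswith(ref) for ref in referenced for full in master):
            return "H3"
    if comments and MERGE_PHRASE_RE.search(comments[-1]):
        return "H4"                       # last comment before closing mentions merging
    return None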
Only 1,145,099 of 2,552,868 (44 %) pull requests are reported as merged across GitHub. In the 434-project sample, only 37 % of the pull requests were merged using GitHub facilities. By applying the heuristics presented above, an extra 42 % (H1: 32 %, H2: 1 %, H3: 5 %, H4: 4 %) of pull requests are identified as merged, while 19 % cannot be classified. In other work (Gousios et al. 2014), we used a carefully selected sample of 297 projects that heavily relied on pull requests: 65 % of the pull requests were merged using GitHub facilities, while the heuristics identified another 19 % (H1: 7 %, H2: 1 %, H3: 3 %, H4: 7 %) as merged. In another dataset (Gousios and Zaidman 2014b) containing almost 1,000 projects that use pull requests, 58 % of the pull requests were merged using GitHub’s facilities while 18 % are identified as unmerged. The remaining 24 % are identified as merged using the heuristics (H1: 11 %, H2: 3 %, H3: 3 %, H4: 7 %).
The heuristics proposed above are neither complete (i.e., they may not identify all merged pull requests) nor sound (i.e., they may lead to false positives, especially H4). In other
work (Gousios et al. 2014), we manually inspected 350 pull requests that were not identified
as merged and found that 65 of them were actually merged. This means the actual percentage
of merged pull requests may be even higher. The fact remains, however, that only a fraction
of merges are reported through GitHub, but heuristics can improve merge detection, in some
cases dramatically.
Peril avoidance strategy Do not rely on GitHub’s merge status, but consider using heuris-
tics (like the ones described above) to improve merge detection when analyzing merged pull
requests.
4.6.2 Pull Requests as an Issue Resolution Mechanism
Promise III: The interlinking of developers, pull requests, issues and commits provides
a comprehensive view of software development activities
For each opened pull request, an issue is opened automatically. Thus, issues and pull
requests are fused together on GitHub. Commits can also be attached to issues to convert
them to pull requests (albeit with external tools). The issue part of the pull request is used
to keep track of any discussion comments. Developers are encouraged to reference issues
or pull requests in commit messages or in issue comments, while GitHub automatically
extracts such references and presents them as part of the discussion flow. Moreover, both
issues and pull requests can be linked to repository-specific milestones, helping projects
track progress.
The tight integration of issues and pull requests opens a window of opportunity for
detailed studies of developer activity. For example, a researcher can track the resolution
of an issue from the reporting phase, through source code modifications, the code review
and the final integration of the fix. As user actions always affect issues and pull requests,
one could also investigate the formation of user clusters across specific types of activi-
ties, which would reveal emergent user organizations (teams or hierarchies). In addition,
the interlinking of issues, pull requests and commits creates an intricate web of actions
that could be analyzed using social network techniques to discover interesting collaboration
patterns.
There are two shortcomings despite this wealth of interlinked data. First, mining of issue tracking data is greatly enhanced if records are consistent across projects. GitHub’s issue tracker only requires a textual description to open an issue. Issue
property annotations (e.g., affected versions, severity levels) are delegated to repository-
specific labels. This means that issue characteristics cannot be examined uniformly across
projects. Second, across GitHub, only a small fraction (12 %) of repositories that were active in 2013 use both pull requests and issues. Many interesting repositories, especially those that migrated to GitHub, have an external issue database (see Peril VI Many active projects do not use GitHub exclusively).
4.7 On Users
Peril X: Not all activity is due to registered users
GitHub is a service built around git. A team of developers who use git can choose to
use GitHub for all or some of their development activities. GitHub enables teams to import
their git repositories into GitHub, even if some members of the development team are
not GitHub users. In some cases, such as with “mirrors”, it is possible that no one on the
development team is a registered GitHub user. This implies that some activities recorded in
GitHub are not performed by its registered users.
GitHub allows users to associate one or more email addresses with their account (no two
users can share the same email address). When GitHub receives a commit into a repository
via a push, it uses the email address of the committer and the author field of the com-
mit to associate the commit with a corresponding GitHub user. If the email address is not
registered to a user account, the commit is not linked to the account. For example, 15 reposi-
tories belonging to Kevin Incorvia (username incorvia) contain commits linked to the email address Kevin Incorvia <incorvia@Kevins-MacBook-Air.local>, but they are not associated to user incorvia because the email address is not associated with that user. Furthermore,
the email address is more likely a computer username rather than an actual email address.
We could speculate that this is due to using a git client, which pulled the GitHub user’s
username and the host name of their machine to combine it into an email address, rather
than asking them to enter an email address, as is the case of using the command line. In any
case, if one were to ask for the activity associated with user incorvia, these commits would
not be included. A similar case is the email address being empty or invalid. For example,
the repository TrinityCore/TrinityCore contains commits by the email address megamage
<none@none>. These commits are not associated with any user (GitHub’s interface does not even show a committer section while displaying the details of the commit and its API shows null as the author/committer). The impact of this association of emails to commits is
four-fold.
– Not all committers or authors of commits are registered GitHub users. By Decem-
ber 2014, we had identified 2.5M registered users and 0.6M email addresses that could
not be associated to registered users (we refer to these email addresses as non-registered
users); 84.4 % of commits (65.2M) were performed by registered users and 15.6 %
(12.1M) by non-registered users. Pull requests, issues and their comments can only be
made by registered users.
– A small number of users have commits that predate the creation of their GitHub
user account (1.5 %, 33,227). For example, Linus Torvalds joined GitHub on Septem-
ber 3, 2011, but has commits associated with his user account as early as September 4,
2007.
– A committer can make a commit appear as coming from another user by using one of the other user’s email addresses. For example, the commit 042343a09967445753b174b0b05c6ef3cfcf7f93 in the repository aaronraimist/public shows Aaron Raimist <[email protected]> as its author and committer, yet the commit is associated with Linus Torvalds and not with Aaron Raimist (Aaron is also the owner of the repository where the commit was found). This issue is probably rare and is difficult to identify.
It might be necessary to perform email unification to fully identify the activ-
ity of each user. While GitHub allows a user to have multiple addresses associated
with their account, the user must associate all their addresses. However, not all users
have registered all the email addresses they have used. For example, Linus Torvalds
has commits in GitHub with 10 email addresses (from the Linux Foundation and the
Open Source Development Labs) that are not associated with his GitHub account.
They have been used in 17,460 of his commits in GitHub, while his GitHub account
has 19,780 commits associated with it. In other words, 47 % of Torvalds’ commits
in GitHub are not associated with his user account. To quantify this effect empiri-
cally, we devised the following experiment. For each project, we identify persons who
commit with two different email addresses (the same name, but two different email
addresses) and whose name contains at least two words. The assumption is that for each
project there is only one person with a given firstname-lastname combination. In other
words:
– Select committers who have a name with at least one space in between. This
step selects committers with at least two words in their name (e.g. Linus Torvalds)
and avoids matching people who share the same first name or last name
(e.g. David).
– For each of these names, count how many email addresses they used in a
project. If the user is registered, we use their preferred email address for the
commit. If the user is not registered, we use the email address in the commit.
Note that this method is likely to underestimate duplicated emails per user since
there may be emails that lack a name, or their name may only contain one word, or a
person uses different ways to write their name (e.g., J. Smith and John Smith).
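To make the procedure concrete, a minimal sketch is given below; the input format (tuples of project, author name and author email) and all identifiers are illustrative assumptions rather than the exact implementation used in the study.

```python
# Minimal sketch of the per-project email-unification heuristic described above.
# Assumption: `commits` is an iterable of (project, author_name, author_email) tuples
# extracted from the mined data; the tuple layout is hypothetical.
from collections import defaultdict

def unification_candidates(commits):
    """Map (project, multi-word name) to the set of distinct email addresses it used."""
    emails = defaultdict(set)
    for project, name, email in commits:
        name = " ".join(name.split())   # normalize whitespace in the name
        if len(name.split()) < 2:       # keep only names with at least two words,
            continue                    # avoiding people who share a single first name
        emails[(project, name)].add(email.lower())
    # Names that used two or more addresses within the same project are candidates
    return {key: addrs for key, addrs in emails.items() if len(addrs) >= 2}
```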
We found that only 30.8 % (664,850) of registered GitHub users have two or more
words in their name, and 17 % of them (90,828 users) have at least one email address
that is not associated with their username in the same project (median of 2). These
email addresses have committed 2.09 million commits (2.7 % of all commits). This
number, however, corresponds to 22.1 % of commits by non-registered committers.
In other words, it is possible to associate approximately 1/4 of commits by non-
registered committers to their corresponding GitHub user. This effect seems to be small,
but it shows that a significant proportion of users have identities that have not been
unified.
Peril Avoidance Strategy For empirical studies that need to map activity to specific users,
use heuristics for email unification to improve the validity of the results.
Peril XI: Only the user’s public activity is visible
When we take a closer look at the activity of registered users, we notice substantial
inactivity. Before concluding that GitHub users appear to be generally inactive, however, we
need to keep in mind that we can only see public activity (actions that take place in public
repositories).
Let us focus on the commit as the basic unit of activity on GitHub. 97 % of the time, the
committer and the author are the same person. Hence, for this analysis we consider them
equivalent and look only at the committer field to estimate commit activity.
We found that out of the registered users on GitHub, 53.2 % do not have a single public
commit. This population can be further divided into two parts: 30 % of the registered users
do not have a public repository either, while the remaining 23.2 % have repositories but no
public commits. These repositories fall into two categories: empty repositories (e.g., used
for testing) and forks that have no activity.
The remaining 46.8 % of registered users (1.04 million) have at least one commit. As
shown in Fig. 6, the distribution of commits per user is highly skewed: the median is 10
commits with an average of 62.4 commits. The inequality in the number of commits per
registered users is substantial (see Fig. 7): 50 % of the commits have been performed by
3.2 % of registered users, and 25 % by 0.6 % of them. This is mainly due to the fact that
some registered users appear overly active with a very high number of commits.
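As an illustration of how such inequality figures can be derived, the sketch below computes, from a list of per-user commit counts (an assumed input, not data shipped with this paper), the fraction of users that accounts for a given share of all commits.

```python
# Hedged sketch: fraction of users accounting for a given share of all commits,
# assuming `commit_counts` holds one public commit count per registered user.
import numpy as np

def users_for_commit_share(commit_counts, share=0.5):
    counts = np.sort(np.asarray(commit_counts, dtype=float))[::-1]  # most active users first
    cumulative = np.cumsum(counts) / counts.sum()                   # cumulative share of commits
    users_needed = int(np.searchsorted(cumulative, share)) + 1
    return users_needed / len(counts)

# With the data reported above, users_for_commit_share(counts, 0.5) would be roughly 0.032,
# i.e. 50 % of commits made by about 3.2 % of registered users.
```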
However, considering only commits as a measure of activity excludes users who are
active in other ways. 24.4 % of registered users who do not have any commits have submit-
ted issues or participated in discussions around issues or pull requests. This shows that there
is a subset of GitHub users who do not publicly develop code (they could have private repos-
itories), but are actively contributing to GitHub repositories by identifying bugs, submitting
new feature requests, reviewing code or simply providing guidance to developers.
Peril Avoidance Strategy This peril is unavoidable when using data from public
Websites—acknowledging this partial view in the discussion of results and replicating a
study in other contexts can help reduce its impact.
[Figure 6 plot omitted; x-axis: number of commits by user, y-axis: cumulative proportion of registered users]
Fig. 6 Cumulative ratio of registered users that have n commits. 50 % of users (0.52 million) have less than
10 commits and only 10 % (0.1 million) have more than 50 commits
4.8 GitHub is an Evolving Entity
GitHub is not operating as an archive of software development activities for research pur-
poses; its goal is to provide "powerful collaboration, code review and code management for
open source and private projects".
Peril XII: GitHub’s API does not expose all data
[Figure 7 (Lorenz curve) plot omitted; x-axis: proportion of registered users, y-axis: proportion of commits]
Fig. 7 Lorenz curve showing that most commits by registered users are made by a small proportion of them.
E.g., 50 % of the commits by registered users have been performed by 3.2 % of them
While the repositories hosted in GitHub are continuously evolving, GitHub reports only
their current state (via its API). It does not report the historical events that shape the current
state of any repository. This results in several challenges for researchers:
– GitHub does not provide an API to retrieve all events. GitHub has created APIs
to list many of its entities (such as users, repositories and commits) and events (such
as opening or closing issues, or pull requests). However, it does not make them all
available. For example, GitHub does not expose pushes to a repository, the creation of
releases, clone operations or when a repository is made public. Some of these events
are available in the events API of a repository, but this API has the limitation of only
listing the last 300 events.
– Not all events are reported with a timestamp. In particular, the APIs for subscriptions
and "stars" do not return the time when such actions were initiated.
– Tracking renamed entities. A renamed repository keeps all its information under the
new name (including its forks). GitHub will redirect the URLs of renamed repositories,
but it will not do it for API requests (see the probe sketched after this list). For example,
the repository anders9898/jekyll was later renamed to anders9898/zzz. The URL
https://github.com/anders9898/jekyll redirects to https://github.com/anders9898/zzz, but
the API request https://api.github.com/repos/anders9898/jekyll returns "Not Found". In
the case of users, there is no way to know that a user has been renamed. In this case,
both the URL and the API requests for the old name will fail.
– Deleted entities. When a repository is deleted, all of its events and metadata are lost,
but the network of repositories that were forked from it remain untouched; one of its
forks will be chosen as the “root” of the rest of the forks. From this point on, GitHub
will report the deleted repository as “Not found”. Similarly, when commits are deleted,
GitHub has no mechanism to inform that certain commits were deleted from a reposi-
tory. Finally, once a user is deleted, all information regarding them (including the fact
that they were a GitHub user) is lost.
– Making a repository private causes it to appear as if it were deleted. Simi-
lar to when a repository is deleted, GitHub will report such a repository as “Not
found”. In this case, its forks will remain public and one of them is chosen as their
root.
– Rebased commits. When commits are rebased (one or more commits change their
metadata, or are modified and/or combined into new commits) the old commits disappear
from the repository and are replaced by the new commits. GitHub's events API does not
document commits that are removed or modified in a push; it only lists the commits that
are added and the head of the branch before and after the commit. Any request for the
old commits in the repository will result in a "Not found" message. The commits that
are deleted or rebased in one repository might still exist in another if they have already
been propagated there.
– Propagation of commits. GitHub tracks the movement of commits between reposito-
ries only when commits are merged using a pull request and those commits have not
been rebased or deleted after the merge. In this case, the merged pull request will doc-
ument what commits were merged, including the source and destination repositories.
If the merge was performed outside GitHub, then GitHub has no means of knowing
what the true source of the commits was. For example, assume Sally creates a fork F of
repository A and then clones it to her computer. Next, she “pulls” changes from another
fork G and then pushes the commits to her GitHub fork F . Under this scenario, GitHub
has no means of knowing that the new commits in her repository came from G. If, on
the other hand, there was a pull request from G to F and Sally merges it, then GitHub
will know that those commits were moved from G to F.
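As referenced in the item on renamed entities, the small probe below (a sketch that assumes the Python requests library) contrasts the Website, which follows the redirect of a renamed repository, with the API behaviour described above.

```python
# Hedged sketch: probe how an old repository name behaves on the Website vs. the API.
# Assumes the `requests` library; the 404 for the old name on the API reflects the
# behaviour described in the text at the time of the study.
import requests

def probe_old_name(owner, repo):
    web = requests.head(f"https://github.com/{owner}/{repo}", allow_redirects=True)
    api = requests.get(f"https://api.github.com/repos/{owner}/{repo}")
    return {
        "web_status": web.status_code,  # 200 after following the rename redirect
        "web_final_url": web.url,       # resolves to the new repository name
        "api_status": api.status_code,  # "Not Found" (404) for the old name per the text
    }

# Example from the text: probe_old_name("anders9898", "jekyll")
```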
Peril Avoidance Strategy Obtaining data from one of the services which archive data
from the GitHub API (like GHTorrent) can help avoid this peril. However, researchers
should be aware that such services contain their own assumptions regarding the collected
data.
Peril XIII: GitHub is continuously evolving
Over time, GitHub has changed some of its features and interface. For example,
GitHub’s “watch” feature was originally intended to be used by those who wanted to
receive notifications regarding activity (commits, pull requests and issues) for any reposi-
tory of their choosing. In August 2012, GitHub decided to improve their notification system
(Neath 2012). The first change was the introduction of “starring”. “Starring” a repository
is equivalent to bookmarking it. Any previously “watched” repository became a “starred”
repository, and the old notification system surrounding “watchers” changed to an opt-out
subscription of events notification system (a person with commit privileges to a repository
automatically “watches” the repository). As a way to maintain backwards compatibility
with external applications that used this information, GitHub currently returns the list of
"starred" projects under its Watched API (e.g., /users/:user/watched). This introduces two
potential problems for researchers. First, the meaning of “watchers” is different before and
after August 2012. Second, there is a dissonance between GitHub's interface and its API:
"Watchers" are returned using the Subscriptions API, and "stars" (those who have "starred"
a repository) are returned via the Watched API.
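The sketch below, which assumes the requests library and unauthenticated (rate-limited) calls, retrieves both lists for a user and makes the dissonance visible: the starred endpoint returns what used to be called "watching" before August 2012, while the subscriptions endpoint returns today's watchers.

```python
# Hedged sketch: fetch a user's stars and subscriptions to observe the API dissonance
# described above. Only the first page of results is retrieved, for brevity.
import requests

def stars_and_subscriptions(user):
    starred = requests.get(f"https://api.github.com/users/{user}/starred").json()
    watching = requests.get(f"https://api.github.com/users/{user}/subscriptions").json()
    return {
        "starred": [r["full_name"] for r in starred],      # bookmarks ("stars")
        "subscribed": [r["full_name"] for r in watching],  # notification subscriptions
    }
```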
More recently (November 2014), GitHub silently disabled the ability to retrieve reposi-
tory collaborators for a specific repository. The only way to retrieve this information now is
to query and keep track of the live event stream for a particular repository or sets of projects.
Both the above changes had a direct impact on the API. There are also frequent changes
to the GitHub interface that do not leave an API footprint, yet they have the potential of
changing user behavior. As an example, the issue tracking system in GitHub was improved
significantly on July 28, 2014 (https://github.com/blog/1866-the-new-github-issues).
The interface changes related to improving search and
filtering, showing a timeline of issue-related activities (such as assigning labels, changing
issue names and adding comments) and improving the editing of milestones and labels
for issues. Although the user data created and stored was not affected, researchers should
still keep track of when the GitHub interface changed and what the improvements were.
Capturing and measuring the actual impact of interface changes on user behavior would be
a separate research undertaking, as it is beyond the scope of this paper. However, we want
to keep researchers alert that patterns seen in the data could be explained by accounting for
changes in the interface, reflecting changes in user behavior, even if they were not captured
in the API.
Peril Avoidance Strategy Understand how both the GitHub API and the Website have
evolved over time. Changes to the Website are often posted to the GitHub blog
(https://github.com/blog/category/ship), but this is not guaranteed.
4.9 Relationship Between Perils
It is possible for one project to be subject to more than one peril. To calculate the extent to
which this can happen, we calculated the pairwise appearance of the perils in our dataset.
More formally, for each peril $P_A$ in the set of perils $P = \{P_1, \ldots, P_{10}\}$, we calculated a list
of projects $L_{P_A}$ which this peril may affect. Then, we examined whether each peril $T_B$ in
the set $T = P \setminus \{P_A\}$ also affects the projects in $L_{P_A}$ and therefore came up with a second
project list $L_{P_A}^{T_B}$. The ratio of projects in $L_{P_A}$ that are affected by both perils $P_A$ and $T_B$
is $|L_{P_A} \cap L_{P_A}^{T_B}| / |L_{P_A}^{T_B}|$. This is done only for perils that could be quantified. These results
are shown in Table 4. For these calculations we only considered projects with at least one
fork. For example, projects affected by Peril II (projects with less than 6 commits) overlap
with 93 % of projects affected by Peril III (projects that have been inactive during the last
month). But projects affected by Peril III overlap only 24 % with projects affected by Peril
II. Similarly, projects affected by Peril V (projects with one committer) overlap 70 % with
projects affected by Peril III (projects that have been inactive during the last month). Conversely,
projects affected by Peril III only overlap 19 % with those affected by Peril V.
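A minimal sketch of this pairwise computation is shown below; it assumes that the set of (forked) projects affected by each quantifiable peril has already been extracted into peril_projects, and it follows the reading of Table 4 (the share of the column peril's projects that are also affected by the row peril).

```python
# Hedged sketch of the pairwise peril overlap, assuming `peril_projects` maps each
# quantifiable peril (e.g. "P II") to the set of project identifiers it may affect.

def peril_overlap(peril_projects):
    """overlap[(col, row)] = share of projects affected by `col` that are also affected by `row`."""
    overlap = {}
    for col, col_projects in peril_projects.items():
        for row, row_projects in peril_projects.items():
            if col == row or not col_projects:
                continue
            overlap[(col, row)] = len(col_projects & row_projects) / len(col_projects)
    return overlap
```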
Overall, we can see that there is a significant overlap among perils. Peril III (most
projects are inactive) seems to be strongly related to most other perils. In general, the days
a project has been active (number of days from earliest to latest commit) is highly correlated
with the number of commits (0.61; all correlations presented in this section are Spearman
and all have negligible p-values), the number of forks (0.35), and the number of committers (0.40).
Table 4 Percentage of projects susceptible to more than one peril

Peril    P I    P II   P III   P V    P VII   P X
P II      22     –      24     12     19      14
P III     83     93     –      78     70      70
P V       20     12     19     –      25       0
P VII     66     70     71     59     –       52
P X       25     15     20     42     37      –

Peril                                        Quantification method
P I    A repository is not necessarily       Projects that have at least one fork.
       a project
P II   Most projects have low activity       Projects that had less than 6 commits.
P III  Most projects are inactive            Projects that had no activity (commit, pull request, issue)
                                             during the last month (Dec 2013).
P V    Most projects are personal            Projects that had only one committer.
P VII  Few projects use pull requests        Projects that had never received a pull request (any time).
P X    Not all activity is due to            Projects that had commits by non-registered users.
       registered users

The table is read as follows: "From the repositories that are susceptible to peril x (column) Y % are also
susceptible to peril z (row)". For these computations we only considered projects with at least one fork. We
only include perils whose effect can be quantified. For example, projects affected by Peril II (projects with
less than 6 commits) overlap with 93 % of projects affected by Peril III (repos that have been inactive during
the last month). But projects affected by Peril III overlap only with 24 % of projects affected by Peril II
Other perils seem correlated in one direction only: Peril II is highly related
to Peril VII (projects with no pull requests), but the opposite is not true (many projects with
no pull requests have many contributors).
Regarding user-related perils, there is a very strong correlation between the number
of commits made by a registered user and: a) the number of projects a user contributes to
(0.72), and b) the days of activity of the user (number of days from earliest to latest commit,
0.80); the number of commits is also inversely correlated with the number of days since the
latest contribution (-0.38). Similarly, there is a very strong correlation between the number
of days a developer has been active and the number of projects they own (0.77), and an
inverse correlation with the days since latest contribution (0.37).
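The following sketch shows how such rank correlations can be computed; it assumes the per-user measures are available as parallel lists and uses scipy's spearmanr.

```python
# Hedged sketch of the rank-correlation analysis above, assuming parallel per-user lists.
from scipy.stats import spearmanr

def user_activity_correlations(n_commits, n_projects, days_active):
    """Spearman correlations between a user's commit count and other activity measures."""
    return {
        "commits_vs_projects": spearmanr(n_commits, n_projects),      # reported above as ~0.72
        "commits_vs_days_active": spearmanr(n_commits, days_active),  # reported above as ~0.80
    }
```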
Given that many perils seem to relate to each other, we suggest the same guidelines to
choose active projects for research:
– Have many committers, and
– Had recent activity, and
– Have a large number of commits.
Alternatively, one could choose only projects with a large number of recent pull requests,
but this will result in fewer candidates.
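As a concrete illustration of these guidelines, the filter below encodes them with placeholder thresholds; the cut-off values are illustrative assumptions and not values prescribed by this study.

```python
# Hedged sketch of the project-screening guidelines above; thresholds are placeholders.
def looks_like_active_project(n_committers, days_since_last_activity, n_commits,
                              min_committers=3, max_idle_days=30, min_commits=100):
    return (n_committers >= min_committers                  # many committers
            and days_since_last_activity <= max_idle_days   # recent activity
            and n_commits >= min_commits)                   # a large number of commits
```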
Regarding users, the overlap is more straightforward. It appears that the more users
use GitHub, the more likely they are to create their own repositories for their own use.
Therefore, choose registered users that have been active recently, that contribute to several
projects (including many that are personal) and that have a large number of commits. In our
experience, bot users tend to commit to a single repository.
Ultimately, any research that tries to use GitHub as a data source should consider these
perils in the context of the research questions that it tries to address.
5 An Analysis of the MSR 2014 Mining Challenge
In the MSR 2014 Mining Challenge (Baysal and Gousios 2014), researchers were given
a subset of the GHTorrent dataset to analyze and derive new insights. The competition
resulted in nine accepted papers. In view of the perils presented, we analyzed the dataset
and accepted papers to determine if any of our perils might have posed potential threats to
validity to the results presented in these papers.
5.1 The Dataset
The organizers of the Mining Challenge decided that the entire GHTorrent dataset was too
large. Instead, 90 repositories were selected as follows: for each of the top 10 programming
languages (including JavaScript, Java and other popular languages), the top 10 most active
repositories in terms of pull requests processed in 2013 (up to September 2013) were
initially selected. The original selection was then hand-cleaned to remove repositories that
were not software development ones.
Below we discuss how the identified perils could have an impact on insights derived from
this dataset.
Peril I A repository is not necessarily a project The data contains 90 projects, but
unfortunately, the schema of the dataset refers to a repository as a “project” and it does
not include an entity for “project”. Projects must be inferred by recursively traversing the
forked from field of the “projects” table to identify the project a repository belongs
to. However, one repository could not be linked to its project (xphere-forks/symfony).
Peril II Most projects have low activity There are 3 projects with less than 100 com-
mits, while 3 have more than 40,000 commits; 10 repositories account for 50 % of the
commits.
Peril III Most projects are inactive The impact of this peril is small: only 4 repositories
were inactive in the last 6 months. In contrast, 65 repositories had been active in the last
week and 71 in the 2 weeks before.
Peril IV Many projects are not software development The impact of this peril is also
small: one repository was a personal Website (vinc/vinc.cc), while another was a
book on R programming (mavam/stat-cookbook). Another was a collection of icons
(FontAwesome/Font-Awesome).
Peril V Most projects are personal One of the projects only had one committer
(vinc/vinc.cc), and one had three committers (mavam/stat-cookbook). Again, the impact
of this peril is small.
Peril VI Many active projects do not use GitHub exclusively jquery/jquery,
mono/mono, ServiceStack/ServiceStack, django/django, clojure/clojure do not use
GitHub for issues. For example, Clojure uses Jira—this is where it suggests non-regular
contributors should submit patches instead of GitHub. jQuery hosts its own bug tracking
system at bugs.jquery.com. Mono uses Bugzilla (www.mono-project.com/Bugs). At least
one repository is a mirror (TTimo/doom3.gpl) with incomplete development history; it
was imported from a release of the game.
Peril VII Few projects use pull requests In this dataset, the majority of the projects
use pull requests: 88 of the 90 repositories. The median number of pull requests
per project is 393. However, 3 repositories account for 34 % of the pull requests
(mxcl/homebrew, rails/rails and symfony/symfony), and 7 repositories account for
50 %.
Peril VIII Merges only track successful code This is an overarching peril inherent in
GitHub data due to the way GitHub reports merged pull requests and the commits they
contain.
Peril IX Many merged pull requests appear as non-merged Some projects do not
close many of their pull requests using the GitHub "Merge" button. Instead, they do it
via commits in their local repositories. One project, mxcl/homebrew, poses an impor-
tant threat to validity if we assume pull requests that are not marked-as-merged were not
actually merged. This project is the one with the most pull requests (it accounts for 17 %
of pull requests in the dataset) and only 0.9 % of them are marked as "merged". However,
as part of their development process they close pull requests via the log of a commit.
We found (Gousios and Zaidman 2014a) that 52 % of pull requests had actually been
merged (6,753 pull requests in the MSR dataset were actually merged). django/django
also does not always close pull requests via the "Merge" button. In that case, we
found that 819 pull requests had been merged (41 % more, for a total of 63 %, com-
pared to 23 % found in the MSR dataset). In Bukkit/CraftBukkit, 23.8 % of pull requests (274)
were merged via commits only. If we were to include these 7,846 pull requests as
merged, the percentage of merged pull requests grows from 45 % to 55 % for the entire
dataset.
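A heuristic in this spirit, similar in intent to (but not the exact implementation of) Gousios and Zaidman (2014a), can be sketched as a scan of commit messages for references that close or merge a pull request; the regular expression below is an illustrative assumption.

```python
# Hedged sketch: spot pull requests closed via commit logs rather than the "Merge" button.
# The pattern is illustrative and not the exact heuristic used in the cited work.
import re

PR_REFERENCE = re.compile(
    r"(?:close[sd]?|fixe?[sd]?|merge[sd]?)\s+(?:pull request\s+)?#(\d+)", re.IGNORECASE)

def pull_requests_referenced(commit_messages):
    """Return the set of pull request numbers referenced as closed/fixed/merged in commit logs."""
    return {int(num) for msg in commit_messages for num in PR_REFERENCE.findall(msg)}
```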
Peril X Not all activity is due to registered users The impact of this peril is marginal:
only 1.1 % of committers in the dataset are not registered users.
Peril XI Only the user's public activity is visible This is an overarching peril inherent
in GitHub datasets due to the fact that they contain data from public repositories.
Peril XII GitHub’s API does not expose all data Some of the projects in the dataset
have moved and are no longer active. The root repository of homebrew was renamed
from mxcl/homebrew to homebrew/homebrew, although all the information was moved
and is still available. Both mangos/MaNGOS and TTimo/doom3.gpl are dead. In the
case of mangos/MaNGOS, the main trunk of the repository has been scrubbed of source
code and moved to cmangos/mangos-classic, but the pull requests and issues of the old
project were not moved to the new one. The author of TTimo/doom3.gpl keeps the repos-
itory for archival purposes; its development appears to have moved to dhewm/dhewm3.
Other repositories have moved: facebook/php-sdk moved to facebook/facebook-php-sdk
and no longer exists (it was not renamed). The watchers table of the dataset contains a
field created_at, but this field corresponds to the date on which the watcher was discov-
ered by GHTorrent, not the date the person became a watcher (this information is not
exposed by GitHub's API, as described in Peril XII GitHub's API does not expose all
data).
Peril XIII GitHub is continuously evolving In the MSR dataset, watchers correspond to
today’s “starrings”. This is because “watchers” are now a subscription mechanism, as
explained in Section 4.8.
5.2 The Papers
In light of the issues we have outlined above, we can illustrate the use of the perils for
identifying potential threats to validity or offering alternative explanations. We use selected
papers published in the MSR'14 Mining Challenge as examples and comment on some of
the assumptions these papers made, contrasting them to the perils discussed earlier. We note
that we are not making claims as to the validity of the results in the papers since we have
not replicated the studies; we leave that for future studies that attempt to replicate or extend
those studies.
In Sheoran et al. (2014), we looked at watchers on GitHub and assessed if and when
watchers become contributors and what types of contributions watchers make to the repos-
itories they watch. One of the research questions involved investigating how long it takes
for watchers to contribute to the repository, in any form; this was done using the
created_at field of Watchers. As mentioned in the previous section, the created_at field cor-
responds to the date GHTorrent recognized a user as a watcher, not the date the user became
a watcher. If GHTorrent did not capture these watcher events as they occurred, the differ-
ence in timings can have an effect on the time the study concludes it takes for a watcher to
contribute to a project.
Rahman and Roy (2014)analyzedGitHubpullrequests.Thestudycomparedsuccess-
ful and unsuccessful pull requests against factors such as discussion items, pull request
history, and selected project and developer characteristics. The goal of the comparative
analysis was to identify factors that play a role in the success or failure of pull requests.
The analysis considered merged pull requests as successful, while marked-as-non-merged
as unsuccessful. As described above (Peril IX Many merged pull requests appear as non-
merged applied to the MSR Data set), a significant number of pull requests are merged but
not marked-as-merged (e.g., django/django, Bukkit/CraftBukkit and mxcl/homebrew). This
issue could have impacted the results in the paper; a different number of successful and
unsuccessful pull requests can lead to different conclusions about the influence of the iden-
tified factors. For example, the peril could explain the outliers in Figs. 5 and 6 and also
why languages like Ruby (mxcl/homebrew) and Java (Bukkit/CraftBukkit) have a low ratio
of marked-as-non-merged pull requests.
Padhye et al. (2014) also analyzed GitHub pull requests, operating under the same
assumption that pull requests not marked as merged were non-merged. The study dis-
tinguished between core, external and mutant commits based on whether they were
merged in the base repository. Correspondingly, the study labeled committers according
to their commits to identify communities and characterize them. As we noted above,
the actual proportion of merged pull requests in the dataset changes if we count
pull requests that are not marked-as-merged as merged, going up from 45 % to at
least 55 %. This fact could have an effect on the set of commits that are marked
as mutant in the study and potentially also reduce the number of committers labeled
as mutant.
Matragkas et al. (2014) analyzed user activity in projects to cluster
users into roles, investigating the structure of the ecosystem of open source communities on
GitHub. In the study each repository is considered and referred to as a “project”, regardless
of whether it is a base repository or a fork of one (see Table 1 in Matragkas et al. (2014)).
The rationale behind this choice is that it is hard to determine if work done in a fork is
collaboration with other repositories or independent work that will not be contributed to
other repositories; hence it is safer to consider them as separate (the authors clarified this
view in private communication). Some forks will indeed
not contribute back to the base repository, but it is difficult to determine if they will not.
Peril I A repository is not necessarily a project could influence the results of the study,
since considering some or all forks as part of a larger project would likely create larger
clusters. Furthermore, the analysis counted the number of issues and issue comments per
user. Under Peril VI Many active projects do not use GitHub exclusively, the size of the
clusters may be underestimated if projects are using an issue tracker that is external to
GitHub.
The examples above demonstrate the potential threats to validity that the perils pose. This
does not mean that the studies we critiqued (or others that use the same data) are flawed.
Rather, it highlights that there are issues that need consideration when processing the data
and drawing conclusions from it, and that need to be acknowledged in the discussion of a
study’s threats to validity.
6 Comparing Perils Between SourceForge and GitHub
Publicly available repositories are attractive data sources for researchers, but not without
perils. There have been previous studies taking a critical look into the quantity and quality
of data on public sources, with the most notable example being the identified perils and
pitfalls in mining data from projects hosted on SourceForge (Howison and Crowston 2004).
These perils related to three areas: data collection, interpretation and analysis, and research
design. In Table 5 we highlight the similarity of our conclusions.
In this paper we have not concerned ourselves with data collection perils because we used
an already existing dataset, GHTorrent. In contrast, Howison & Crowston constructed their
own dataset and, therefore, came across challenges and tradeoffs on how to mine the data in
the first place, before analyzing it. The same applies to other studies that have used Source-
Forge data, e.g. (Weiss 2005). The assumptions and heuristics in GHTorrent are described
and assessed in previous work (Gousios 2013; Gousios and Zaidman 2014b).
Regarding the interpretation and analysis of data mined from SourceForge, Howison and
Crowston recognized two challenging sub-areas: cleaning dirty data, and skewed data. In
cleaning dirty data, similar to our conclusions, the authors observed that manual checking
is essential since there is a lot of anonymous data (similar to our Peril X: Not all activity
is due to registered users) while for many projects SourceForge may be the “repository of
record” but not the “repository of use” (similar to our Peril VI: Many active projects do not
use GitHub exclusively). Our Perils IV and V (Many projects are not software development
and Most projects are personal) also relate to cleaning dirty data, since repositories would
require manual inspection to categorize them properly.
Another peril in interpreting data from SourceForge (Howison and Crowston 2004,
Weiss 2005) is how skewed the data is, which has also been our observation after reviewing
GitHub data. Researchers need to be conscious of the data skewness and the fact that they
will need to use screening variables to get data that is relevant to and representative of the
properties they want to study, but also that the use of screening variables will significantly
reduce the number of repositories and projects studied. We made the same observation
relative to five of our perils too, noted in Table 5.
Finally, Howison and Crowston suggested caution to researchers designing studies using
SourceForge data; the website provides a few easy-to-compute variables calculated for
projects (the authors call them “ready-made”), but researchers use them to draw conclusions
for complex theoretical constructs. This is a validity threat in itself, complicated by the fact
that different literature areas may use the same variables as proxies for different concepts.
The same caution applies in the case of GitHub data too. We also concluded that the sim-
plicity of metrics may hide dangers; the number of commits in a pull request, for example,
can be a simple metric to calculate but, given that GitHub does not report the intermediate
commits that led to a merged pull request, could be a problematic proxy for the effort that
was put in a successful merge. Perils XI, XII, and XIII also need to be taken into account in
any research design so that conclusions do not include misinterpretations.
Surprisingly, the perils identified over ten years ago about the interpretation of data
mined from public repositories and the research design of studies that build on that data are
equally relevant today. Even though the data sources may have changed, researchers still
have to be careful in how they extract data, how they analyze and interpret it, and how they
make conclusions about software development.
Table 5 Comparison between perils identified in (Howison and Crowston 2004) and our study

SourceForge perilous areas        GitHub perils
Data collection
  - Spidering                     Data collection and summarizing relates to the
  - Parsing                       GHTorrent dataset, explained in (Gousios 2013)
  - Summarizing                   and (Gousios and Zaidman 2014b)
  - Testing
Interpretation
  - Cleaning dirty data           P IV, P V, P VI, P X
  - Skewed data                   P I, P II, P III, P VII, P IX
Research design                   P VIII, P XI, P XII, P XIII
7 Threats to Validity
Our study has several limitations and threats to validity. The exploratory survey had a rel-
atively low number of participants from a biased and self-selected population. While it
motivated us to investigate the perils in more detail, we can draw no further conclusions
from it. Our manual exploration of 434 projects illustrates the variety of uses of GitHub, but
we do not generalize our results to other projects. Further, for each repository, the manual
analysis was performed by only one person, introducing a possible risk of unreliable results.
However, the set of 434 projects was split amongst two coders, who each independently
identified the same categories of projects within their set of repositories.
This study was based on analysis of the GHTorrent dataset, and therefore, the reliability
of our work depends partly on the reliability of the GHTorrent dataset. GHTorrent is a best-
effort approach to collect data from the GitHub API, and previous work (Gousios 2013)
analyzed the reasons why GHTorrent cannot be a full replica of GitHub. The accuracy of the
heuristics to detect pull requests merged outside GitHub is detailed in Gousios and Zaidman
(2014b).
We mitigated these threats by triangulating quantitative with qualitative data from
surveys, interviews and manual inspections.
To ameliorate these threats, we provide a replication package for our study. The package
contains the results of our manual analysis as well as other data and scripts used in this
work. The GHTorrent data is publicly available at http://ghtorrent.org/downloads.html.
The replication package of this paper is available at http://turingmachine.org/gitMiningPerils2014.
8 Discussion & Conclusions
Mined data does not always tell the whole story. This has been a finding in studies that
assess the quality and completeness of data mined from project archives, but also in rare
cases where the mined data is compared to qualitative evidence (Aranda and Venolia 2009).
In this empirical study, we set out to critically look at the publicly available data com-
ing from GitHub and assess whether it is suitable as a data source for software engineering
studies. The data can be readily used to report on several project properties. If a researcher
seeks to see trends of programming language use, type of tools built, number and size of
contributions, and so on, the publicly available data can give solid information about the
descriptive characteristics of the GitHub environment. However, using GitHub to synthesize
information to draw conclusions about more abstract constructs needs some considera-
tion. We presented evidence of how assumptions about repository activity and contents,
as well as development and collaboration practices, can be challenged. We recommend
that researchers interested in performing studies using GitHub data first assess its fit and
then target the data that can really provide information towards answering their research
questions.
Some potential perils manifest relative to the repository activity. One of the biggest
threats to validity to any study that uses GitHub data indiscriminately is the bias towards
personal use. While many repositories are being actively developed on GitHub, most of
them are simply personal, inactive repositories. Therefore, one of the most important ques-
tions to consider when using GitHub data is what type of repository one’s study needs and
to then sample suitable repositories accordingly.
While we believe there to be a need for research on the identification and automatic
classification of GitHub projects according to their purpose, we suggest a rule of thumb.
In our own experience, the best way to identify active software development projects is
to consider projects that, during a recent time period, had a good balance of number of
commits and pull requests, and have a number of committers and authors larger than 2. The
number of issues can also be used as an indicator, but not all active projects use GitHub's
issue tracker (for example, several Mozilla projects: https://github.com/mozilla). Outliers,
especially those with a very large
number of commits per committer, point towards automatic bots.
When looking at any specific project, researchers need to keep in mind that other repos-
itories might exist in the project—some of them working towards a common goal and some
possibly being independent versions that will never contribute back. Based on our work,
we believe a simple way to determine whether a repository actively works with another
might be to identify if commits have flowed from one to the other in both directions, but this
strategy requires further validation.
Other potential perils manifest relative to the users and their characteristics. User actions
might be taking place elsewhere and not be recorded as activity on GitHub, and due to non-
unification of email addresses, not all of a user’s activity is necessarily attributed to them.
Both facts can distort the image researchers form of user activity and, therefore, potentially
influence their conclusions. It is important to look more closely at the users’ characteristics
in light of the presented perils before drawing inferences and/or be aware of the potential
threats to validity.
Apart from an exciting data source, GitHub is also an evolving entity. Its range of features
and the integration between them changes frequently, meaning that the API also changes.
This cannot be considered a flaw or attributed to GitHub as such; it simply puts more respon-
sibility on researchers to remain aware of changes and factor them in their analysis. The
same applies to information that is part of GitHub’s functionality, yet reported partially or
not at all.
One last conclusion point is to advocate complementing quantitative studies with quali-
tative data. By all evidence we have presented, there are shortcomings in the data that can
pose a danger to the conclusions of any rigorous study. Especially given Peril XII GitHub's
API does not expose all data, researchers may not even be able to have direct access to infor-
mation that could inform their interpretation of project or user activity. Getting additional
qualitative input regarding projects and users can give more confidence in the assumptions
that researchers make. For example, surveys that solicit comments from participants could
be a solid information source.
We showcased the potential impact of the perils in a familiar and appropriate setting: the
MSR 2014 Mining Challenge. The take-away message is that perils in the data equals perils
in assumptions equals perils in results. We provided examples from our own and others’
studies in the hope that researchers that continue to mine GitHub will be cautious of the
underlying assumptions and informed about potential validity threats.
GitHub is a remarkable resource. It continues to grow at an accelerated rate and its users
are finding innovative ways to exploit it. Nevertheless, software development is flourishing
in the open within GitHub's infrastructure, and GitHub will continue to be an attractive source to
mine for research in software engineering.
Acknowledgments We would like to thank the authors of Padhye et al. (2014) and Matragkas et al. (2014)
for their valuable feedback regarding the evaluation of the impact of these perils on their research. We would
also like to thank Margaret-Anne Storey for her invaluable help in the development of this paper.
References
Aranda J, Venolia G (2009) The secret life of bugs: Going past the errors and omissions in software
repositories. In: Proceedings of the 31st international conference on software engineering, pp 298–
308
Bacchelli A, Bird C (2013) Expectations, outcomes, and challenges of modern code review. In: Proceedings
of the international conference on software engineering, ICSE '13, pp 712–721
Bachmann A, Bird C, Rahman F, Devanbu P, Bernstein A (2010) The missing links: bugs and bug-fix com-
mits. In: Proceedings of the 18th ACM SIGSOFT international symposium on Foundations of software
engineering, pp 97–106
Baysal O, Gousios G (2014) The MSR’14 Mining Challenge., http://2014.msrconf.org/challenge.php
Begel A, Bosch J, Storey MA (2013) Social networking meets software development: perspectives from
github, msdn, stack exchange, and topcoder. Software, IEEE 30(1):52–66
Bird C, Bachmann A, Aune E, Duffy J, Bernstein A et al. (2009a) Fair and balanced?: bias in bug-fix datasets.
In: Proceedings of the the symposium on the foundations of software engineering, pp 121–130
Bird C, Rigby PC, Barr ET, Hamilton DJ, German DM, Devanbu P (2009b) The promises and perils of
mining git. In: Mining software repositories, (MSR’09). IEEE, pp 1–10
Bissyande TF, Lo D, Jiang L, Reveillere L, Klein J, Le Traon Y (2013) Got issues? who cares about it? a
large scale investigation of issue trackers from github. In: 2013 IEEE 24th international symposium on
software reliability engineering (ISSRE). IEEE, pp 188–197
Corbin J, Strauss A (2008) Basics of qualitative research: Techniques and procedures for developing
grounded theory. Sage
Dabbish L, Stuart C, Tsay J, Herbsleb J (2012) Social coding in GitHub: transparency and collaboration
in an open software repository. In: Proceedings conference on computer supported cooperative work,
pp 1277–1286
Finley K (2011) Github Has Surpassed Sourceforge and Google Code in Popularity., http://readwrite.com/
2011/06/02/github-has-passed-sourceforge
Gousios G (2013) The GHTorrent dataset and tool suite. In: Proceedings of the 10th Conference on mining
software repositories, MSR ’13, pp 233–236. http://dl.acm.org/citation.cfm?id=2487085.2487132
Gousios G, Spinellis D (2012) GHTorrent: GitHub’s data from a firehose. In: MSR ’12: proceedings of the
9th working conference on mining software repositories, pp 12–21
Gousios G, Zaidman A (2014a) A dataset for pull-based development research. In: Proceedings of the 11th
working conference on mining software repositories, MSR 2014, pp 368–371
Gousios G, Zaidman A (2014b) A dataset for pull-based development research. In: Proceedings of the 11th
working conference on mining software repositories, MSR 2014, pp 368–371
Gousios G, Pinzger M, van Deursen A (2014) An exploratory study of the pull-based software development model.
In: Proceedings of the 36th international conference on software engineering, ICSE 2014, pp 345–
355
Gousios G, Zaidman A, Storey MA, van Deursen A (2015) Work practices and challenges in pull-based
development: The integrator's perspective. In: Proceedings of the 37th international conference on software
engineering, ICSE 2015, to appear
Grigorik I (2012) The Github archive., http://www.githubarchive.org/
Howison J, Crowston K (2004) The perils and pitfalls of mining sourceforge. In: Proceedings of the
international workshop on mining software repositories, pp 7–11
Kalliamvakou E, Damian D, Singer L, German DM (2014a) The code-centric collaboration perspective:
evidence from GitHub. Technical Report DCS-352-IR, University of Victoria
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014b) The promises and perils
of mining github. In: Proceedings of the 11th working conference on mining software repositories, MSR
2014, pp 92–101
Kochhar PS, Bissyandé TF, Lo D, Jiang L (2013) Adoption of software testing in open source projects: a
preliminary study on 50,000 projects. In: 2013 17th European conference on software maintenance and
reengineering (CSMR). IEEE, pp 353–356
Marlow J, Dabbish L, Herbsleb J (2013) Impression formation in online peer production: activity traces and
personal profiles in github. In: Proceedings of conference computer supported cooperative work, pp 117–
128
Matragkas N, Williams JR, Kolovos DS, Paige RF (2014) Analysing the ’biodiversity’ of open source ecosys-
tems: The github case. In: Proceedings of the 11th working conference on mining software repositories,
MSR 2014, pp 356–359
McDonald N, Goggins S (2013) Performance and participation in open source software on github. In: CHI’13
extended abstracts on human factors in computing systems. ACM, pp 139–144
Neath K (2012) Notifications & stars., https://github.com/blog/1204-notifications-stars
Nguyen TH, Adams B, Hassan AE (2010) A case study of bias in bug-fix datasets. In: 2010 17th working
conference on reverse engineering (WCRE). IEEE, pp 259–268
Padhye R, Mani S, Sinha VS (2014) A Study of External Community Contribution to Open-source Projects
on GitHub. In: Proceedings of the 11th working conference on mining software repositories, MSR 2014,
pp 332–335
Pham R, Singer L, Liskin O, Figueira Filho F, Schneider K (2013) Creating a shared understanding of testing
culture on a social coding site. In: Proceedings of the international conference on software engineering, ICSE
’13, pp 112–121
Rahman F, Posnett D, Herraiz I, Devanbu P (2013) Sample size vs. bias in defect prediction. In: Proceedings
of the 2013 9th joint meeting on foundations of software engineering, pp 147–157
Rahman MM, Roy CK (2014) An insight into the pull requests of GitHub. In: Proceedings of the 11th
working conference on mining software repositories, MSR 2014, pp 364–367
Rainer A, Gale S (2005) Evaluating the quality and quantity of data on open source software projects.
In: Proceedings of the first international conference on open source systems (OSS 2005), pp 29–
36
Rigby PC, Bird C (2013) Convergent contemporary software peer review practices. In: Proceedings of the
2013 9th joint meeting on foundations of software engineering, ESEC/FSE 2013, pp 202–212
Rigby PC, German DM, Storey MA (2008) Open source software peer review practices: a case study of the
Apache server. In: Proceedings of the 30th international conferences on software engineering, ICSE ’08,
pp 541–550
Sheoran J, Blincoe K, Kalliamvakou E, Damian D, Ell J (2014) Understanding "watchers" on github. In:
Proceedings of the 11th working conference on mining software repositories, MSR 2014, pp 336–339
Takhteyev Y, Hilts A (2010) Investigating the geography of open source software through github. http://
takhteyev.org/papers/Takhteyev-Hilts-2010.pdf
Thung F, Bissyande T, Lo D, Jiang L (2013) Network structure of social coding in GitHub. In:
17th European conference on software maintenance and reengineering (CSMR), pp 323–326.
doi:10.1109/CSMR.2013.41
Tsay J, Dabbish L, Herbsleb J (2014) Influence of social and technical factors for evaluating contribution
in github. In: Proceedings of the 36th international conference on software engineering, ICSE 2014,
pp 356–366
Tsay JT, Dabbish L, Herbsleb J (2012) Social media and success in open source projects. In: Proceedings of
computer supported cooperative work companion, pp 223–226
Wagstrom P, Jergensen C, Sarma A (2013) A network of rails: a graph dataset of ruby on rails and associated
projects. In: Proceedings of the 10th international work conferences on mining software repositories,
pp 229–232
Weiss D (2005) Quantitative analysis of open source projects on sourceforge. In: Proceedings of the first
international conference on open source systems (OSS 2005), pp 140–147
Eirini Kalliamvakou is a PhD candidate at the Department of Computer Science in the University of Victoria
in Canada. She is a research assistant in the Software Engineering and Global interAction Lab (SEGAL),
where she is researching collaborative software development. Her interests focus on the development and
collaboration practices and tools that teams use in software companies, and the adoption of practices inspired
by open source software projects. You can contact Eirini at [email protected].
Dr. Georgios Gousios is an assistant professor at the Radboud University in Nijmegen, the Netherlands. His
research interests include software engineering, software analytics, software repository mining and program-
ming languages. He is the main author and maintainer of the GHTorrent dataset, the Alitheia Core software
analysis platform and various OSS tools targeted to software analytics. Georgios received a PhD from the
Athens University of Economics and Business. Contact him at @gousiosg (Twitter, Github, Gmail) or at
http://gousios.gr.
Kelly Blincoe received a BE in Computer Engineering from Villanova University in 2004, an MS in Infor-
mation Science from Pennsylvania State University in 2008, and an MS and Ph.D in Computer Science
from Drexel University in 2011 and 2014 respectively. She is currently a Lecturer at Auckland University of
Technology. Her most recent position was Postdoctoral Fellow at the University of Victoria in the Software
Engineering Global interAction Lab. She previously worked at Lockheed Martin as a Proposal Manager and
Software Engineer. Her research interests lie in collaborative software engineering and computer-supported
cooperative work.
Leif Singer is a researcher and consultant based in Hannover, Germany. He got his PhD in computer science
from the University of Hannover and is an Affiliate Researcher with the University of Victoria in Canada.
At the New York-based startup iDoneThis.com, he uses both qualitative and quantitative research to steer
product development for a collaboration tool used by companies such as Twitter, Airbnb, and Heroku. He
can be contacted via Twitter (@LSinger) or his website (http://leif.me).
Daniel M. German is professor of Computer Science at the University of Victoria. He completed his PhD at
the University of Waterloo in 2000. His work spans the areas of mining software repositories, open source,
and intellectual property in software engineering. For more information, visit http://turingmachine.org.
Daniela Damian is a Professor of Software Engineering in University of Victoria’s Department of Com-
puter Science, where she leads research in the Software Engineering Global interAction Laboratory (SEGAL,
thesegalgroup.org). Her research interests include Empirical Software Engineering, Requirements Engineer-
ing, Computer-Supported Cooperative Work. Her recent work has studied the developers’ socio-technical
coordination in large, geographically distributed software projects, as well as stakeholder management in
large software ecosystems.