TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial…


Notes from Industry

Detecting Potential Bad Actors in GitHub

Chris Tokita
Published in TDS Archive
8 min read · Sep 8, 2021


Source: Unsplash

Using data to create an understanding of an author’s behavior

Figure 1 — An example of the behavior data we can extract from an author’s GitHub activity. Each panel compares different dimensions of the GitHub commit data. Each point is a commit that the author pushed to a project.
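To make the idea of per-commit behavior data concrete, here is a minimal sketch of turning raw commit records into numeric features. The exact features used in the analysis are not listed in this excerpt, so the record layout (`timestamp`, `additions`, `deletions`) and the function name `commit_features` are assumptions for illustration only.

```python
from datetime import datetime


def commit_features(commit):
    """Turn one commit record into a small numeric feature dict.

    `commit` is a hypothetical dict with an ISO-8601 "timestamp" string and
    integer "additions" / "deletions" counts; these field names are
    assumptions, not a real API response format.
    """
    ts = datetime.fromisoformat(commit["timestamp"])
    return {
        "day_of_week": ts.weekday(),              # 0 = Monday ... 6 = Sunday
        "hour_of_day": ts.hour + ts.minute / 60,  # fractional hour, 0 to <24
        "lines_changed": commit["additions"] + commit["deletions"],
    }


# Two made-up commits: a weekday morning one and a late Saturday one.
commits = [
    {"timestamp": "2021-09-06T09:30:00", "additions": 12, "deletions": 3},
    {"timestamp": "2021-09-11T23:45:00", "additions": 400, "deletions": 0},
]
features = [commit_features(c) for c in commits]
```

Each resulting dict corresponds to one point in a plot like Figure 1, with any pair of features serving as the two axes.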

Teaching a machine to find unusual author behavior

Figure 2 — A visual explanation of DBSCAN, a machine learning clustering algorithm. Points within a certain distance of each other are considered neighbors. Points with enough neighbors are considered a cluster’s core. Points with few neighbors mark a cluster’s edge. Points without neighbors are outliers. (Figure adapted from Wikipedia entry on DBSCAN.)
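The neighbor, core, edge, and outlier rules in the caption can be sketched in a few lines of plain Python. This is a from-scratch illustration of DBSCAN, not the code used in the analysis; the parameter names `eps` (the neighbor distance) and `min_samples` (how many neighbors make a core point, counting the point itself) follow common convention, and the sample points are invented.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1).

    A point is a core point when at least `min_samples` points, itself
    included, lie within distance `eps` of it.
    """
    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabeled later as an edge point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # edge point: reachable but not core
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is core, keep expanding
    return labels


# Two tight groups and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 30)]
labels = dbscan(points, eps=1.5, min_samples=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
```

Points labeled `NOISE` are the ones a detector built on this idea would flag as unusual. In practice a library implementation, such as the one in scikit-learn, would likely be used instead of hand-rolled code.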

Detecting unusual behavior automatically

Figure 3 — (Left) The author’s commits by day of week and time of day, with commits the model flags as unusual shown in orange. (Right) A PCA projection of the author’s commit data, which has many features, into two dimensions, making it easy to see the data clusters (shades of blue) and outliers (orange).
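For readers curious about what the PCA projection in the right panel involves, the following is a rough, standard-library-only sketch that projects rows of features onto their top two principal components using power iteration. It is an illustration under simplifying assumptions (small, well-separated variances; made-up data; the hypothetical function name `pca_2d`), and a library such as scikit-learn would normally do this job.

```python
def pca_2d(rows, iterations=200):
    """Project rows onto their top two principal components (pure Python)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    def top_eigenvector(matrix):
        v = [1.0 / (k + 1) for k in range(d)]  # arbitrary non-zero start
        for _ in range(iterations):
            w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0:
                break
            v = [x / norm for x in w]
        return v

    components = []
    for _ in range(2):
        v = top_eigenvector(cov)
        # Rayleigh quotient: the variance explained along v.
        eigenvalue = sum(
            v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
        )
        components.append(v)
        # Deflate: remove the variance already explained by v.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(r[k] * c[k] for k in range(d)) for c in components]
            for r in centered]


# Four made-up points in 3-D whose third feature never varies.
rows = [(-3, -1, 5), (-3, 1, 5), (3, -1, 5), (3, 1, 5)]
projected = pca_2d(rows)
```

Each row of `projected` is a pair of coordinates that can be scattered on a 2-D plot and colored by cluster label, as in the right panel.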

Detecting risky authors at scale

Figure 4 — Our author risk score combines the relative number of commits and the relative % of commits our ML model thinks are unusual. In the end, some authors really stand out from the crowd as potentially risky.
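The caption names the two ingredients of the risk score but not the formula, so the sketch below is only one plausible way to combine them. It assumes "relative" means "divided by the largest value among the authors being compared" and that the two quantities are multiplied; both choices, the function name `risk_scores`, and the sample authors are assumptions rather than the method behind Figure 4.

```python
def risk_scores(authors):
    """Return {author: score in [0, 1]} from commit and unusual-commit counts.

    `authors` maps a name to (total_commits, unusual_commits). Scaling by the
    maximum and multiplying the two relative quantities are assumptions of
    this sketch, not a documented formula.
    """
    fractions = {a: (u / n if n else 0.0) for a, (n, u) in authors.items()}
    max_commits = max((n for n, _ in authors.values()), default=0) or 1
    max_fraction = max(fractions.values(), default=0.0) or 1.0
    return {
        a: (n / max_commits) * (fractions[a] / max_fraction)
        for a, (n, _) in authors.items()
    }


# Hypothetical authors: (total commits, commits flagged as unusual).
authors = {"alice": (200, 4), "bob": (20, 10), "carol": (180, 60)}
scores = risk_scores(authors)
riskiest = max(scores, key=scores.get)
```

With a product, an author scores high only when they both commit a lot and have a large share of flagged commits, which matches the caption's point that a few authors stand out from the crowd.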

Moving forward



Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Written by Chris Tokita

computational {ecologist, social scientist} turned data scientist for social good • made in LA • educated: LAUSD, Yale, Princeton PhD
