In the last few months alone, attackers have uploaded more than 100,000 malicious copycat repositories to GitHub, and by some estimates more than a million.
The “repository confusion” scheme is simple: programmatically copy, trojanize, and re-upload existing repositories, hoping that developers download the wrong one.
GitHub’s automated security mechanisms appear to identify and remove most of these cheap fakes, but according to new research from Apiiro, many are still slipping through the cracks.
Anatomy of a repository confusion attack
Repository confusion works much like dependency confusion in package managers: it tricks unwitting developers into downloading nearly identical copies of the code they actually want, with malware silently added as a bonus.
This malware, in turn, gets incorporated into the software they build and creates downstream supply chain risks.
The key to the success of this latest campaign is automation. The attackers automatically clone, infect, and re-upload repositories at scale, pushing what researchers estimate to be millions of repositories in all. To add legitimacy, the automation forks each of these projects thousands of times and promotes them on various forums and web apps.
So when sleep-deprived or multitasking developers fork the copy instead of the original, they end up with a heavily obfuscated copy of BlackCap Grabber, which collects credentials from various apps, browser cookies, and other data, among other harmful functions.
GitHub, for its part, removed most of these malicious repositories within hours of their publication.
“However, automatic detection appears to miss many repositories, and the manually uploaded ones survive. Since the entire attack chain appears to be mostly automated at scale, the 1% that survive still amounts to thousands of malicious repositories,” Apiiro explained in its blog post.
A GitHub spokesperson said the organization is working to remove the malicious repositories. “GitHub is home to over 100 million developers using over 420 million repositories and is committed to providing a safe and secure platform for developers. We have teams dedicated to detecting, investigating and removing content and accounts that violate our Acceptable Use Policies. We use manual reviews and large-scale detections that use machine learning and are constantly evolving and adapting to adversary tactics,” the spokesperson said in a statement. “We also encourage customers and community members to report abuse and spam.”
Why GitHub is used for confusion attacks
GitHub by its nature offers some advantages for confusion attacks. “The ease of automatically generating accounts and repositories on GitHub and the like, using convenient APIs and easy-to-bypass rate limits, combined with the huge number of repositories to hide among, make it a perfect target for secretly infecting the software supply chain,” Apiiro wrote.
Shawn Loveland, chief operating officer of Resecurity, points out two additional issues. “The first is a trade-off between privacy and security: GitHub doesn’t scan repositories, but criminals can exploit them,” Loveland says. “And the other is simply the huge number of compromised GitHub accounts, which allows bad actors to access private repositories and then create duplicates of them.”
Cybercriminals can also copy public repositories without this additional access.
“I just looked in our database,” Loveland notes. “Nearly 100,000 PCs of users logged into GitHub have been infected with malware in the last 90 days.”
How can organizations protect themselves from the direct and downstream effects of a malicious GitHub repository? “Companies need to have a policy on using GitHub [that is] communicated with their employees and vendors, even if they themselves don’t use GitHub,” Loveland suggests, because even companies that don’t interact directly with third-party code rely on developers somewhere in their supply chain.
“Even a company that doesn’t have anyone using GitHub can still be a victim,” Loveland says.
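One way a policy like this could be enforced in practice is to check every clone URL against a list of vetted repositories before the code is pulled. The following is a minimal Python sketch under stated assumptions, not a method described by Apiiro, Resecurity, or GitHub: the `TRUSTED_REPOS` allowlist and the `repo_slug` and `is_trusted` helpers are hypothetical names, and the two entries in the list are only illustrative. It flags a copycat that reuses a project name under a different account, since only the exact owner/repository pair matches.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of vetted "owner/repo" pairs (illustrative entries only).
TRUSTED_REPOS = {
    "psf/requests",
    "pallets/flask",
}


def repo_slug(clone_url: str) -> str:
    """Return the lowercase 'owner/repo' slug from an HTTPS or SSH github.com URL."""
    url = clone_url.strip()
    if url.startswith("git@"):
        # SSH form: git@github.com:owner/repo.git
        host, _, path = url[len("git@"):].partition(":")
    else:
        parsed = urlparse(url)
        host, path = parsed.hostname or "", parsed.path
    if host.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {clone_url!r}")
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"no owner/repo in URL: {clone_url!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    # GitHub treats owner and repository names case-insensitively.
    return f"{owner}/{repo}".lower()


def is_trusted(clone_url: str) -> bool:
    """True only if the URL points at an allowlisted owner/repo pair."""
    try:
        return repo_slug(clone_url) in TRUSTED_REPOS
    except ValueError:
        return False


if __name__ == "__main__":
    for candidate in (
        "https://github.com/psf/requests.git",
        "https://github.com/psf-mirror/requests",  # same name, different owner
    ):
        verdict = "ok" if is_trusted(candidate) else "BLOCK"
        print(f"{verdict}: {candidate}")
```

A check like this only helps if the allowlist itself is built from verified sources, such as links on a project's official website, and it does nothing about a trusted repository that is later compromised.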