I was horrified while working on the Open Hub when all of a sudden I couldn’t get to the site. I couldn’t get to any of the servers, nor could I get the site to render. I was working at home and VPN’d into the office, so I started checking other sites. NPR.org; nope. Facebook; nope. I quickly dropped off the VPN and tried OpenHub.net again. Nope. Darn. The other sites started coming through, but not the Open Hub. I was getting “Server under heavy load” or a 504 error. Not good.
I was able to get one of our IT DevOps engineers on chat. He didn’t see anything that looked like an overt attack, but the DB load was huge. During our conversation, I was able to get back into the VPN and on to our servers and the site started to slowly respond again.
Yes, the DB load was high. We upped the threshold of half the crawlers to try to get caught up with some of the outdated analyses. But we’ve done this in the past and it hasn’t pulled the site down. What was different?
I looked at various reporting tools and saw that we had more than twice our typical traffic and most of that was new account creation. Oh, that can’t be good. Not at all. Shortly after that, it was four times our regular traffic. Then five times.
Here we have a graph of account creation over the last year. Gray is the number of accounts created each day, blue is accumulated total and green is a smoothed average. The highlight shows when we introduced the ability to block account creation by domain and blocked outlook.com. Huge drop. We were very happy.
But look at what’s happened since the beginning of 2015 — massive increase in daily account creation.
Right now (and I just ran the queries), we have 660,784 accounts (5 were created since I started writing) that are not flagged as spammers. We have identified another 88,705 as spammers and suspect that over 200,000 additional accounts are spammers too. We talked about this nearly a year ago! What if it were as high as 2/3 of all accounts? 440,000 spammers and 220,000 legit users. It could even be over 3/4 of all accounts by some of our models. That’s over 500,000 spammers and maybe 150,000 valid users.
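To make that arithmetic concrete, here is a back-of-the-envelope sketch. It assumes the fractions apply to the 660,784 accounts not yet flagged as spam, which is what the rounded figures above imply; the `split` helper is a made-up name for illustration, not something from our codebase.

```python
# Figure quoted above: accounts not yet identified as spam.
UNFLAGGED = 660_784


def split(unflagged_total, spam_fraction):
    """Return (estimated_spam, estimated_legit) among the unflagged accounts."""
    spam = round(unflagged_total * spam_fraction)
    return spam, unflagged_total - spam


for label, fraction in (("2/3", 2 / 3), ("3/4", 3 / 4)):
    spam, legit = split(UNFLAGGED, fraction)
    print(f"{label} spam: about {spam:,} spam accounts, {legit:,} legitimate")
```

At exactly 3/4 the split lands just under 500,000, so the "over 500,000" figure corresponds to the models that put the share somewhat above 3/4.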
We have to admit it: we’ve built a spam farm and we have to do something about it. Here’s what we’re going to do: We are going to stop letting anyone create accounts on the Open Hub. Then we are going to send out email notifications to every user and ask them to re-verify their accounts. Users will have some generous period of time to re-verify their accounts. Those that don’t will have their account flagged as a spam account. We’ll hold that spam account for a generous period of time and if we don’t hear from the user directly to restore the account, it will be deleted. We will roll this process into all new account creation with tighter timelines when we re-enable account creation.
In the meantime, we are having impassioned conversations about how to make the Open Hub utterly unattractive for spammers to create junk accounts while letting real folks create accounts. How to tell?
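The re-verification plan boils down to a small set of account states. Here is a rough sketch of how that might look. It is not our actual implementation: the two time windows are hypothetical placeholders (we have only committed to "generous"), and the function and state names are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical placeholders; the real periods have not been decided.
VERIFY_WINDOW = timedelta(days=60)  # time allowed to re-verify after the email
HOLD_WINDOW = timedelta(days=90)    # time a flagged account is kept before deletion


def account_status(notified_on, today, verified=False, restore_requested=False):
    """Classify an account under the re-verification plan.

    Simplification: a restore request is assumed to arrive before deletion.
    """
    if verified or restore_requested:
        return "active"
    flag_date = notified_on + VERIFY_WINDOW
    if today <= flag_date:
        return "pending"       # still inside the re-verification period
    if today <= flag_date + HOLD_WINDOW:
        return "flagged_spam"  # held, and restorable if the user contacts us
    return "deleted"


notified = date(2015, 3, 1)
for day in (date(2015, 3, 15), date(2015, 5, 15), date(2015, 8, 15)):
    print(day, account_status(notified, day))
```

The same function could serve new sign-ups later by shortening the two windows.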
Here are some things we’ve considered:
- Request regular account verification. If you haven’t been on the site in a while, then we’ll ask your indulgence in occasionally verifying you are still paying attention to the email address we have on file. The downside of this is that we will doubtlessly annoy folks. And there will be real people who don’t have access to the email address we have for them who will have their accounts flagged and possibly deleted. Or those who do have access to their email address but don’t want to be bothered by us, but still are members of the open source community.
- Build a reputation system and not let anyone provide a description or URL in their profile until they’ve earned enough credits on the site. However, we don’t want to require that folks come back regularly if they don’t want to. If they are making contributions to open source and working on the same projects, then why should we make them jump through more hoops?
- Require OAuth verification from an authoritative source like GitHub, StackExchange or even a verified Google account. The thinking is that the spammer would have to create and verify an identity on one of these sources before being able to create an account on the Open Hub and some of these sources have a financial interest in verified users as opposed to the Open Hub, which is a free service for the open source community.
- Let spammers create accounts, but let only the spammer see the edits they are making and don’t expose their account pages to anyone else. This is brilliant, but when do we decide that someone isn’t a spammer and let their account be publicly available? After some time? Weeks? Wouldn’t that really annoy you if you signed up and then we said, “You can edit your account, but no one will see it for a month”? And would that really deter spammers who are going for long-tail availability of their links anyway?
- Use a third party provider to verify accounts at the point of creation and drop them immediately if they are flagged as a spammer. We tried a prototype project with this approach using Akismet. We supplied the login, name, email, any URL, and description and were getting less than 30% accuracy on supplied values. It didn’t seem like enough.
- Require something only a real person could provide, like a credit card number. We really don’t believe this would fly. Here’s a free website that requires a credit card for you to join? It just reeks of thievery.
- Block by IP address. We’ve not seen a pattern that gives us confidence that we can correctly identify spam sources by IP. Plus, most of our traffic seems to be originating in the US, but no one on the Internet knows you’re a dog. As an example, my daughter, who is overseas at school, uses a free VPN to look like she’s in the US so she can watch Netflix. How difficult could it be for people to use such services for nefarious purposes?
- Block by domain. In the last 1,000 accounts created, there have been over 250 different domains. I think the spammer forces can out-domain us far too easily.
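For the block-by-domain idea, a quick tally like the one below is the kind of check that shows how spread out the sign-ups are. This is only a sketch: the sample addresses are made up, and `domain_counts` is a hypothetical helper, not one of our real queries.

```python
from collections import Counter


def domain_counts(emails):
    """Count sign-ups per email domain, ignoring case and malformed entries."""
    return Counter(
        email.rsplit("@", 1)[1].lower() for email in emails if "@" in email
    )


# Made-up sample standing in for the most recent sign-ups.
recent_signups = [
    "a@example.com",
    "b@Example.com",
    "c@example.org",
    "d@example.net",
    "not-an-email",
]

counts = domain_counts(recent_signups)
print(f"{len(counts)} distinct domains")
print(counts.most_common(2))
```

When no single domain accounts for more than a sliver of recent sign-ups, a blocklist has little to grab on to.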
Meanwhile, 12 more spam accounts have been created.
So, what’s the big deal? Who really cares if spammers create accounts? They come, create an account and never appear again.
Yes, we can try to clean up those spam accounts after the fact. The reason it upsets me is that I am regularly asked for quality data about the activity of the open source community and I want to provide accurate information. The noise is so great that I can’t see the signal. I don’t think that’s fair to you, the members of the open source community.
Hey, if you have an idea, please contact us at email@example.com.
4 more spammers.
And I’m sorry it’s taking so long to get your projects analyzed and updated. We’re working on it. Thanks for being a member of the open source community and making it to the end of this post.