Oh dear, we’ve built a spammer’s heaven

I was horrified while working on the Open Hub when all of a sudden I couldn’t get to the site. I couldn’t get to any of the servers, nor could I get the site to render. I was working at home and VPN’d into the office, so I started checking other sites. NPR.org: nope. Facebook: nope. I quickly dropped off the VPN and tried OpenHub.net again. Nope. Darn. The other sites started coming through, but not the Open Hub. I was getting “Server under heavy load” or a 504 error. Not good.

I was able to get one of our IT Dev Ops on chat. He didn’t see anything that looked like an overt attack, but the DB load was huge. During our conversation, I was able to get back into the VPN and onto our servers, and the site started to slowly respond again.

Yes, the DB load was high. We upped the threshold of half the crawlers to try and get caught up with some of the outdated analyses. But we’ve done this in the past and it hasn’t pulled the site down. What was different?

I looked at various reporting tools and saw that we had more than twice our typical traffic and most of that was new account creation. Oh, that can’t be good.  Not at all. Shortly after that, it was four times our regular traffic.  Then five times.

Account Creation Last Year

Here we have a graph of account creation over the last year.  Gray is the number of accounts created each day, blue is accumulated total and green is a smoothed average.  The highlight shows when we introduced the ability to block account creation by domain and blocked outlook.com.  Huge drop.  We were very happy.

But look at what’s happened since the beginning of 2015 — massive increase in daily account creation.

Right now (and I just ran the queries), we have 660,784 accounts (5 were created since I started writing) that are not flagged as spammers. We have 88,705 that we have identified as spammers and suspect that over 200,000 additional accounts are spammers. We talked about this nearly a year ago! What if it were as high as 2/3 of all accounts? That would be 440,000 spammers and 220,000 legit users. It could even be over 3/4 of all accounts by some of our models. That’s over 500,000 spammers and maybe 150,000 valid users.
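To make the arithmetic concrete, here is a small Python sketch that splits the 660,784 accounts quoted above under the two spam fractions mentioned. The function name is made up for illustration, and the fractions are the post's what-if scenarios, not measured values.

```python
from fractions import Fraction

# Account total quoted in the post; the spam fractions are what-if scenarios.
ACCOUNTS = 660_784


def split_accounts(total, spam_fraction):
    """Return (spammers, legitimate) for a given spam fraction, in whole accounts."""
    spammers = round(total * spam_fraction)
    return spammers, total - spammers


for fraction in (Fraction(2, 3), Fraction(3, 4)):
    spammers, legit = split_accounts(ACCOUNTS, fraction)
    print(f"{fraction}: {spammers:,} spammers, {legit:,} legitimate")
```

At 2/3 this gives 440,523 spammers and 220,261 legitimate users, and at exactly 3/4 it gives 495,588 and 165,196, which lines up with the rounded figures above.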

We have to admit it: we’ve built a spam farm and we have to do something about it. Here’s what we’re going to do: We are going to stop letting anyone create accounts on the Open Hub. Then we are going to send out email notifications to every user and ask them to re-verify their accounts. Users will have some generous period of time to re-verify their accounts. Those that don’t will have their account flagged as a spam account. We’ll hold that spam account for a generous period of time and if we don’t hear from the user directly to restore the account, it will be deleted. We will roll this process into all new account creation with tighter timelines when we re-enable account creation.
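The re-verification flow described above can be sketched as a small status function. Everything named here is an assumption for illustration: the 30-day and 60-day windows, the `Account` fields and the status names are hypothetical, since the plan only commits to "generous" periods.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Hypothetical grace periods; the post does not give actual durations.
VERIFY_WINDOW = timedelta(days=30)
SPAM_HOLD_WINDOW = timedelta(days=60)


class Status(Enum):
    PENDING = "pending"    # notification sent, waiting for the user
    VERIFIED = "verified"  # user re-verified or asked for a restore
    SPAM = "spam"          # flagged as spam, held before deletion
    DELETED = "deleted"    # hold period expired with no contact


@dataclass
class Account:
    email: str
    notified_at: datetime
    verified_at: Optional[datetime] = None
    restore_requested: bool = False


def account_status(account, now):
    """Classify an account based on how long ago the notification went out."""
    if account.verified_at is not None or account.restore_requested:
        return Status.VERIFIED
    flag_time = account.notified_at + VERIFY_WINDOW
    if now < flag_time:
        return Status.PENDING
    if now < flag_time + SPAM_HOLD_WINDOW:
        return Status.SPAM
    return Status.DELETED
```

A nightly job could call `account_status` for every account and act on the result. The sketch treats a restore request as equivalent to verification, which is a simplification.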

In the meantime, we are having impassioned conversations about how to make the Open Hub utterly unattractive for spammers to create junk accounts while letting real folks create accounts. How to tell?

Here are some things we’ve considered:

  1. Request regular account verification.  If you haven’t been on the site in a while, then we’ll ask your indulgence in occasionally verifying you are still paying attention to the email address we have on file.  The downside of this is that we will doubtlessly annoy folks.  And there will be real people who don’t have access to the email address we have for them who will have their accounts flagged and possibly deleted.  Or those who do have access to their email address but don’t want to be bothered by us, but still are members of the open source community.
  2. Build a reputation system and not let anyone provide a description or URL in their profile until they’ve earned enough credits on the site.  However, we don’t want to require that folks come back regularly if they don’t want to.  If they are making contributions to open source and working on the same projects, then why should we make them jump through more hoops?
  3. Require OAuth verification from an authoritative source like GitHub, StackExchange or even a verified Google account. The thinking is that the spammer would have to create and verify an identity on one of these sources before being able to create an account on the Open Hub. Some of these sources also have a financial interest in verified users, unlike the Open Hub, which is a free service for the open source community.
  4. Let spammers create accounts, but let only the spammer see the edits they are making and don’t expose their account pages to anyone else.  This is brilliant, but when do we decide that someone isn’t a spammer and let their account be publicly available?  After some time?  Weeks?  Wouldn’t that really annoy you if you signed up and then we said, “You can edit your account, but no one will see it for a month”?  And would that really deter spammers who are going for long-tail availability of their links anyway?
  5. Use a third party provider to verify accounts at the point of creation and drop them immediately if they are flagged as a spammer.  We tried a prototype project with this approach using Akismet.  We supplied the login, name, email, any URL, and description and were getting less than 30% accuracy on supplied values.  It didn’t seem like enough.
  6. Require something only a real person could provide, like a credit card number.  We really don’t believe this would fly.  Here’s a free website that requires a credit card for you to join? It just reeks of thievery.
  7. Block by IP address. We’ve not seen a pattern that gives us confidence that we can correctly identify spam sources by IP. Plus, most of our traffic seems to be originating in the US, but no one on the Internet knows you’re a dog. By way of example, my daughter, who is overseas at school, uses a free VPN to look like she’s in the US so she can watch Netflix. How difficult could it be for people to use such services for nefarious purposes?
  8. Block by domain.  In the last 1,000 accounts created, there have been over 250 different domains.  I think the spammer forces can out-domain us far too easily.
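As a rough illustration of the domain problem in item 8, the sketch below counts email domains across a batch of sign-ups and reports what share of them a blocklist would catch. The function names and sample addresses are hypothetical and this is not Open Hub code.

```python
from collections import Counter


def email_domain(address):
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()


def domain_counts(addresses):
    """Count how many sign-ups came from each email domain."""
    return Counter(email_domain(a) for a in addresses)


def blocked_share(addresses, blocked_domains):
    """Fraction of sign-ups whose domain appears in the blocklist."""
    counts = domain_counts(addresses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    blocked = sum(n for domain, n in counts.items() if domain in blocked_domains)
    return blocked / total


# Made-up sample data for illustration only.
recent = ["a@outlook.com", "b@Outlook.com", "c@example.org", "d@example.net"]
print(len(domain_counts(recent)), "distinct domains")
print(blocked_share(recent, {"outlook.com"}))
```

With hundreds of distinct domains in every thousand sign-ups, the share caught by any fixed blocklist is likely to stay small, which is the concern raised in item 8.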

Meanwhile, 12 more spam accounts have been created.

So, what’s the big deal?  Who really cares if spammers create accounts?  They come, create an account and never appear again.

Yes, we can try to clean up those spam accounts after the fact.  The reason it upsets me is that I am regularly asked for quality data about the activity of the open source community and I want to provide accurate information.  The noise is so great that I can’t see the signal.  I don’t think that’s fair to you, the member of the open source community.

Hey, if you have an idea, please contact us at info@openhub.net.

4 more spammers.

And I’m sorry it’s taking so long to get your projects analyzed and updated.  We’re working on it.  Thanks for being a member of the open source community and making it to the end of this post.

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor
  • Hi Peter!

    I feel your pain.

    So they seem to be tearing through reCAPTCHA here:
    https://www.openhub.net/accounts/new

    The most efficient (while remaining simple) method I have seen is a mandatory field with a simple random site-specific question (ex.: what color is our logo?). You can even put the answer in JavaScript which can be copy-pasted by legitimate users.

    Best regards,

    M 😉

    • Hi Marc!

      Thank you for your tremendous support.

      Yes, the reCAPTCHA isn’t providing much protection. While there are automated services that spammers use to break reCAPTCHA protection (and offer SLAs on the order of 8 seconds per call), our analysis indicates that the majority of spammers are actual people.

      The scenario we envision is that there are rooms of low-paid individuals who use VPNs to appear to be located in the US and use throwaway email accounts to create accounts on open sites such as the Open Hub. They verify the email address, populate the account with the marketing message they were paid to create, and are never seen again. The individuals benefit by getting some wage in a country where low wages are the norm; the marketeers have a viable financial model to create the sites with links to their money sites and benefit from the scale of operation. And, since there are real people behind the account creation, they can answer CAPTCHAs and other human-targeted verification questions.

      It would seem that it costs about $300 per month for these marketeers to use automated tools and these personnel farms, and they can generate thousands of dollars of return on this investment.

      My personal opinion is that until it costs the marketeers more to send email, which seems to be the starting point of their economic model, the model will continue to work against all the rest of us.

  • Why don’t you try using https://www.cloudflare.com/ ? They’re known to be really good at blocking spam, DDoSes, as well as caching data to reduce the load on your site. I personally recommend them highly 🙂

  • Volker Berlin

    What is the target of the spammers? If the target is links for Google, then use a simple nofollow for links created by possible spammers. Then it makes no sense for spammers.

    • Volker, the URLs in the Account page summary *are* “nofollow”, so the spammers do not get link credit for them. There are two main scenarios we believe contribute to this behavior:

      1) The links are put up by a marketing firm hired by legitimate businesses to perform Internet promotion

      2) The links are part of an effort by marketeers to get anyone to click on their links, for which they generate a small amount of revenue. Even a relatively small amount of traffic on the huge number of target locations (AKA “money sites”) generates sufficient profit to justify the effort.

  • ILoveEclipse

    Peter,
    what about a mandatory OS repository link, which must contain the email address the user registers with here, plus a timeout for write rights?

    A user without such a repository will get read-only access; a user with such a repository will have write access, BUT only after the crawler visits the repo and confirms that it’s older than 1 month or so.

    Of course the repository is easy to create, but this means that:
    1) The spammers will need a service which allows an unlimited number of repositories per user
    2) Wait a month for write rights
    3) Keep the faked mail for at least a month
    4) Money per hour ratio will be much lower (not one confirmation mail but two etc).

    Most people have their mails for OS development in the commits anyway, so should not be a big deal?

    Keep up the service running, it is really cool.
    Andrey

  • David Martínez Moreno

    Hello, Peter. You can try https://www.stopforumspam.com/. In my limited experience, I was able to match most of my spammers (I’d say more than half of them) from their DB. They have a mix of source addresses and IPs, so you can possibly find more signal in their dataset.

    • David Martínez Moreno

      But to be honest, at your scale I wouldn’t invest too much in the first stage of the attack (the account creation). I think that you should really invest in ways to ban posting content full of URLs.

      In fact you should start reviewing content right after posting and before making it public. Get all the URLs in the post (I understand that that’s the problem), and then try to find their reputation in https://www.surbl.org or any other similar service. Or even put that through a Bayesian filter to learn what spam looks like. Whatever.
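A minimal sketch of the review-before-publishing idea suggested above: pull the hosts out of a post and hold it for review if it has too many links or any link to a known-bad host. The `bad_hosts` set stands in for a reputation service such as SURBL, and the names and the link threshold are assumptions. A real lookup would query the service and match on registered domains rather than exact hostnames.

```python
import re
from urllib.parse import urlparse

# Simple pattern for http(s) URLs in free text; not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(text):
    """Return the set of lowercased hostnames linked from the text."""
    hosts = set()
    for url in URL_PATTERN.findall(text):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def needs_review(text, bad_hosts, max_links=2):
    """Hold content that links to many hosts or to any known-bad host."""
    hosts = extract_hosts(text)
    return len(hosts) > max_links or bool(hosts & bad_hosts)


# Hypothetical blocklist entry and post text, for illustration only.
bad_hosts = {"cheap-pills.example"}
post = "Visit http://cheap-pills.example/buy and https://www.openhub.net/p/numba"
print(needs_review(post, bad_hosts))
```

Content that passes could be published immediately, while anything held could go to a moderation queue or a Bayesian filter as a second stage.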

    • Hi David, Thanks so much for these suggestions and thoughts. I’ve added them to an internal document we’re discussing tomorrow.

  • Antoine P.

    It really seems openhub is accumulating issues. Two severe ones:
    * loading some pages is slow as molasses: for example https://www.openhub.net/p/numba/contributors/summary took ~ 30 seconds here
    * many projects don’t have their stats updated in time (latest Numba update was one month ago, while we have commits almost every day)

  • me

    Require claiming a contribution prior to registration, and/or remove the option of homepage links altogether? I have never had problems finding the homepage of a developer who has one anyway.

    Currently, the page is too slow to be really usable. 🙁

    • Hi, Thanks for your suggestions. We’ve had a similar thought of removing the URL as well from the Account page; perhaps we should solicit feedback from additional users about how they would feel about that.

      We understand that performance is poor. The site is really quite responsive when we stop all analysis. But then we don’t have the up-to-date analysis our users expect from us.

      We are working with some expert consultants to see if there are things we can do in the short term to improve site performance while continuing to generate new analysis. Additionally, we are looking at our production schedule to see if we can make progress on our database re-architecture work sooner this year. If there are concrete changes, we’ll let everyone know.

      • Kaz Nishimura

        Do your code analysis processes lock DB read so often? I hope you can find a solution.

        • It’s not that there are DB locks; it’s more that the DB has to fulfill two very different purposes: (1) be optimized for analyses and (2) be optimized for a web application. These purposes require structures that are rather in opposition to one another.

  • CometVisu maintainer

    Is there a workaround to create a new account?
    I’d really like to add my project https://www.cometvisu.org/ but I can’t w/o an account 🙁

  • Александр Шишенко

    Well, OAuth will do very nicely, I think.

  • Gaga

    This was 41 days ago. When will I be able to create a new account?

    • WhenWillItBeBack

      I agree. I tried to sign up today and got the message that registration is denied. I just wanted to explore their API, pretty much for fun. Sigh.

    • We have an intended solution and are working on it. Shouldn’t be much longer. I apologize for the inconvenience; it was a pretty radical thing to do.

  • Maniac

    Would have been nice to use your site, but I can’t register since registration was turned off almost 2 months ago.

  • Jan

    A combination of 1 & 3 wouldn’t bother me, worth it for the insight that openhub gives.

    • Thanks for your feedback. We are working on re-enabling new account creation and greatly appreciate your patience and support.

  • Hervé

    I currently have the EXACT same problem on a Drupal website I took over some time ago.

    There were more than 100 spam accounts being created every day. I blocked some domains and added a honeypot and stopforumspam.com verification, which helped quite a lot. But spam was and is still hitting the server and creating accounts.

    So I’m still working on the spam cleanup and prevention:
    – to prevent: A content verification service like Mollom is unfortunately not an option for me, but it could have worked really well. So I’m about to replace the ReCaptcha (which seems really ineffective against bots) with logical questions similar to https://textcaptcha.com/. See some here: https://api.textcaptcha.com/myemail@example.com.xml

    – to cleanup: stopforumspam helped just a little. I then deleted or blocked (if not 100% sure) a lot of accounts by checking the content in the database. There was still a large number of suspected spam users, so I did the same as you (sent email to validate the account).

    So I’ll see if logical questions work against spam in the upcoming weeks. I have real hopes.
    You might consider this as well or maybe a phone (sms) verification or even content verification services like Mollom?