Blog Engine Spam Cleanup

7. January 2010

After reviewing my web logs I’ve noticed a big uptick in the number of external links to spam sites. I knew about the spam bots posting tens of hundreds of comments on this blog, and I promptly shut down comment postings. The bad thing is that I authorized all of them to appear, not realizing I had set comments to be moderated, and then promptly forgot to authorize any comments. That’s why nobody ever saw or heard from me in reply to legit comment postings. ~Sorry~

I took it upon myself to write a blog engine comment spam cleanup utility. It’s basically a console application that runs through the XML posting files (the blog is not database driven) and uses LINQ to XML to query comments against criteria on Author, Email, IP, and Website. These elements exist in the posting files within each comment element, and were core to the filtering process.

Essentially I just created a new XML filter file called ‘blocked.xml’ using the following structure:

    <block>
        <emails>
            <email>...</email>
        </emails>
        <authors>
            <author>...</author>
        </authors>
        <ips>
            <ip>...</ip>
        </ips>
        <websites>
            <website>...</website>
        </websites>
    </block>
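For anyone curious what reading a file like this looks like, here is a minimal sketch. It is written in Python with xml.etree.ElementTree rather than the C# and LINQ to XML the utility itself uses, and the function name, the sample values, and the choice to lowercase everything are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Section name in blocked.xml -> tag of the items inside that section.
SECTIONS = {"emails": "email", "authors": "author", "ips": "ip", "websites": "website"}


def load_block_list(xml_text):
    """Parse block-list XML text into a dict of sets, keyed by item tag.

    Values are trimmed and lowercased (an assumption, not something the
    original utility is known to do) so later comparisons ignore case.
    """
    root = ET.fromstring(xml_text)
    blocked = {}
    for section, item in SECTIONS.items():
        values = set()
        for el in root.findall(f"./{section}/{item}"):
            text = (el.text or "").strip().lower()
            if text:
                values.add(text)
        blocked[item] = values
    return blocked


# Made-up sample entries; the real block list is not shown in this post.
SAMPLE = """<block>
    <emails><email>Bot@Example.net</email></emails>
    <authors><author>cheap pills</author></authors>
    <ips><ip>10.0.0.2</ip></ips>
    <websites><website>http://spam.example</website></websites>
</block>"""

blocked = load_block_list(SAMPLE)
```

Using sets instead of lists keeps each lookup cheap even once the block list grows past a few hundred entries.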

From there I load each section into a generic list using LINQ, then loop through the posting files, creating a new XDocument for each one and looking for the comments element’s child comment elements. It queries each comment in each post, looking for any comment whose email, IP, author, or website appears in the generic lists created from the block list file. Upon finding an offending element it takes that element’s parent, which is the comment element, and removes it from the comments element, thus permanently removing the existing spam from all posting files based on the block list.
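The removal step can be sketched roughly as follows. Again this is Python with xml.etree.ElementTree standing in for C# and LINQ to XML, not the utility's actual code. The lowercase child tag names (author, email, ip, website), the exact case-insensitive match, and all sample data are assumptions.

```python
import xml.etree.ElementTree as ET

# Comment fields checked against the block list (tag names are assumed).
FIELDS = ("author", "email", "ip", "website")


def is_blocked(comment, blocked):
    """True if any checked field of the comment is in the block list."""
    for field in FIELDS:
        text = (comment.findtext(field) or "").strip().lower()
        if text and text in blocked.get(field, set()):
            return True
    return False


def remove_spam(post_xml, blocked):
    """Return (cleaned_xml, removed_count) for one post's XML text."""
    root = ET.fromstring(post_xml)
    removed = 0
    # Copy both sequences to lists so removal is safe while looping.
    for comments in list(root.iter("comments")):
        for comment in list(comments.findall("comment")):
            if is_blocked(comment, blocked):
                comments.remove(comment)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Made-up post with one legitimate comment and one spam comment.
POST = """<post>
  <title>Hello</title>
  <comments>
    <comment><author>Alice</author><email>alice@example.com</email><ip>10.0.0.1</ip><website>http://example.com</website></comment>
    <comment><author>Bot</author><email>x1@example.net</email><ip>10.0.0.2</ip><website>http://spam.example</website></comment>
  </comments>
</post>"""

BLOCKED = {"website": {"http://spam.example"}}
cleaned, removed = remove_spam(POST, BLOCKED)
```

A real run would read each posting file from disk and write the cleaned XML back, ideally after taking a backup, since the removal is permanent.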

The problem is building out the block list. I don’t want to arbitrarily remove valid comments. Luckily most of the spam bots that have taken root on the blog share many of the same websites, even though the email addresses are often random for the same bots. I’m still building out my block list and at this point only have a hundred or so blocked elements added. I will most likely need to script out the creation of the block list so I don’t have to build it manually. Still, after an hour of manually building the list, it has dropped the existing spam count by a quarter of what it used to be. Not finished, but it’s getting there.
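One possible way to script that, sketched here under assumptions rather than something I have built: count how often each website shows up across all comments and treat the frequent ones as candidates to review by hand. The sketch is Python with xml.etree.ElementTree, and the function names, the min_count threshold, and the sample data are all hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def website_counts(post_xml_texts):
    """Count how many comments mention each website across all posts."""
    counts = Counter()
    for text in post_xml_texts:
        root = ET.fromstring(text)
        for comment in root.iter("comment"):
            site = (comment.findtext("website") or "").strip().lower()
            if site:
                counts[site] += 1
    return counts


def candidates(counts, min_count=3):
    """Websites seen at least min_count times, most frequent first.

    These are only suggestions; a frequent legitimate commenter would
    show up here too, so the list still needs a manual look.
    """
    return [site for site, n in counts.most_common() if n >= min_count]


def make_post(sites):
    """Build a tiny made-up post with one comment per website."""
    body = "".join(f"<comment><website>{s}</website></comment>" for s in sites)
    return f"<post><comments>{body}</comments></post>"


POSTS = [
    make_post(["http://spam.example", "http://example.com"]),
    make_post(["http://spam.example", "HTTP://SPAM.EXAMPLE"]),
]

counts = website_counts(POSTS)
suggested = candidates(counts)
```

Counting websites rather than email addresses follows from the observation above that the bots reuse sites but randomize emails.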

All of this work brought out an interesting fact: there is no option in the blog comment settings to remove all comments from view, which would have been nice for reducing external links to websites with T & A related content. You’d think that since comments can be enabled and disabled, there would equally be a show/hide all comments setting for those who never want to allow comments. Probably pretty niche, but it would still be nice for just such a spam infestation occasion.

Programming, Programs & Utilities
