How to detect phishing websites in real-time using open source

By Bradley Kemp

The key to successfully combating phishing is detecting it early: the sooner you can report a phishing site to its hosting provider, the fewer people will fall victim to it. You don't need any expensive services to do this. It's possible to build your own phishing detection system for free using open source tools.

How? By using Certificate Transparency logs.

What are CT (Certificate Transparency) Logs?

Certificate Transparency (CT) logs are public databases of every HTTPS certificate issued by publicly-trusted Certificate Authorities (i.e. certificates that work in standard web browsers). They were originally introduced to address the issue of fraudulent or maliciously issued certificates, but have turned out to be extremely powerful for detecting phishing.

Because Chrome and Safari will only trust an HTTPS certificate if it has been submitted to CT logs, all public HTTPS certificates get submitted to these logs as soon as they're issued. So, unlike lists of newly registered domains, which only get updated once a day, new domains appear in CT logs almost instantly. The logs also contain subdomains, not just root domain names.

Best of all, Certificate Transparency logs are completely open, so you can download their contents for free.

The phishing_catcher project

phishing_catcher is an open source project which scans CT logs for domains containing multiple suspicious keywords.

In its default configuration, phishing_catcher collects potential phishing domains of any kind, using a scoring system built around generic security-themed keywords:

keywords:
    # Generic suspicious
    'login': 25
    'log-in': 25
    'sign-in': 25
    'signin': 25
    'account': 25
    'verification': 25
    'verify': 25
    'webscr': 25
    'password': 25
    'credential': 25
    'support': 25
    'activity': 25
    'security': 25
    # ...

But it's easy to configure this scoring system and tailor the domains it detects to focus on your brand specifically.

Configuring phishing_catcher to detect phishing for your brand

Cloning the repository

phishing_catcher is a Python project. You can run it locally on your laptop or, ideally, on a server where it can run uninterrupted.

The first step to setting it up is cloning the repository from GitHub:

$ git clone https://github.com/x0rz/phishing_catcher.git

Now you'll have a folder named phishing_catcher containing two key files:

  • catch_phishing.py: the script that consumes the CT logs looking for domains
  • suspicious.yaml: the keywords and associated scores used by the script for determining what is suspicious

Setting up the config file

If you open suspicious.yaml you'll see it contains a bunch of keywords that probably aren't relevant to your brand.

Instead, you'll want to replace it with a config containing keywords and scores in three categories:

  • Your brand name (and variations of it). dnstwist is a tool that can help with this.
  • Industry specific keywords. For example, a bank might include keywords like "payment" and "balance" as these are themes that phishers might commonly use in the lures they send out to customers.
  • Other generic keywords like "login" and "support" that phishers might include in their domains.

keywords:
  # Variations on your brand name (scored highly)
  acmebank: 50
  acme-bank: 50
  acme: 50
  # ...

  # Industry-specific keywords
  payment: 25
  payee: 25
  statement: 25
  balance: 25
  card: 25
  # ...

  # Generic keywords
  login: 25
  log-in: 25
  sign-in: 25
  signin: 25
  account: 25
  support: 25
  # ...

Your keywords and scores will evolve over time as you detect phishing sites and learn which keywords are commonly used, but these are a good starting point.

Bear in mind that phishing_catcher will only notify you about a domain if its total score (the sum of the scores of every keyword it matches) is greater than 65.
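
To see how the threshold interacts with the example config, here is a minimal sketch of the keyword-summing step. It is a simplification for illustration, not the code in catch_phishing.py (which may apply further heuristics): `score_domain`, `is_suspicious`, and the use of plain substring matching are assumptions of this example.

```python
# A subset of the example config above, as a Python dict.
KEYWORDS = {
    "acmebank": 50,
    "acme-bank": 50,
    "acme": 50,
    "payment": 25,
    "login": 25,
    "support": 25,
}

THRESHOLD = 65  # the notification threshold mentioned in the article


def score_domain(domain, keywords=KEYWORDS):
    """Sum the scores of every keyword that appears anywhere in the domain."""
    domain = domain.lower()
    return sum(score for word, score in keywords.items() if word in domain)


def is_suspicious(domain, keywords=KEYWORDS, threshold=THRESHOLD):
    """Return True when the total keyword score is greater than the threshold."""
    return score_domain(domain, keywords) > threshold
```

Under these assumptions, `acme-login.example.com` scores 75 (brand plus one generic keyword) and is flagged, while `acme.example.org` scores 50 and `login.example.com` scores 25, so neither is. Note that overlapping keywords stack with substring matching: `acmebank-support.example.net` matches both `acmebank` and `acme`, for a total of 125.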

Running phishing_catcher

Running phishing_catcher is simple. You can use Docker:

$ docker build . -t phishing_catcher
$ docker run phishing_catcher

Or directly on your system:

$ pip install -r requirements.txt
$ ./catch_phishing.py

You'll see a line telling you that it's established a connection to CertStream (the way it consumes Certificate Transparency logs) and then a subsequent line for each domain it detects.
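
If you want to see what the script is working with, the sketch below pulls the domain names out of a single CertStream message. It assumes the message is a dict with a `message_type` of `certificate_update` and the names listed under `data` → `leaf_cert` → `all_domains`, which is the shape the `certstream` Python library passes to its callback as far as I know; `domains_from_message` is a hypothetical helper, not part of phishing_catcher.

```python
def domains_from_message(message):
    """Extract the domain names from one CertStream message.

    Returns an empty list for anything that is not a certificate update
    (for example heartbeat messages).
    """
    if message.get("message_type") != "certificate_update":
        return []

    leaf_cert = message.get("data", {}).get("leaf_cert", {})
    domains = []
    for name in leaf_cert.get("all_domains", []):
        name = name.lower()
        # Wildcard certificates list names like "*.example.com"; drop the prefix.
        if name.startswith("*."):
            name = name[2:]
        domains.append(name)
    return domains


# Wiring it up (assumption: the certstream package is installed and its
# listen_for_events(callback, url=...) API calls callback(message, context)):
#
#   import certstream
#
#   def on_message(message, context):
#       for domain in domains_from_message(message):
#           print(domain)
#
#   certstream.listen_for_events(on_message, url="wss://certstream.calidog.io/")
```

Keeping the parsing in a plain function like this makes it easy to test against recorded messages without a live connection.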

Productionising phishing_catcher

phishing_catcher is a great proof of concept, but on its own isn't a production-ready detection system:

  • It only prints detections to the console, so you'll need to add some way to save these domains and raise alerts to review them.
  • It doesn't have any way to save its progress through CT logs, so you'll need to keep it running continuously. Otherwise, it will miss domains created while it wasn't running (which is why running it on your laptop is a bad idea).
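
For the first gap, one simple option is to append each detection to a file that a separate alerting job can read. The sketch below is one possible approach under stated assumptions, not something phishing_catcher provides: the `DetectionLog` class, the JSON Lines format, and the field names are all invented for this example. It does not address the second gap (resuming from a position in the CT logs).

```python
import json
import time
from pathlib import Path


class DetectionLog:
    """Append-only JSON Lines store for detected domains, with de-duplication."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen = set()
        # Reload earlier detections so restarts don't raise duplicate alerts.
        if self.path.exists():
            with self.path.open() as fh:
                for line in fh:
                    line = line.strip()
                    if line:
                        self.seen.add(json.loads(line)["domain"])

    def record(self, domain, score):
        """Save a detection. Returns False if the domain was already recorded."""
        if domain in self.seen:
            return False
        self.seen.add(domain)
        entry = {"domain": domain, "score": score, "seen_at": time.time()}
        with self.path.open("a") as fh:
            fh.write(json.dumps(entry) + "\n")
        return True
```

A caller would create one `DetectionLog("detections.jsonl")` at startup and only raise an alert when `record(...)` returns True, since the same domain often appears in several certificates.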

Thankfully, the logic is very simple, so you can fairly easily adapt the catch_phishing.py script, or even write your own.

Upgrading your detection powers

The system described here can be extremely effective, but over time you'll likely find its performance weak in a couple of areas:

  • Depending on your brand name, false positives can be very common. If your brand is an everyday word, it will match many unrelated domains: this approach isn't going to detect "Apple" phishing sites very effectively!
  • You'll be alerted to a lot of domains which are simply squatters hoping to sell you the domain and who aren't actually going to use it for phishing.

Both of these can be solved by analysing the website actually being hosted on each domain, not just the domain name. If this is something you're struggling with, we'd love to chat about how our enterprise plan can help.
