My SOC Methodology

This is my process for triaging and investigating alerts in a SOC. It's a bit high-level and assumes you already know some cybersecurity basics and the basics of how your SIEM works, how to read logs, etc. It's derived from my own experience using SIEMs, combined with theoretical frameworks commonly taught in academic settings. It aims to efficiently create high-value deliverables for the recipient of the SOC's output, regardless of whether that's a third-party client, internal security, or even a hobbyist self-hoster who just wants to monitor their very own SIEM.

This work also aims to be platform agnostic, and so is not tied to any specific SIEM product, company, methodology, etc. It is purely my own process for examining alerts in a SIEM/SOC context generically.

My thought process summarized:

Basically, when I see an alert, I step through the following basics (sketched in code after the list):

  1. Do I know what the alert means? What in the rule logic caused it to trigger? Is this a false positive?
  2. Does the alert provide indicators that detect something from the cyber killchain?
  3. Considering the context, is there a defense in depth based remediation to be made?
  4. If no, can the rule be tuned to not trigger on useless conditions like this?
  5. Once I do confirm that the alert needs escalation, what do I need to escalate?
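
As a rough sketch of how these questions chain together, here's the decision flow in Python. The alert is just a dict of yes/no judgments, and every key name is a placeholder I've invented; in practice each answer comes out of the investigation described in the rest of this article, not from a field in your SIEM.

    # Minimal sketch of the triage flow above; all dict keys are invented placeholders.
    def triage(alert: dict) -> str:
        if not alert.get("rule_understood"):
            return "research the rule logic and log source first"
        if not alert.get("maps_to_killchain"):
            return "consider tuning or retiring the rule"
        if alert.get("defense_in_depth_gap"):
            return "escalate: indicators + what they imply + remediation"
        if alert.get("tunable"):
            return "tune the rule, then suppress this instance"
        return "suppress and document the review"

    print(triage({"rule_understood": True, "maps_to_killchain": True,
                  "defense_in_depth_gap": True}))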

Do I know what the rule means?

This is a key aspect of using any SIEM. How to discover this varies by platform, and generally comes with experience and exposure. Once you've internalized what the rule is detecting for the majority of rules in your environment, though, triage and investigation become much faster and more effective.

I tend to ask and answer the following:

  • What is the log source of the rule?
    • SIEMs work by ingesting logs from endpoints on a network.
    • Could the device the log is from actually have useful artifacts?
  • What do the logs and fields returned by the rule mean?
    • Simple as: do you know what the data you're looking at actually means?
    • This may require looking at vendor documentation if you're unsure.
  • Is the data sufficiently parsed?
    • SIEMs can only make good detections from well-parsed data.
    • If the data isn't parsed well, what do I need to consult to get the fields I need to spot an indicator? (A small parsing sketch follows this list.)
  • What is this rule supposed to detect?
    • Does that activity fall on the cyber killchain or map into MITRE ATT&CK?
    • Does the rule's pattern actually retrieve indicators of that activity?
  • What does the rule actually detect?
    • Does the detection as written target strong indicators?
    • Or does it rely on weak indicators that have a high false positive (FP) chance?
    • If the rule is broken and doesn't actually detect what's on the tin, does it still detect any useful indicators for potentially malicious activity?
    • What does the logic powering the rule actually do? Is it broken?
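
To make the parsing point concrete, here's a minimal Python sketch that pulls fields out of a raw key=value firewall line with a regular expression. The log line and field names are invented for illustration; in practice this work is done by your SIEM's ingest pipeline or a custom parser rather than by hand.

    import re

    # Invented example line; real formats vary wildly by vendor.
    raw = "2024-05-01T12:34:56Z fw01 action=DROP src=203.0.113.45 dst=10.0.0.12 dpt=3389 proto=tcp"

    # Pull key=value pairs into a dict so fields can be searched, aggregated, and pivoted on.
    fields = dict(re.findall(r"(\w+)=(\S+)", raw))
    fields["timestamp"], fields["device"] = raw.split()[:2]

    print(fields["src"], "->", fields["dst"], fields["action"])
    # Without this parsing, "src" is just a substring in a blob of text: the SIEM
    # can't alert on it, count it, or let you pivot on it.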

If the rule isn't capable of detecting what it's actually named for, it may be worth tuning immediately. That said, a broken rule doesn't excuse skipping escalation -- even though it doesn't detect what it's supposed to, it may still have caught genuinely suspicious activity by chance.

If the logic, log sources, and other basics make sense, we need to determine whether the detected activity falls on the cyber killchain. If it does, then we want to start hunting for indicators that prove this is actual malicious activity. That's what most of the rest of the article is about!

Does it fit in the killchain and do I have IOCs?

This is where the real "security" work begins. Once we can see that the rule is detecting relevant information, we can dive in and make a determination as to whether there's something worth investigating here.

Cyber Killchain

The cyber killchain is a basic framework for understanding the "steps" of an attack. There are many out there, some more comprehensive than others, but I personally like Lockheed Martin's cyber killchain best. This is mainly because it is short and succinct, and so can be easily memorized. More extensive killchains may more accurately reflect real-world cases, but because they're more extensive, they cannot be memorized easily, and thus cannot be used to make fast decisions when performing first-line triage or when trying to rapidly sort massive amounts of information during an investigation.

The links of the killchain are as follows:

  • Reconnaissance
    • This will be where a threat actor attempts to explore vulnerabilities in a system. Examples:
      • Nmap/port scans
      • Web application fuzzing
      • Service enumeration techniques
      • Username enumeration
  • Weaponization
    • This is the process of turning a vulnerability into an actually deliverable payload. You won't see the weaponization process itself, but you may see weaponized code as a host or network indicator. Examples:
      • Shellcode
      • Malicious scripts
      • .lnk droppers
      • Malicious Office macros
      • Wordlist building for password attacks
  • Delivery
    • This is how the attacker actually gets his malicious object into the environment.
      • Phishing emails
      • Malicious websites
      • Crafted packets coming directly to an exposed service
      • Credentialed access (after a password attack)
  • Exploitation/Execution
    • This is when a vulnerability is leveraged to allow for malicious software to execute.
      • Application crashes (could be an indicator of a buffer overflow)
      • LOLBin execution under unusual circumstances
      • Script engines running unusual code
      • Defense evasion TTPs (process injection, obfuscation, AMSI bypass, etc)
  • Installation
    • This is when malware installs itself after initial execution
      • Dropper retrieving stages from C2 infrastructure
      • Registry persistence
      • Modification of DLLs, services, or system files
  • Command and Control
    • This is communication between external assets held by the threat actor and the malware within the environment
      • Traffic to unusual hosts
      • Encrypted traffic
      • Traffic on unusual ports
      • Signs of beaconing activity (a beaconing sketch follows this list)
  • Actions on Objectives
    • This is the threat actor doing what he intends to do within the environment
      • Signs of the killchain restarting internally
      • Pivoting
      • Lateral movement
      • Data Exfiltration
      • Mass resource access (could be an indication of ransomware, data theft, etc)
      • Stealers (e.g. Mimikatz, LaZagne), cryptominers, or crypters in the environment
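
As one example of what "signs of beaconing" can look like when you go digging, here's a minimal Python sketch that flags outbound connections from one host to one destination occurring at suspiciously regular intervals. The timestamps and the cutoff are invented; real beacon detection has to cope with jitter, sleep randomization, and far larger datasets.

    from statistics import mean, pstdev

    # Invented connection times (seconds) from one internal host to one external IP.
    timestamps = [0, 300, 601, 899, 1202, 1500, 1799]

    # Beacons tend to show low variance in their inter-arrival times relative to the mean.
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    cv = pstdev(deltas) / mean(deltas)  # coefficient of variation

    # The 0.1 cutoff is arbitrary; baseline it against known-good traffic first.
    if cv < 0.1:
        print(f"possible beaconing: ~{mean(deltas):.0f}s interval (cv={cv:.3f})")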

Note that because the Lockheed killchain is short, in order to fully describe an attack, we may need to think of linking several killchains together. For example, for a first-stage dropper, "actions on objectives" may simply be to establish a beachhead presence and then start doing enumeration for privilege escalation. Once that's achieved, the focus may shift to enumerating for opportunities to pivot or move laterally through the environment. It might not be until you've reached the end of this third killchain that you finally start seeing the threat actor's end-game activities like cryptomining, infostealing, ransomware, etc.

Rather than working through contrived examples myself, I would direct you to The DFIR Report. This site presents a collection of case studies on real-world attacks that lay out the timeline of events from initial access to final actions on objectives.

In any case, if the activity in an alert could plausibly represent one of the specific stages on your killchain, then we need to, at minimum, investigate further in order to hunt for indicators of compromise.

If it does not indicate any sort of activity on a killchain, the rule may warrant tuning or even removal depending on circumstances. Basically, if it's not detecting either an availability problem or a killchain item, it sorta raises the question of what the rule's point is.

Probing for IOCs/Pyramid of Pain

What we want to do once we confirm that the rule is likely detecting activity related to a link in the killchain is to manually confirm that we have indicators of compromise. These are artifacts we can use to verify that the activity described by the rule actually occurred, and whether it appears to be genuinely malicious.

A good starting point for identifying these would be the "Pyramid of Pain" model, which gives us a list of indicator types ranked by how difficult they are for an attacker to change or obfuscate (a small ranking sketch follows the list). These are (from strongest to weakest):

  • TTPs
    • The actual behaviours used to accomplish a task (but these can be hard to establish from SIEM logs alone).
    • You'll usually only see this indirectly, e.g. through examining several host/network artifacts that indicate that a certain TTP was attempted.
  • Tools
    • The software used to perform TTPs
    • Again, often detected indirectly through host and network artifacts, or hashes.
  • Network and Host artifacts
    • These could be anything: protocols, file paths, registry keys, command lines, etc. If you look up a report on a type of malware, you'll likely find lists of this sort of thing.
    • These are often your bread and butter in the SIEM, as they're the sort of thing that will show up in ingested logs.
    • Unfortunately, you may lack the visibility to see them, depending on how your monitored environment is set up.
    • You'll often find documentation for these in advisories, whitepapers, etc.
    • It may also be worth checking GitHub for publicly known tools and malware to see if you can work out what the artifacts would look like from the code.
  • Domain names
    • Domains associated with delivery and C2 activity.
    • Found on threat intel platforms like AbuseIPDB, AlienVault OTX, etc.
  • IPs
    • IP addresses used for delivery and C2.
    • Found on threat intel platforms like AbuseIPDB, AlienVault OTX, etc.
    • Fairly weak indicator
      • Try to confirm recent malicious activity reports.
      • Also, ensure multiple sources confirm malicious activity.
  • File hashes
    • The specific file hash of a given file artifact.
    • VirusTotal is a good reference here, but not perfect.
    • Very weak indicator; hashes can be changed trivially.
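
If it helps to keep the ordering straight during triage, here's a trivial Python sketch that sorts whatever indicators an investigation has produced by pyramid level, strongest first. The levels mirror the list above; the example indicators themselves are invented placeholders.

    # Higher number = higher on the pyramid = more painful for the attacker to change.
    PYRAMID = {"hash": 1, "ip": 2, "domain": 3, "artifact": 4, "tool": 5, "ttp": 6}

    findings = [
        ("ip", "203.0.113.45"),
        ("artifact", r"HKCU\...\CurrentVersion\Run\Updater"),
        ("hash", "d41d8cd98f00b204e9800998ecf8427e"),
    ]

    # Triage the strongest evidence first.
    for kind, value in sorted(findings, key=lambda f: PYRAMID[f[0]], reverse=True):
        print(f"{kind:>8}: {value}")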

Usually, I'll check threat intelligence sources like AbuseIPDB, AlienVault OTX, and VirusTotal for information about any of these indicators, especially IPs, hashes, and domains. The Sputnik browser extension is useful for this, as is plain googling. For host and network indicators, I may simply google blindly to see if there are any matches for whitepapers that list the indicator I've found as being related to known malware.

Bear strongly in mind that if you're using crowdsourced threat intelligence, try to have "2 witnesses" for indicators. Publicly crowdsourced threat intel can be as useless as YouTube comments at worst, so you want to ensure that malicious activity is attested by multiple sources, preferably on multiple platforms, before you determine that an indicator is truly malicious. I.e. if I see only one guy on one threat intel platform saying that an IP is bad, I'm probably going to ignore it, because there's very little attestation of the malicious behaviour.
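
As a concrete example of putting the "2 witnesses" idea into code for an IP, here's a hedged Python sketch against AbuseIPDB's v2 "check" endpoint. The URL, headers, and response fields reflect that API as I understand it, so verify them against the current documentation; the thresholds for "enough distinct reporters" and "high enough confidence" are my own arbitrary picks, not official guidance.

    import requests

    ABUSEIPDB_KEY = "your-api-key-here"  # placeholder

    def check_ip(ip: str, max_age_days: int = 90) -> dict:
        """Pull crowdsourced report data for an IP from AbuseIPDB."""
        resp = requests.get(
            "https://api.abuseipdb.com/api/v2/check",
            headers={"Key": ABUSEIPDB_KEY, "Accept": "application/json"},
            params={"ipAddress": ip, "maxAgeInDays": max_age_days},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["data"]

    def looks_malicious(data: dict) -> bool:
        # "2 witnesses": multiple distinct reporters AND a non-trivial recent score.
        return data.get("numDistinctUsers", 0) >= 2 and data.get("abuseConfidenceScore", 0) >= 50

    # Ideally, repeat the same check on a second platform (OTX, VirusTotal, etc.)
    # and only treat the IP as a strong indicator if both agree.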

On the other hand, tailored indicators presented by a professional source in a recent whitepaper would be a pretty "strong" indicator for something like host and network artifacts.

Overall, I want to confirm that the rule picked up on a strong malicious indicator. If it did, I may have enough to start considering remediations and how I want to escalate things. If not, or if the indicators are weak, I may need to dig deeper by pivoting to other logs with the data I have.

Finding "pivotables"

In a case where I cannot immediately find a strong IOC with the initial logs associated with an alert, I may need to "pivot" my investigation into other logs to see if anything can be discovered through alternative means. For example, if malicious network traffic is detected and I have IPs, but these end up being only weak indicators, I might pivot off of the victim IP to see if I can discover the host name or an associated user account. Once this is found, I may try something like pulling host logs to see if there are any host indicators that can be used to corroborate the idea that I've detected malicious activity. Or, if I see a strange process, I may pivot on the process GUID to see if I can find an indicator somewhere else in the family of parent and child processes of a given piece of activity.

A short list of things I try to pivot on in different cases (sketched in code below):

  • GUIDs
    • These tie together objects under a unique identifier that can often relate different things such as processes, threads, etc.
  • Parent Process
    • Similar to above. If a process calls child processes, you'll find them all lumped under one parent.
  • Hostname
    • If all you have is an IP from a network log, you may be able to pivot into host logs like this
  • Username
    • You may be able to pivot from network to host with this.

The list could really be endless. Just creatively think about what sorts of information could be directly linked to information that exists in other types of logs.
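
To show what pivoting looks like mechanically, here's a minimal Python sketch over a handful of already-parsed events. Everything here (event shapes, field names, values) is invented; in a real SIEM the same idea is just a follow-up search keyed on the pivot value.

    # Invented, pre-parsed events of different types.
    events = [
        {"type": "network", "src_ip": "203.0.113.45", "dst_ip": "10.0.0.12", "host": "ws-042"},
        {"type": "process", "host": "ws-042", "user": "jdoe", "parent_image": "winword.exe",
         "image": "powershell.exe", "cmdline": "powershell -enc JAB..."},
        {"type": "process", "host": "ws-101", "user": "svc_backup", "parent_image": "services.exe",
         "image": "backup.exe", "cmdline": "backup.exe /daily"},
    ]

    def pivot(events, field, value):
        """Return every event sharing the pivot value, regardless of log type."""
        return [e for e in events if e.get(field) == value]

    # Start from the host behind the network alert and look for corroborating host artifacts.
    for e in pivot(events, "host", "ws-042"):
        print(e["type"], e.get("image", e.get("dst_ip")))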

If I cannot find any pivotables, I'm in a blind spot and have to make an executive decision:

  1. If my initial indicators from the alert were "strong", but I cannot find confirmation that the attack succeeded or failed, it may be a good idea to lean towards escalating and request confirmation of the activity.
  2. If my initial indicators were "weak", and there is no further evidence of plausibly related killchain activity, it is likely safe to suppress.
    • For establishing "plausible relation", consider (a small sketch follows this list):
      • Timeline
      • Originating hosts
      • Originating users
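
For a concrete reading of "plausible relation", here's a small Python sketch that treats two events as plausibly related if they share an originating host or user and fall within a short window of each other. The field names and the one-hour window are invented; pick criteria that fit your environment.

    from datetime import datetime, timedelta

    def plausibly_related(a: dict, b: dict, window: timedelta = timedelta(hours=1)) -> bool:
        close_in_time = abs(a["time"] - b["time"]) <= window
        shared_origin = a.get("host") == b.get("host") or a.get("user") == b.get("user")
        return close_in_time and shared_origin

    alert = {"time": datetime(2024, 5, 1, 12, 0), "host": "ws-042", "user": "jdoe"}
    other = {"time": datetime(2024, 5, 1, 12, 20), "host": "ws-042", "user": "svc_backup"}
    print(plausibly_related(alert, other))  # True: same host, 20 minutes apart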

Don't be afraid to suppress if all you have is a weak indicator. You can end up in a very paranoid place trying to correlate events that are 100% unrelated. This can be a tremendous waste of time, and the longer you sink time into the investigation, the more likely you are to experience sunk-cost bias and try to force your logs to point to a true positive that just isn't there. Escalating this kind of activity can be a major loss of time and money for the recipient and can contribute to a feeling that you have a tendency to "cry wolf", which ultimately leads to less trust in the SOC and thus a riskier security posture.

Disconfirming indicators

If at any time during my investigation I find one of these, I'll consider it a disconfirmation of malicious activity and pretty much drop my investigation/consider the activity benign:

  • WTFBins or management software that is confirmed to be expected in the environment.
  • Vulnerability scans/malicious traffic coming from confirmed vulnerability scanners
  • Malicious activity from confirmed and documented penetration testing sources

Obviously, this stuff could be dangerous if it is not confirmed by the client, so I'll need to check prior documentation to ensure that stuff from the above is indeed expected. If it is not confirmed by the client, it needs escalation; tools like this can very much be used by threat actors to do bad stuff, and using off the shelf tools could be a way of trying to blend in with expected activity.

Note that I don't necessarily consider "invulnerability" to be a total disconfirmation. While it can be something to note to the client to reassure them that they're not compromised, there still may be actionables related to the organization's overall security posture that wind up being exposed by the event. E.g. when a threat actor probes your IIS server for an Apache vuln and it hits the server, sure, it's not likely to have been a successful compromise...but your IDS is still misconfigured and let signatured malicious traffic cross your DMZ without blasting that nonsense out of the sky, and we need to talk about that.

Same with external scans from non-malicious sources (e.g. Shodan, Censys, etc). While there is likely no hostile intent from such activity, it can still expose defense in depth issues that need remediation.

Is there a defense in depth based remediation?

Defense in depth is a 7-layer model for how overlapping security controls should be applied to an environment for maximum efficacy.

The model looks like this:

  • Policy
    • This is stuff like password length rules, acceptable use policies, separation of duties, etc.
    • You may make suggestions re: password policy from time to time.
  • Physical
    • This is related to the physical locks and hardware that secures physical, on-prem assets.
    • In MSSP land, or any situation where you're monitoring assets you don't physically control, you rarely have much to say here, and physical detections in a SIEM generally tend to be limited to things like USB inserts.
    • This is mostly handled by facilities.
  • External Network
    • This is going to be stuff related to border firewalls that divide the internal network from the public internet.
      • Are they configured appropriately to drop incoming malicious traffic?
      • Do they drop outbound C2 traffic?
      • Are your external management ports properly hidden behind VPNs and locked down to trusted hosts? Etc.
    • This is very much a bread-and-butter area where you'll find major misconfigurations.
  • Internal Network
    • This is more or less related to firewall policies between internal hosts.
      • Are they properly configured to be able to deny pivots and lateral movement through the network?
      • Are VLANs and subnets properly segregated from one another?
    • This is a hard spot to offer good advice in because lots of stuff inside of a network can trigger false positives. E.g. RMM tools that are used internally can strongly resemble RATs being used for pivoting.
  • Host
    • This is stuff on the host itself.
    • HIDS, host firewalls, EDR, etc.
      • Is it up and running?
      • Is it properly configured?
      • Are the server applications themselves properly hardened?
  • Application
    • From a SOC perspective, this is mostly going to involve versioning.
      • Are applications on the server running known-vulnerable versions?
    • This gets more detailed when dealing with appsec itself as a discipline, though, and at that level will involve things like code auditing for best practices and detecting and correcting vulnerable coding practices.
  • Data
    • This is going to involve whether or not sensitive information (either at rest or in transit) is properly secured.
    • Usually, you'll look for failures to use encryption when it's appropriate to do so here.

If an event exposes that there's something that needs to be tightened up in order to have perfect defense in depth, then I should have clear remediation advice to give to the client upon escalation.
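
As a trivial illustration of turning a layer-level gap into canned remediation language for an escalation, here's a sketch built on a plain lookup table. The gap names and advice strings are all invented; real advice has to be tailored to the client's environment.

    # Invented mapping: observed gap -> (defense-in-depth layer, remediation advice).
    REMEDIATIONS = {
        "exposed_rdp": ("External Network",
                        "Restrict 3389/tcp at the border firewall; require VPN for remote management."),
        "edr_agent_down": ("Host",
                           "Reinstall/re-enable the EDR agent and alert on agent check-in gaps."),
        "plaintext_creds": ("Data",
                            "Enforce TLS on the affected service and rotate the exposed credentials."),
    }

    layer, advice = REMEDIATIONS["exposed_rdp"]
    print(f"[{layer}] {advice}")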

If I cannot think of anything wrong with defense in depth for the organization...then it's debatable whether there's an issue. If defense in depth is so complete in an environment that no improvements can be suggested, then it is simply unlikely that the attack was successful, and similar future attacks are likely sufficiently guarded against. There is nothing to tell the client, and so the issue can be suppressed. I.e. a million exploits bouncing off the border firewall may represent real indicators of attempted compromise...but the fact that they're being denied means nothing happened and there's nothing to tell the network team to change.

Obviously, don't get arrogant, though. Just because you can't think of a remediation doesn't mean something isn't wrong. Be sure to consult others if that's a possibility and you feel dubious about something.

Can the rule be tuned?

In order to escalate, I want to see:

  • Definitive killchain related activity
  • Strong indicators
  • A defense in depth remediation

If I don't have these, I need to consider whether the rule can be tuned.

Some thoughts I usually have for tuning rules:

  • Does the rule have any logical errors?
    • Poorly formatted booleans, incorrect field parameters, etc.
    • This stuff can totally break a rule and make it fire on garbage.
  • Is the activity the rule is intended to detect actually killchain activity?
    • I.e. is the rule designed to detect actually dangerous activity?
    • If not, why do we have the rule?
  • Do the indicators detected by the rule's logic actually detect the killchain activity it's designed to detect?
    • If not, do we have the parsed data to actually make that detection?
  • How high up the pyramid of pain am I making detections?
    • Rules based on lower tier indicators like IP addresses, hashes, or even specific file names/paths (a common host indicator) are probably going to be garbage compared to rules that target TTPs.
    • Conversely, TTPs are hard to write rules for.
    • Consult MITRE ATT&CK for detection advice for different TTPs.
  • Do I have enough parsed data to actually make the detection I need?
    • If not, could I get it with a better integration or host agent? Consult the client if this is the case.
    • If not, the "bad" rule may simply need to stay in order to be a starting point for a human analyst to make an investigation from.
    • Whether to remove or keep will depend on whether the rule triggers on false positives so often as to make it a fatigue-source for the analysts, and whether the true positive case is actually critical.
  • Does the rule consider whether the activity was remediated or not?
    • E.g. does the rule consider whether the firewall dropped the traffic or not?
    • Does the rule integrate with other software to detect whether execution was halted, e.g. by EDR or other filters?
    • If not, do we have the parsed data to make this determination?
  • Can an organizational level exclusion be built for the rule?
    • I.e. to not trigger for a given IP, host or user due to it being a known benign source of activity? This will need consultation with the client.
    • With permission, we can generally exclude lower pyramid of pain items without compromising the rule's ability to make good detections (a sketch of a tuned rule follows this list).
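
Most SIEMs express this sort of tuning in their own query language, but as a language-neutral sketch, here's the shape of a tuned rule in Python: it ignores traffic the firewall already dropped and skips documented, client-approved benign sources. The field names, the exclusion list, and the signature string are all invented.

    # Client-approved exclusions (documented vuln scanners, pentest ranges, etc.).
    APPROVED_SCANNERS = {"10.0.5.10", "10.0.5.11"}

    def should_alert(event: dict) -> bool:
        # Organizational exclusion agreed with the client.
        if event.get("src_ip") in APPROVED_SCANNERS:
            return False
        # Consider remediation: traffic the firewall dropped is far less interesting.
        if event.get("action") == "DROP":
            return False
        # Whatever indicator the rule is actually built on (invented here).
        return "EXPLOIT" in event.get("signature", "")

    print(should_alert({"src_ip": "198.51.100.7", "action": "ALLOW",
                        "signature": "EXPLOIT attempt against web app"}))  # True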

In a best-case scenario, we should be able to tune a rule to the point that it almost never produces a false positive, or at least always produces output that warrants a deep, host-level investigation. At this point, the rule can be added to whatever automation your SIEM is capable of for automatically escalating to the client. This will usually only be possible if we have very good parsed data available that nearly always gives us a definitive TTP.

The meh-case scenario would be to simply tune the rule to be less "noisy" for the human analysts. This will usually be the case if we have middle of the pyramid data that demands investigation and human triage to make a determination. Being middle of the pyramid, an equal volume of false and true positives can occur, which is why the human analyst is needed.

The worst-case scenario is when an alert has bottom-of-the-pyramid indicators, but represents a critical issue that would absolutely need to be reported if detected. These sorts of alerts tend to cause alert fatigue and have a tendency to slip through the cracks due to having very high false positive rates. If parsed data is really this low quality, even human investigations may not be capable of positively identifying true positive activity. This is the abstract sort of hell that causes breaches!

Actually escalating activity

If I do have the three factors I want to see in order to escalate, then I will. In a professional SOC, you're probably going to use a template to build out your escalation. In my opinion, a good template will allow you to express three things:

  • What are my indicators?
    • What is the affected host?
    • The source and destination IPs?
    • Host indicators like process IDs, file paths, etc?
    • What does threat intelligence or other researcher output say?
  • What do the indicators imply?
    • Why is this a dangerous situation?
  • What needs to be done?
    • I.e. how do I improve defense in depth?
    • If there's evidence of an actual compromise, what's good incident response advice?
    • If I lack the data necessary to definitively identify a compromise, what does the client need to do to facilitate the investigation?

If your template does not allow you to express these three things, then your escalations likely are not providing much value to whoever has to actually read them.

Conversely, if your template contains much more than these three fields, it likely contains superfluous information that makes generating the escalation by the analyst inefficient, and makes reading the escalation confusing and less effective for the recipient.
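
For what those three fields can look like as an actual structure, here's a minimal Python sketch of an escalation record. The field names and sample values are mine, not a standard; your SOC's template will have its own shape.

    from dataclasses import dataclass

    @dataclass
    class Escalation:
        # What are my indicators?
        affected_host: str
        indicators: list        # IPs, hashes, paths, process IDs, threat intel links...
        # What do the indicators imply?
        assessment: str         # why this is a dangerous situation, mapped to the killchain
        # What needs to be done?
        recommendations: list   # defense-in-depth fixes, IR steps, data requests

    report = Escalation(
        affected_host="ws-042",
        indicators=["203.0.113.45 (possible C2, attested by 2+ intel sources)",
                    "powershell.exe spawned by winword.exe with an encoded command"],
        assessment="Likely post-exploitation C2 beaconing following a phishing delivery.",
        recommendations=["Isolate ws-042 pending investigation",
                         "Block 203.0.113.45 at the border firewall",
                         "Pull the full EDR timeline for ws-042 to confirm the installation stage"],
    )
    print(report.assessment)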

Remember: your escalations are not just things you make to show your boss that you're not asleep! They are the deliverable that the client or your internal teams rely on for advice on how to fix their security issues.