Apache Solr Driven Keyword Searching in Autopsy
Keyword searching is one of the most widely used techniques across all varieties of digital investigations. On the surface, it seems fairly straightforward: figure out what names, places, things, activities, applications, etc. you want to search for and perform the search. Simple, right? But what about misspellings? Or patterns of text rather than an exact match? Many commercial digital forensics tools address some of these considerations, and so does Autopsy, but it does so for free. Autopsy uses the open source Apache Solr and Tika libraries for fast and efficient text indexing and searching.
There are two basic times when you can keyword search in Autopsy.
- Keyword lists can be used during ingest when the files are being added to the case.
- Individual ad-hoc queries can be performed during the course of the analysis.
These two methods can be used at the same time during an investigation because you can start your analysis while the ingest process continues to process files. If you find keywords that you want to add to your search, you can add them to the ingest list and the search for them will run in the background. This means you don’t have to wait for your tool to finish processing before you start your investigation. Why wait? Investigate!
Ingest Modules and Near Real-Time Results
When adding a disk image to a case, the user chooses the ingest modules that they want to run on the image. One of the standard modules is Keyword Search, which extracts the text from files and adds it to a Solr index. Periodically (every 5 minutes by default), it queries the index for the list of keywords that the user has configured.
The user can search for plain text strings (like “Jesse James”) or patterns (regular expressions, or REGEX). Autopsy comes with a list of pre-defined regular expressions to find phone numbers, email addresses, IP addresses, and URLs. A common strategy is to load the keyword lists with common misspellings of important words, or to develop regular expressions that find close matches to words or patterns of interest.
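As a rough illustration of the kinds of patterns involved, here is a minimal Python sketch. The expressions below are simplified assumptions written for this example; they are not the pre-defined expressions that ship with Autopsy, and the `find_hits` helper is hypothetical.

```python
import re

# Illustrative patterns only; these are NOT the expressions shipped with Autopsy.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # Tolerates close spellings such as "Jesse", "Jessie", and "Jesie".
    "name": re.compile(r"\bJess?i?e\s+James\b", re.IGNORECASE),
}


def find_hits(text):
    """Return {pattern_name: [matched strings]} for every pattern that matches."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    sample = "Call 555-123-4567 or mail jj@example.com from 10.0.0.1; ask for Jessie James."
    for name, found in find_hits(sample).items():
        print(name, found)
```

Note that the "name" pattern catches a misspelling with a single expression, which is the trade-off described above: one pattern can replace several list entries, at the cost of being harder to read and easier to get wrong.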
Autopsy’s ingest modules run in the background and make every effort to stay out of the way during the investigation, but keyword hits and other artifacts can be valuable to an investigator as soon as they are discovered during file analysis. Autopsy publishes the keyword hits it finds in two places:
- The evidence tree on the left-hand side of the main UI
- The ingest inbox, which has an icon in the top right of the main UI
The motivation for the ingest inbox is that it gives you a chronological perspective on what has been found. If you are focused on the user’s web activity, you won’t notice that the number of keyword hits on a specific term went from 4 to 5. The ingest inbox, though, will tell you what has been found since you last opened it. The goal is to notify (but not annoy) an investigator that new evidence items of interest have been found by the background analysis tasks.
Ad-hoc Searching and Using an Index
You can never know all of the keywords that you will care about when you start a case. Autopsy builds a text index (using Apache Solr) of the text on the drive so that later searches are very fast. You can think of a text index like the index in a book: it maps words or concepts directly to the pages and locations where they appear. For instance, if you wanted to find every mention of the phrase “digital forensics” in a textbook, you could read through every page one by one, whether or not it contains the phrase, and highlight each occurrence as you find it. Or you could look the phrase up in the index and go directly to the pages where it appears. Autopsy works like the second option when doing keyword searching, which means you get your results fast.
Ad-hoc queries are made through the search bar in the top right of the Autopsy interface. It accepts both REGEX patterns and plain search strings. You can also run new lists that you’ve created and loaded via the configuration options from this area of the interface. Each search opens a new tab in the results viewer panel, which means you can run multiple searches in parallel and review them independently.
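To make the book-index analogy concrete, here is a minimal sketch of an inverted index in Python. It is a toy model of the general idea only and does not reflect how Solr stores or queries its index; the `build_index` and `find_phrase` functions are hypothetical names chosen for this example.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to {page_number: [word positions on that page]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_number, text in pages.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][page_number].append(position)
    return index


def find_phrase(index, phrase):
    """Return sorted page numbers where the words of `phrase` appear consecutively."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return []
    pages = []
    # Only pages containing the first word are candidates; no page is read in full.
    for page_number, positions in index.get(words[0], {}).items():
        for start in positions:
            if all(start + offset in index.get(word, {}).get(page_number, [])
                   for offset, word in enumerate(words)):
                pages.append(page_number)
                break
    return sorted(pages)


if __name__ == "__main__":
    book = {
        1: "Digital forensics is fun",
        2: "Forensics of digital media",
        3: "An intro to digital forensics tools",
    }
    print(find_phrase(build_index(book), "digital forensics"))
```

The index is built once, up front, and every later lookup consults only the entries for the words in the query. That is why the cost of indexing during ingest pays off when you run many ad-hoc searches afterwards.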
There are two general methods for performing keyword searches in digital forensics:
- By interpreting the file types, extracting the text, converting it to Unicode (if needed), and matching it against the list of keywords.
- By coming up with all possible byte sequences of the keywords in the possible encodings and looking for all of those byte sequences at the lowest levels of the drive data.
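The second method can be sketched in a few lines of Python. This is an illustration of the general byte-sequence technique, not code from any particular tool; the function names and the default list of encodings are assumptions made for this example.

```python
def keyword_byte_patterns(keyword, encodings=("utf-8", "utf-16-le", "utf-16-be")):
    """Return {encoding: byte sequence} for each encoding that can represent the keyword."""
    patterns = {}
    for encoding in encodings:
        try:
            patterns[encoding] = keyword.encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the keyword
    return patterns


def search_raw(data, keyword, encodings=("utf-8", "utf-16-le", "utf-16-be")):
    """Return sorted (offset, encoding) pairs for every occurrence in the raw bytes."""
    hits = []
    for encoding, needle in keyword_byte_patterns(keyword, encodings).items():
        start = data.find(needle)
        while start != -1:
            hits.append((start, encoding))
            start = data.find(needle, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    raw = b"\x00\x01" + "Jesse".encode("utf-8") + b"\xff" + "Jesse".encode("utf-16-le")
    print(search_raw(raw, "Jesse"))
```

A search like this only sees the bytes as they are stored on disk, so text inside a compressed or specially encoded file would not match unless the file is first interpreted, which is what the first method does.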
Autopsy extracts the text using Apache Tika and some other open source libraries. For files in formats that we don’t support, and for unallocated space, we extract the strings instead. In the Tools -> Options area of Autopsy, you can configure which languages you want to extract strings from. The more languages you select, the more false positives you will see.
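For readers unfamiliar with string extraction, the sketch below shows the basic idea in Python: scan raw bytes for runs of printable characters. It is a simplified assumption of how such a fallback can work, handles only basic Latin characters stored as single bytes or as UTF-16LE, and is not Autopsy's implementation; `extract_strings` and `min_length` are hypothetical names.

```python
import re


def extract_strings(data, min_length=4):
    """Return printable runs of at least `min_length` characters found in raw bytes.

    Looks for printable ASCII stored as single bytes and as UTF-16LE.
    """
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    utf16_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)

    found = [match.group().decode("ascii") for match in ascii_run.finditer(data)]
    found += [match.group().decode("utf-16-le") for match in utf16_run.finditer(data)]
    return found


if __name__ == "__main__":
    blob = b"\x00\x01hello world\x02ab\x03" + "secret".encode("utf-16-le") + b"\xff"
    print(extract_strings(blob))
```

Each additional script or encoding adds another way for random bytes to look like text, which is why selecting more languages tends to produce more false positives.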
The benefits of extracting text are:
- Finds text in compressed formats.
- Finds text in file formats that make up their own encoding (e.g., PDF).
Extensibility from Solr
A big design goal of Autopsy 3 has always been extensibility. Solr provides its own methods of extensibility. Some examples include:
- Autopsy currently runs a Solr server on your desktop system, which could get overburdened under heavy loads. In the future, we could allow Autopsy to use a central Solr instance if your environment could benefit from this.
- Extend the text analytics capabilities to obtain better results. We are using the standard library features, but better commercial ones also exist. For example, the text analytics side of Basis has Solr integration options for their linguistics components to get better results with non-English documents.
Start using this free and powerful keyword searching feature today - download Autopsy from sleuthkit.org/autopsy.