But what’s a lemma?
So, if you've read our blog or explored our website, you know that we're big on lemmatization as a way to improve normalization, slim down search indices, and generally deliver a higher-quality multilingual search experience.
The big news is that we've just released a new plugin, Rosette Search Essentials for Elasticsearch, that dramatically simplifies the implementation of Rosette, giving you improved search quality in minutes. Rosette Search Essentials for Elasticsearch includes tokenization, lemmatization, decompounding, POS tagging, and more. And the best part is that Rosette Search Essentials is free for development use. So download it today and see how it can improve your search application.
But back to the idea of lemmatization, and why it matters. In most languages, and especially European languages, it is common for the forms of words to change substantially based on how they are used. Here's a simple example:
This presents a challenge for search applications because they must match the correct form of the word in order to serve up accurate results; this is called normalizing. Typically, search solutions normalize by stemming, a crude method of chopping characters off the end of a word in an attempt to find the root word. There are many stemming algorithms available for Elasticsearch, but you can imagine the problems this technique might produce.
Think about the following English example:
The user searches for the word "suite", but the search engine ends up with a stem of "suit". That search would return numerous false positives completely unrelated to the original word. Not good!
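To make the failure mode concrete, here is a deliberately simplified suffix-stripping "stemmer" in Python. It is a toy for illustration only, not the actual Porter or Snowball algorithm, but it shows how chopping characters off word endings collapses unrelated words onto the same stem:

```python
# Toy suffix-stripping "stemmer" illustrating the over-stemming problem.
# NOT the real Snowball/Porter algorithm; just a simplified sketch.

SUFFIXES = ["es", "s", "e", "ed", "ing"]  # checked in order

def naive_stem(word: str) -> str:
    """Chop a known suffix off the end of a word, if one matches."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least 3 characters left so we don't destroy short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "suite" and "suits" collapse to the same stem, so a search for one
# silently matches documents containing the other:
print(naive_stem("suite"))  # -> "suit"
print(naive_stem("suits"))  # -> "suit"
```

Once two unrelated words share a stem, the index cannot tell them apart, and no amount of query tuning gets the distinction back.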
This problem is actually much more common when you are dealing with European languages.
In German for example, the word for sow (female pig) is “sau” and the word for sour is “sauer”.
The popular Snowball stemmer determines that the root for “sauer” is “sau”, and therefore when you search for “sauer Milch” (sour milk) you could easily end up with results for pig milk (sau Milch).
Umm...bacon flavored milk - think it will catch on?
To get truly accurate results, more advanced morphological analysis needs to be performed, such as finding the "lemma", or dictionary form, of the word; this is called "lemmatization".
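Conceptually, lemmatization is a lookup into linguistic knowledge rather than a character-chopping rule. The sketch below uses a tiny hand-built lexicon to show the idea; real lemmatizers (including Rosette's) use full morphological analysis and part-of-speech context rather than a flat table:

```python
# Conceptual sketch: lemmatization as a lexicon lookup.
# The entries below are a tiny illustrative sample, not a real lexicon.

LEXICON = {
    "suites": "suite",   # regular plural -> singular lemma
    "ran": "run",        # irregular past tense -> base verb
    "better": "good",    # irregular comparative -> base adjective
    "sauer": "sauer",    # German "sour" keeps its own lemma...
    "sau": "sau",        # ...distinct from "sow", unlike a crude stem
}

def lemmatize(word: str) -> str:
    """Return the dictionary form of a word, or the word itself if unknown."""
    word = word.lower()
    return LEXICON.get(word, word)

print(lemmatize("Suites"))  # -> "suite", not the bogus stem "suit"
print(lemmatize("ran"))     # -> "run"
```

Note that a stemmer can never map "ran" to "run" or "better" to "good", because no suffix rule connects them; only a lemma lookup can.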
Rosette® Search Essentials for Elasticsearch brings lemmatization to Elasticsearch users, alongside tokenization, decompounding, part-of-speech tagging, and other linguistic analyses to enable high-quality search in 40 languages. For more details on Rosette's linguistic capabilities, view the Rosette Base Linguistics page.
We hope you’ll explore the improvements that lemmatization can provide to your search application.
Thanks for reading,
The Basis Technology team
We’re happy to announce that Basis Technology will be sponsoring Autopsy® 3 module development contests each semester for current university students! We think this is a great way to expand the functionality of Autopsy and get students involved in the open source development process.
If you are a teacher or student comfortable coding in Java, have a look at the contest page. If you are a teacher and want to use this contest as part of your class projects, then contact us. We're happy to help. Or, even if you don’t need anything from us, let us know and we’ll list you on the site as being part of the contest.
If you are not a student, then stay tuned for a module writing contest as part of the 2014 Open Source Digital Forensics Conference (OSDFCon). See the 2013 challenge page for last year's results and more details.
January - June 2014 Contest
There is a $600 cash prize as well as free admission to the 2014 OSDFCon that will be awarded to the winner, $400 for second place, and $200 for third. In true open source fashion, the final winners will be chosen using crowdsourcing techniques.
The point of these contests is to encourage creative thinking and provide some financial incentive to contribute useful (and free) functionality to the Autopsy community, so the rules are minimal. The major restriction is that participants must be current undergraduate or graduate students and use a .edu email address for the submission to: firstname.lastname@example.org. The deadline for all submissions is June 27, 2014. For additional details about submission and selection criteria, see the contest page.
Also check out the existing modules page on sleuthkit.org, the github.com issues log for ideas, and the developers guide to get started!
As readers of this blog know, Autopsy was designed to be a digital forensics platform that other open source developers can build modules for. To help motivate other developers to write Autopsy modules, Basis Technology created a module development challenge and we’re pleased to announce the winners.
The ground rules were simple:
- Make something useful and creative that can plug into the Autopsy platform, and release it as open source software
- Submit the module before the Open Source Digital Forensics Conference (OSDFCon).
- Present the module to the attendees of OSDFCon in person or via video.
- Profit! (in the form of cash prizes!)
We received two really great submissions. We were impressed by the amount of effort that went into each of them (and note that we did not award a 3rd prize because there were not enough submissions, so you could have won some cash with even a basic module!). These modules have been tested by the Basis team and work with Autopsy 3.0.7 and above, in both 32- and 64-bit versions.
First Prize: $1,500
Author: Willi Ballenthin
Minimum version of Autopsy required: 3.0.7
Description: Willi wrote two modules that support registry analysis. One is an ingest module that detects registry hives and extracts the keys and values into “derived files” of the registry hive. This means that they are shown in the directory tree and you can navigate the registry structure and search its contents.
The second module was a new content viewer (the area in the lower right of Autopsy) that will show the tree of a registry hive and allow you to navigate it after you have selected the hive. If you use only this module, you will not see the registry expanded in the directory tree.
Both of these modules are great additions to the capabilities of Autopsy and provide the user with functions much like Regedit.exe for viewing registry hives.
License of source code: Apache 2
Second Prize: $500
Author: Petter Bjelland
Minimum Autopsy version: 3.0.7
Description: Petter developed a fuzzy hashing module based on sdhash. sdhash allows you to match files that are similar to, but not necessarily exactly the same as, other files. With this ingest module and a new viewer, the investigator can match files against other files or sdhash reference sets during ingest, or search for similar files from the directory viewer or search results after ingest. Petter could not attend OSDFCon and instead submitted a video of the module in use. It is linked below.
In addition to the great contribution to the community with this open source module, Petter also donated his cash prize to the Red Cross to benefit victims of Typhoon Haiyan in the Philippines.
Source URL: https://github.com/pcbje/autopsy-ahbm
Release Download: https://github.com/pcbje/autopsy-ahbm/releases
License: Apache 2.0
Video presentation: http://youtu.be/GBmZRufH_3o
We think these two modules show the power of the platform and the ability for it to change and evolve using the developer’s guide and some creative thinking.
See a list of all third party modules here: http://wiki.sleuthkit.org/index.php?title=Autopsy_3rd_Party_Modules
Congratulations and thanks to both Willi and Petter from the entire Autopsy community!
We’ll be doing this again next year alongside OSDFCon with the same rules, so feel free to start developing your modules now.
The numbers have been collected and data crunched for Basis Tech Week, and we can officially say that each conference was a success! Overall, attendance was up 32% from 2012 to 2013 for all three conferences...and that was with a government furlough occurring in the not-too-distant past. Basis Tech Week also went international this year with attendees flying to Chantilly from 13 countries. In case you missed any of the events, slides are now available online [scroll to the bottom] with videos coming soon. So let's break it down by conference, shall we?
Open Source Digital Forensics
Holy smokes! If someone wanted an indicator of how vibrant the open source digital forensic community is today, the Open Source Digital Forensics conference (now in its 4th year) would be a good place to start. In just one year, OSDF jumped from 204 registrants in 2012 to 416 registered this year, requiring us to move into a bigger ballroom. This year also marked the beginning of Autopsy training classes and a number of new OSDF-related tutorials. A few other highlights include Willi Ballenthin winning first place in the Autopsy plugin contest, and Petter Christian Bjelland generously donating his second place winnings to the Red Cross relief effort in the Philippines. You can download both modules here.
Open Source Search
Open source search is becoming a hot topic today with more and more companies realizing the importance of accurate search in their applications. Government agencies are also realizing the power of open source innovation — illustrated nicely in a recent open source report by GovLoop. As for OSS 2013 itself, we had a number of amazing talks from innovators in the field as well as interesting discussions led by chairwoman Sue Feldman. We also heard the latest developments of IBM's Watson in the healthcare realm by keynote Eric Brown.
Human Language Technology
Today's advanced search queries require text analytics algorithms smart enough to handle what people want to find. Basis Technology's own David Murgatroyd, Gregor Stewart, and Brian Carrier laid the groundwork for how natural language processing, search, and digital forensics can be tied together in revealing various types of information. An example of this is the new release of Highlight, which was highlighted (bad pun intended) in a talk by Nicholas Bemish and Jennifer Flather. Our keynote speakers, Skip McCormick and Doug Naquin, made it clear that Big Data in government has huge potential and requires smart minds to make it all come together. Lastly, Luminoso CEO, Catherine Havasi, and Graham Katz of CACI gave two highly popular talks delving into the future of human language technology with topics such as content-based sentiment analysis and detection of significant societal events using social media outlets like Twitter.
SIGNAL Magazine, an AFCEA publication exploring trends and techniques in defense, intelligence and global security technology, wrote an excellent article discussing how Basis Technology’s HIGHLIGHT software is helping the Intelligence Community (IC) put names in standard spelling so they are easier to find and connect. What struck me was the notion of how important HIGHLIGHT is in supporting collaboration across many departments.
IC name transliteration standards were put in place back in 2003, but it was a tedious and error-prone process to manually match name spelling variations to these new standards. Nicholas Bemish, senior expert for human language technology at the DIA, sums up the situation best in the SIGNAL article, saying: “We had individuals who either were going towards prosecution litigation, trying to get visas to enter the United States or actually having gotten onto the plane, where officials looked [at records] and said, ‘well his name was spelled in the Department of Homeland Security database this way and spelled in the State Department visa registry this way … so how would we have known?’”
Bemish recognized that a technology solution that fit into an analyst's existing workflow was the key to a successful program. He also recognized that by starting with a commercial off-the-shelf set of software tools, and adapting them to his needs, he could deliver a single program with a single source of funding, which would speed the deployment to as many IC analysts as possible. “We have saved money across the intelligence community by reducing those cost burdens to each agency, and we’re all getting the same product and benefit,” he declares.
As a result of this clever strategy of cross-IC licensing and funding, HIGHLIGHT is being widely adopted, enabling efficient collaboration across numerous government agencies. “This way, since we use this system across the 16 intelligence community partnerships worldwide … we’re actually solving that problem [of matching names in multiple databases] today,” Bemish emphasizes.
This type of “top down” licensing model has also facilitated the enhancement of HIGHLIGHT to better address the needs of multiple IC users. HIGHLIGHT is now able to perform entity translation and standardization in five languages (Arabic, Mandarin, Dari, Farsi and Pashto), with Korean and Russian language support available in the near future. Discussions are also underway to extend HIGHLIGHT to provide further assistance for entity analytics and resolution throughout the full intelligence production cycle on any platform or “the cloud”.
We're pleased that our partnership with The Office of the Director of National Intelligence (ODNI) and the Defense Intelligence Agency (DIA) has allowed Basis Technology to play a small, yet vital, role in the global fight against potential threats.
If you are an IC user and would like access to HIGHLIGHT for your department, be sure to join us at the HIGHLIGHT Program Managers Meeting to find out more details. You can also find more information about HIGHLIGHT at http://www.basistech.com/highlight.
The Open Source Search conference (part of Basis Tech Week) is less than a month away, and we’re looking forward to some fascinating presentations from the big players in the space. This year, we're excited to have Sue Feldman — formerly of IDC and now CEO of Synthexis — as the conference chairwoman. She will be talking about open source search within the larger context of building an overall information strategy and developing the infrastructure to support it.
To support this theme, Sue has fielded a survey to help us all learn more about the use of search and text analytics software in organizations. Sue will share the preliminary results of this survey during the conference, and each participant will have the opportunity to receive a written summary of the results when the data has been compiled.
Please follow this link to the Synthexis Information Access Survey and we hope you will join us at OSS this year.
Basis Technology is offering an Autopsy training class directly following the Open Source Digital Forensics Conference (http://OSDFCon.org) in Chantilly, VA. This is part of an initial push to establish the premier training program for this powerful open source digital forensics tool.
For those who have already downloaded and tried Autopsy (and there are over 13,000 people who have done so for the current version) or those who have seen us present at conferences, you may be wondering why you need the training. After all, we’ve designed Autopsy to be easy to use and intuitive out of the box. The reason you want to attend this training is that you’ll learn about what is going on under the covers. Anyone can press buttons, but to testify about a tool, you need to understand what happens when you press the button.
You know you’ll get all of the details because it will be taught by a combination of engineers and examiners. The current plan is to combine the use case experience from one of our examiners with the implementation details from one of our engineers (likely Brian Carrier).
The course spans two days and includes both lectures and hands-on examples. Computers are provided with the training. During the class, we’ll cover:
- Autopsy set-up and overview
- In-depth coverage of each of the ingest modules, including:
- Keyword Search
- Hash Lookup
- Recent User Activity (registry, web, etc.)
- Archive Extractor
- Views and how to use them
- Tagging and reporting
- Hands-on case study with tutorials and problem sets
- Harnessing automation and workflow features
Register for the training here.
The current incarnation of the Autopsy training class does not provide a certification. We understand the benefits of being certified to use a tool and are working toward defining certification criteria so that it is viewed as a respectable level of achievement. If you are interested in getting involved with this process after you attend the class, let us know.
Training for Developers
One of the key things that we talk about with Autopsy is its extensibility and how we designed it to be a platform for others to build modules for. We know we can’t solve everyone’s problems. This 2-day course is designed for examiners who are going to use the tool. We didn’t forget about the developers though.
We have a half-day workshop at OSDFCon about developing modules for Autopsy. Register soon for that event if you want to learn how to integrate your existing tool (or a new tool) into Autopsy. This will enable you to reach a bigger audience without having to worry about disk images and file systems.
Training for Trainers
At this point, we are doing all of the training ourselves. If you are interested in us coming to your site to conduct training, let us know. We will schedule more events for 2014 and feedback on where we have significant interest is important.
If you teach at a University and want to involve Autopsy in your curriculum, let us know. We are starting to provide some outreach on this topic and can facilitate sharing of curriculum materials between educators. Contact us if you are using Autopsy and want to share your resources with other educators.
Register for the training and get some more details, by clicking here.
Keyword searching is a common and widely used technique across all varieties of digital investigations. On the surface, it seems fairly straightforward: figure out what names, places, things, activities, applications, etc. you want to search for, and perform the search. Simple, right? But what about misspellings? Or patterns of text versus an exact match? Many commercial digital forensics tools help address these considerations, and so does Autopsy, but it does so for free. Autopsy uses the open source Apache Solr and Tika libraries for fast and efficient text indexing and searching.
There are two basic times when you can keyword search in Autopsy.
- Keyword lists can be used during ingest when the files are being added to the case.
- Individual ad-hoc queries can be performed during the course of the analysis.
These two methods can actually happen at the same time during an investigation because you can start your analysis while the ingest process continues to process files. If you find keywords that you want to add to your search, they can be added to the ingest list and the search will occur in the background. This means you don’t have to wait for your tool to finish processing before starting your investigation. Why wait? Investigate!
Ingest modules and Near Real-time results
When adding a disk image to a case, the user chooses the ingest modules to run on the image. One of the standard modules is Keyword Search, which extracts the text from files and adds it to a Solr index. Periodically (every 5 minutes by default), it queries the index for a list of keywords that the user configures.
The user can search for plain text strings (like “Jesse James”) or patterns (regular expressions, or REGEX). Autopsy comes with a list of predefined regular expressions to find phone numbers, email addresses, IP addresses, and URLs. A common strategy is to load up the keyword lists with common misspellings of important words, or to develop REGEX that will find close matches to words or patterns of interest.
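To give a feel for what such patterns look like, here are simplified Python versions of the kinds of expressions described above. These are illustrative only; Autopsy ships its own (more thorough) predefined expressions:

```python
import re

# Simplified keyword-search patterns of the kind described above.
# Illustrative only: Autopsy's bundled expressions are more thorough.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4":  re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),  # US-style numbers
}

text = "Contact jane@example.com from 10.0.0.5, call 555-867-5309."

for name, pattern in PATTERNS.items():
    print(name, pattern.findall(text))
```

A misspelling-tolerant list works the same way: a pattern like `r"jess?e"` matches both "jesse" and "jese" in one pass over the index.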
Autopsy’s ingest modules run in the background and make every effort to stay out of the way during the investigation, but things like keyword hits and other artifacts could be really helpful and interesting to an investigator to know about as they are discovered during the file analysis process. Autopsy publishes the keyword hits it finds in two places:
- The evidence tree on the left-hand side of the main UI
- The ingest inbox, which has an icon in the top right of the main UI
The motivation for the ingest inbox is that it gives you a chronological perspective on what has been found. If you are focused on the user’s web activity, you may not notice that the hit count for a specific keyword went from 4 to 5. The ingest inbox, though, will tell you what has been found since you last opened it. The goal is to notify (but not annoy) an investigator that new evidence items of interest have been found by the background analysis tasks.
Ad-hoc Searching and Using an Index
You can never know all of the keywords you will care about when you start a case. Autopsy builds a text index (using Apache Solr) of the text on the drive so that later searches are very fast. You can think of a text index like the index in a book: it maps words directly to the pages and locations where they appear. For instance, to find every mention of “digital forensics” in a textbook, you could page through the book one page at a time and highlight the phrase wherever you found it, or you could look in the index and jump directly to the pages it appears on. Autopsy works like the second option when doing keyword searching, which means you get your results fast.
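The book-index analogy maps directly onto the "inverted index" data structure that Solr uses under the hood. A minimal sketch (Solr's real index is far more sophisticated, with positions, scoring, and on-disk structures, but the principle is the same):

```python
from collections import defaultdict

# Minimal inverted index: maps each word to the set of documents
# (here, "files") containing it, so a lookup never re-scans every file.

def build_index(docs: dict) -> dict:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    "report.txt": "digital forensics of the disk image",
    "notes.txt":  "lunch plans and meeting notes",
    "email.txt":  "forensics results attached",
}

index = build_index(docs)
print(sorted(index["forensics"]))  # -> ['email.txt', 'report.txt']
```

Building the index costs time up front (during ingest), but every later search is a dictionary lookup instead of a scan of the whole drive, which is why ad-hoc queries return almost instantly.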
Ad-hoc queries are made through the search bar in the top right of the Autopsy interface. It accepts both REGEX patterns and plain search strings. You can also run new lists that you’ve created and loaded via the configuration options from this area of the interface. Each independent search opens a new tab in the results viewer panel, which means you can run multiple searches in parallel and review them independently.
There are two general methods for performing keyword searches in digital forensics:
- By interpreting the file types, extracting the text, converting it to Unicode (if needed), and matching it against the list of keywords.
- By coming up with all possible byte sequences of the keywords in the possible encodings and looking for all of those byte sequences at the lowest levels of the drive data.
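The second method's combinatorics are easy to see in code: the same keyword has a different byte pattern in each text encoding, and a low-level search over raw drive data must look for all of them. A quick illustration:

```python
# The same keyword as raw bytes under different encodings: a low-level
# search must match every one of these patterns against the drive data.

keyword = "evidence"

for encoding in ("ascii", "utf-16-le", "utf-16-be"):
    print(encoding, keyword.encode(encoding).hex(" "))
```

In UTF-16, each ASCII character is padded with a zero byte (before or after it, depending on byte order), so a plain ASCII scan would miss it entirely; this is exactly the kind of variation the byte-sequence method must enumerate.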
Autopsy extracts the text using Apache Tika and some other open source libraries. For files in formats we don’t support, and for unallocated space, we extract the strings. In the Tools -> Options area of Autopsy, you can configure which languages you want to extract strings for. The more languages you select, the more false positive data you will see.
The benefits of extracting text are:
- Finds text in compressed formats.
- Finds text in file formats that use their own encoding (e.g., PDF)
Extensibility from Solr
A big design goal of Autopsy 3 has always been extensibility. Solr provides its own methods of extensibility. Some examples include:
- Autopsy currently runs a Solr server on your desktop system, which could get overburdened under heavy loads. In the future, we could allow Autopsy to use a central Solr instance if your environment could benefit from this.
- Extend the text analytics capabilities to obtain better results. We are using the standard library features, but better commercial ones also exist. For example, the text analytics side of Basis has Solr integration options for its linguistics components to get better results with non-English documents.
Start using this free and powerful keyword searching feature today - download Autopsy from sleuthkit.org/autopsy.
Over the years Basis Technology has developed partnerships with a wide variety of companies.
These collaborations have brought the advanced multilingual search and text analytics capabilities of Rosette, our enterprise linguistics platform, to e-discovery, government intelligence, financial compliance, and social media monitoring. Recently, we’ve had an incredible push for open source search, and this summer we are pleased to announce a new partnership with the powerful open source search engine, Elasticsearch.
Founded in 2012 as the commercial offshoot of the Elasticsearch.org open source project, Elasticsearch.com provides commercial services in order to bring better, faster, and stronger search and data exploration capabilities to the world. Built on the Apache Lucene information retrieval library, Elasticsearch has remained true to its open source roots. The company’s advanced search and analytics engine is free to download on their website and they offer several paid production/development support and training packages to complement and enhance the open source product. With its flexible, distributed, and highly scalable search and analytics platform and attentive, personalized customer service, Elasticsearch has quickly become one of the leading search engines in the world. Their customers include GitHub, Foursquare, Klout, SoundCloud, and StumbleUpon among many others.
The folks at Elasticsearch like to say that they’re obsessed with data, and by joining forces with Basis Technology, Elasticsearch users can now explore and critically analyze data in over forty languages. Basis’s Rosette Base Linguistics (RBL) platform is an advanced multilingual analytics toolset equipped to handle the complex linguistic challenges of European, Asian, and Middle Eastern languages. RBL provides highly sophisticated morphological analytic functionality, including tokenization, lemmatization, decompounding, and part-of-speech (POS) tagging. The new Rosette Elasticsearch Plugin allows users to quickly and easily knit together RBL’s multilingual analysis capabilities with the powerful Elasticsearch search engine. Additionally, the Elasticsearch connector has been released as open source to GitHub. While users will still need to have a Rosette license to harness the combined power of Basis and Elasticsearch, the open source nature of the code will allow the greater community to explore and adapt the technology to support new and currently unrealized potential.
For Basis, this partnership is particularly timely, given our upcoming Open Source Search (OSS) Conference on November 6th, part of our annual Basis Tech Week conference series. The OSS conference will feature a talk by Elasticsearch engineer Martijn van Groningen entitled "Optimizing Document Relations with Elasticsearch". Basis Tech Week 2013 will take place from Nov 4-7 at the Westfields Marriott in Chantilly, VA, and is free for government employees. In addition to open source search, the event also features day-long conferences on open source digital forensics technology and human language technology. To learn more, and to register, visit http://www.basistechweek.com/.
Andrew Paulsen, Basis Technology Regional Sales Director, was recently interviewed by Mark Bennet on Searchhub.org. In the interview, Andrew succinctly answers many of the common questions that our prospective customers ask about how our software complements and improves open source search engines. Here's an excerpt!
To sum up our value proposition in relation to open source linguistics: we provide higher quality, more in-depth features, a wider breadth of language coverage, and better performance and reliability. And as you know, software engineers are expensive these days, especially search engineers with an NLP background. Companies can actually save money and increase development productivity by licensing a commercial-ready NLP platform instead of having these well-paid engineers implement and test various linguistic modules from around the world with varying levels of quality and performance.