Segment Anything Model (SAM): Universal Meta AI Tool for Object Recognition and Its Diverse Applications

Segment Anything Model (SAM) is a photo and video segmentation tool recently released by Meta that lets you cut any object out of an image with a single click. In other words, it can recognize any object much like a human would.
That said, the tool's potential reaches far beyond the creative industries; it could be used for anything from space research to military applications.
So, could Meta’s SAM be the first foundation model for image segmentation?

Reading Time: 3 minutes


Illustration: Segment Anything and MilicaM

What is the Segment Anything Model (SAM)?

Before we go any further, we first need to explain what segmentation is, since the term is crucial to understanding SAM.

Put simply, segmentation identifies which image pixels belong to an object; it is one of the core tasks of computer vision. If you have ever edited a photo to, say, reduce spots on the skin, you were probably using segmentation. Of course, its applications go much further, for example in analyzing scientific imagery.
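To make the idea concrete, here is a minimal sketch (plain NumPy, not SAM itself) of what a segmentation mask is: a per-pixel yes/no label with the same height and width as the image, which is exactly what lets you "cut out" an object.

```python
import numpy as np

# A toy 4x4 "image" and a binary segmentation mask of the same shape.
# mask[y, x] == True means pixel (x, y) belongs to the object.
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)  # HWC, RGB
mask = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
], dtype=bool)

# "Cutting out" the object: keep object pixels, zero out the background.
cutout = image * mask[..., None]  # broadcast the mask over color channels
print(cutout.shape)  # (4, 4, 3)
```

Models like SAM do the hard part: producing that mask automatically for real images.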

According to the Segment Anything blog, there has been no comprehensive model for this task until now: accurate segmentation required highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data.

At the beginning of April 2023, Meta AI Research introduced the Segment Anything Model (SAM) along with the corresponding SA-1B dataset of 1.1 billion high-quality segmentation masks spanning 11 million images.

The project's main goal is to build a promptable foundation model for image segmentation and, in their words, to democratize segmentation by enabling a broad set of applications.

Here’s what the project team itself stated about the project: 


The Segment Anything project is an attempt to lift image segmentation into the era of foundation models. Our principal contributions are a new task (promptable segmentation), model (SAM), and dataset (SA-1B) that make this leap possible.

Source: Whitepaper, page 12 

In other words: 

  1. SAM allows users to segment objects with just a click, or by interactively clicking points to include in and exclude from the object (see the sketch after this list). 
  2. SAM can output multiple valid masks when the prompt is ambiguous about which object to segment. 
  3. SAM can automatically find and mask all objects in an image. 
  4. SAM can generate a segmentation mask for any prompt in real time, allowing for interactive use of the model. 
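As a concrete illustration of the click-based prompting, here is a minimal sketch using the open-source segment_anything package released with the project. The checkpoint file name is the one published in the GitHub repository; the image path and click coordinates are placeholders.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint ("vit_h" is the largest; "vit_l" and "vit_b" also exist).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image; OpenCV loads BGR, so convert.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy image encoder once

# One foreground click (label 1) and one background click (label 0).
point_coords = np.array([[500, 375], [300, 200]])
point_labels = np.array([1, 0])

# multimask_output=True returns several candidate masks for ambiguous prompts.
masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
print(masks.shape)  # (3, H, W) boolean masks, one per candidate
```

Returning several candidate masks instead of one is how the model handles ambiguity: a click on a shirt could mean the shirt, the person, or the whole scene, and the scores let you pick.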

How does SAM work? 

Interestingly, the training process bootstrapped its own annotation. SAM was built with a model-in-the-loop "data engine": human annotators used early versions of the model to label masks interactively, the newly collected data was used to retrain and improve the model, and each stage of annotation became more automated than the last.

In the example images below, pictures are grouped by the number of masks per image (there are ∼100 masks per image on average). By the final stage, annotation was almost entirely automatic: 99.1% of the masks in the released dataset were generated fully automatically, with human evaluation verifying their quality.
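That fully automatic mode is exposed in the released package as well. A minimal sketch, with a placeholder image path:

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# "vit_b" is the smallest released checkpoint and the quickest to try.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry is a dict holding the mask plus quality metadata.
print(len(masks))                      # number of objects found
print(masks[0]["segmentation"].shape)  # (H, W) boolean array
print(masks[0]["predicted_iou"])       # the model's own quality estimate
```

Under the hood, the generator prompts the model with a grid of points and filters the resulting masks by quality, which is essentially how the automatic portion of the dataset was produced.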

This mask dataset is available for research purposes, and SAM itself is released under a permissive open license (Apache 2.0).
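If you want to work with the dataset directly, the masks are stored in COCO run-length encoding (RLE) and can be decoded with pycocotools, as the dataset's documentation describes. A rough sketch; the file name is a placeholder:

```python
import json
from pycocotools import mask as mask_utils

# SA-1B ships one JSON annotation file per image; the field names below
# follow the dataset's documented format.
with open("sa_000001.json") as f:
    record = json.load(f)

for ann in record["annotations"]:
    m = mask_utils.decode(ann["segmentation"])  # (H, W) array of 0/1
    print(m.shape, m.sum())  # mask size and area in pixels
```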


Example images with overlaid masks. Source: SAM Whitepaper 

As mentioned, Meta has released a dataset of 11 million images and, as the official whitepaper claims, all of them are licensed, high-resolution, and privacy-protecting (faces and license plates have been blurred).

This is a notable departure from common practice in AI training data, considering that MidJourney's founder David Holz admitted to having used millions of images for training without the artists' consent, claiming there's no way to trace a photo back to its owner.

So, if Meta really did manage to find a way to do so, it could truly set a new industry standard.

Namely, they worked with a provider that works directly with photographers.

It's also important to emphasize that the average image size is 3300×4950 pixels, which can be heavy on storage, so they also provide a downsampled version in which the shortest side is 1,500 pixels. Even at the reduced resolution, these images are still larger than those in existing datasets such as COCO, whose images are ∼480×640 pixels.
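If you want to apply the same kind of downsampling to your own images, here is a simple sketch using Pillow; the file name is a placeholder:

```python
from PIL import Image

TARGET_SHORT_SIDE = 1500  # matches the released version of SA-1B

img = Image.open("photo.jpg")
scale = TARGET_SHORT_SIDE / min(img.size)  # img.size is (width, height)
if scale < 1:  # only shrink, never upscale
    new_size = (round(img.width * scale), round(img.height * scale))
    img = img.resize(new_size, Image.LANCZOS)
print(img.size)
```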

And finally, a word about speed: SAM is designed for seamless, real-time interactive prompting. Or, more precisely, "the prompt encoder and mask decoder run in a web browser, on CPU, in ∼50ms."
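That speed comes from the model's split design: the heavy image encoder runs once per image, after which each prompt only touches the lightweight prompt encoder and mask decoder. You can get a feel for this on your own machine with a rough timing sketch (checkpoint and image path are placeholders; local numbers will differ from the browser figure):

```python
import time
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

predictor.set_image(image)  # the slow, once-per-image encoding step

# Each subsequent prompt is cheap, which is what makes real-time
# interactive prompting possible in the web demo.
start = time.perf_counter()
predictor.predict(
    point_coords=np.array([[100, 100]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(f"per-prompt latency: {(time.perf_counter() - start) * 1000:.1f} ms")
```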

You can also try the SAM demo with your own images. 

What will SAM be used for? 

Another impressive takeaway about the Segment Anything project is that SAM can be used on new images without requiring additional training.


SAM has learned a general notion of what objects are, and it can generate masks for any object in any image or any video, even including objects and image types that it had not encountered during training.

In the beginning, SAM will probably be most represented in these areas: 

  • Picture and video editing; 
  • Design (including interior design). 

But really, SAM could find a use in any field that requires finding and segmenting objects in images. That can be something as simple as understanding both the visual and textual content of a webpage.

If we're talking about more impressive use cases, think of AR/VR: a user could select an object simply by gazing at it and then "lift" it into 3D. Isn't that a superpower we've all been dreaming of?

Why not go even further: SAM could support the scientific study of natural phenomena on Earth or even in space, for example by localizing animals or objects to study and track in video.

SAM Limitations 

In the cited whitepaper, the project team also points out the model's limitations observed so far.


While SAM performs well in general, it is not perfect. It can miss fine structures, hallucinates small, disconnected components at times, and does not produce boundaries as crisply as more computationally intensive methods that “zoom-in”.

Source: Whitepaper, page 12 

In addition, they emphasized that the model is designed for generality and breadth of use rather than any single task. They also noted that the text-to-mask capability is still exploratory, that runtime performance can be further improved, and that it is not yet clear how to design simple prompts that implement semantic and panoptic segmentation.

If you are interested in finding out more about the Segment Anything project, we encourage you to check the official website and join their Discord channel. 

A journalist by day and a podcaster by night. She's not writing to impress but to be understood.
