Modeling Extremely Large Images with xT
By Ritwik Gupta, Shufan Li, Tyler Zhu, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam
As computer vision researchers, we believe that every pixel can tell a story. However, there seems to be a writer’s block settling into the field when it comes to dealing with large images. Large images are no longer rare: the cameras we carry in our pockets and those orbiting our planet snap pictures so big and detailed that they stretch our current best models and hardware to their breaking points. Generally, we face a quadratic increase in memory usage as a function of image size.
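To put rough numbers on that, here is a back-of-the-envelope calculation. It is a minimal sketch assuming a ViT-style tokenizer with 16-pixel patches and a dense float32 attention map per head; both assumptions are illustrative choices, not settings from our work.

```python
# Back-of-the-envelope: why attention memory explodes with image size.
# Illustrative assumptions: a ViT-style tokenizer with 16x16 patches and
# a dense float32 attention map for a single head.
PATCH = 16
BYTES_PER_FLOAT = 4

for side in (256, 1024, 4096):
    n_tokens = (side // PATCH) ** 2               # tokens grow with image area
    attn_bytes = BYTES_PER_FLOAT * n_tokens ** 2  # dense n x n attention map
    print(f"{side:>4}x{side:<4} -> {n_tokens:>6,} tokens, "
          f"{attn_bytes / 2**30:8.4f} GiB per attention map")
```

Doubling the image side quadruples the token count and grows each attention map sixteenfold; under these toy assumptions, a single 4,096-pixel-wide image already demands roughly 16 GiB for one attention map.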
Today, we make one of two sub-optimal choices when handling large images: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. We take another look at these approaches and introduce $x$T, a new framework to model large images end-to-end on contemporary GPUs while effectively aggregating global context with local details.
Why Bother with Big Images Anyway?
Why bother handling large images anyway? Picture yourself in front of your TV, watching your favorite football team. The field is dotted with players, but the action happens on only a small portion of the screen at a time. Would you be satisfied, however, if you could only see a small region around where the ball currently was? Alternatively, would you be satisfied watching the game in low resolution? Every pixel tells a story, no matter how far apart the pixels are. This is true in all domains, from your TV screen to a pathologist viewing a gigapixel slide to diagnose tiny patches of cancer. These images are treasure troves of information. If we can’t fully explore the wealth because our tools can’t handle the map, what’s the point?
That’s precisely where the frustration lies today. The bigger the image, the more we need to simultaneously zoom out to see the whole picture and zoom in for the nitty-gritty details, making it a challenge to grasp both the forest and the trees at once. Most current methods force a choice between losing sight of the forest or missing the trees, and neither option is great.
How $x$T Tries to Fix This
Imagine trying to solve a massive jigsaw puzzle. Instead of tackling the whole thing at once, which would be overwhelming, you start with smaller sections, get a good look at each piece, and then figure out how they fit into the bigger picture. That’s essentially what $x$T does with large images: it nests tokenization, splitting the image into independent regions, encoding each region on its own, and then letting a long-sequence context encoder stitch the regional features back into a globally coherent picture.
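The sketch below makes that analogy concrete in PyTorch. It is a minimal toy, not our released implementation: `TinyRegionEncoder` and `TinyContextEncoder` are hypothetical stand-ins for the hierarchical vision backbone and long-sequence context encoder described in the paper, and the sizes are chosen only to keep the example small.

```python
import torch
import torch.nn as nn

def nested_tokenize(image: torch.Tensor, region_size: int) -> torch.Tensor:
    """Split a (B, C, H, W) image into non-overlapping (C, r, r) regions."""
    b, c, _, _ = image.shape
    r = region_size
    regions = image.unfold(2, r, r).unfold(3, r, r)  # (B, C, H/r, W/r, r, r)
    regions = regions.permute(0, 2, 3, 1, 4, 5)      # (B, H/r, W/r, C, r, r)
    return regions.reshape(b, -1, c, r, r)           # (B, R, C, r, r)

class TinyRegionEncoder(nn.Module):
    """Stand-in for a hierarchical backbone: one feature vector per region."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify the region
            nn.AdaptiveAvgPool2d(1),                       # pool to one token
            nn.Flatten(),                                  # (N, dim)
        )

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        return self.net(regions)

class TinyContextEncoder(nn.Module):
    """Stand-in for a long-sequence model mixing context across regions."""
    def __init__(self, dim: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.encoder(tokens)  # (B, R, dim)

# Work the puzzle: look at each piece locally, then fit the pieces together.
image = torch.randn(1, 3, 1024, 1024)              # a "large" toy image
regions = nested_tokenize(image, region_size=256)  # (1, 16, 3, 256, 256)

dim = 128
region_enc, context_enc = TinyRegionEncoder(dim), TinyContextEncoder(dim)

b, n = regions.shape[:2]
tokens = region_enc(regions.reshape(b * n, 3, 256, 256)).reshape(b, n, dim)
features = context_enc(tokens)                     # (1, 16, 128): global context
```

Because the regions are encoded independently, only one region’s activations need to fit on the GPU at a time, while the lightweight sequence model over region features restores the global context that cropping would have thrown away.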
In Conclusion
For a complete treatment of this work, please check out the paper on arXiv. The project page contains a link to our released code and weights. If you find the work useful, please cite it as follows:
@article{xTLargeImageModeling,
  title={xT: Nested Tokenization for Larger Context in Large Images},
  author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
  journal={arXiv preprint arXiv:2403.01915},
  year={2024}
}