To sum up: A screen reader is going to take longer to describe even a very basic web site than a sighted person takes to view it. Decorative images are rarely announced as "decorative image," so a person using a screen reader may not know it's something they can skip. Current standards are to either leave the caption (the alt text) empty or simply label the image as "decorative." HTML can also mark an image as presentational (for example, with role="presentation") so a screen reader knows to skip it, but this isn't widely used. I think all images should have captions and should be able to be marked as decorative, and screen readers should have a setting for users to enable/disable skipping them. A couple of people I know have mentioned enjoying a well-captioned decorative image. The overwhelming opinion, though, is, "describe & read this site as fast and coherently as possible."
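To make the "caption everything, mark it decorative, let the user choose" idea concrete, here's a rough sketch in Python. It is not how any real screen reader works. The `ImageAnnouncer` class, the `announce` helper, the `skip_decorative` setting, and the sample page are all made up for illustration. The sketch assumes an image counts as decorative when it has role="presentation" (or role="none") or an empty alt attribute.

```python
from html.parser import HTMLParser


class ImageAnnouncer(HTMLParser):
    """Collects what a hypothetical screen reader would say for each <img>."""

    def __init__(self, skip_decorative=True):
        super().__init__()
        self.skip_decorative = skip_decorative  # hypothetical user setting
        self.announcements = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = (attributes.get("alt") or "").strip()
        # Assumption: decorative means role="presentation"/"none" or an empty alt.
        decorative = attributes.get("role") in ("presentation", "none") or (
            "alt" in attributes and alt == ""
        )
        if decorative and self.skip_decorative:
            return  # the user asked to skip decorative images entirely
        if decorative:
            self.announcements.append(
                f"decorative image: {alt}" if alt else "decorative image"
            )
        else:
            self.announcements.append(
                f"image: {alt}" if alt else "image, no description"
            )


def announce(html, skip_decorative=True):
    """Return the list of image announcements for a chunk of HTML."""
    parser = ImageAnnouncer(skip_decorative)
    parser.feed(html)
    parser.close()
    return parser.announcements


# Made-up sample page: one meaningful image, two decorative ones.
PAGE = """
<img src="chart.png" alt="Sales rose 10% in March">
<img src="swirl.png" alt="A gold swirl divider" role="presentation">
<img src="spacer.gif" alt="">
"""

print(announce(PAGE))
# ['image: Sales rose 10% in March']
print(announce(PAGE, skip_decorative=False))
# ['image: Sales rose 10% in March', 'decorative image: A gold swirl divider', 'decorative image']
```

With skipping turned on, the user hears only the chart. With it turned off, they get the captioned swirl too, but it is announced as decorative so they know nothing important is riding on it.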
Re: Auto-play video transcripts. This is mostly in reference to news-type sites that shove a video in your face. Most of the time, they leave someone out. They transcribe the videos, and a person can activate the transcription, but the actual videos frequently don't offer a closed-caption feature. (Fine for users with a sight disability, but not for deaf users.) Sometimes, though, they have open captions. But since those are part of the video itself, a screen reader can't read them. (Fine for deaf users, but not for users with a sight disability.)
All of that said, many users have conditions that cause side effects when things on a web page animate without their interaction, so auto-play video is generally not considered accessible on any page.
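For anyone who wants to check a page for the two video problems above, here's a minimal sketch in Python. The `VideoAudit` class, the `audit` helper, and the sample markup are invented for illustration. It assumes closed captions are supplied through a `<track kind="captions">` element inside the `<video>`, which is the standard HTML mechanism, though sites using custom players may do it differently.

```python
from html.parser import HTMLParser


class VideoAudit(HTMLParser):
    """Records, for each <video>, whether it auto-plays and has a captions track."""

    def __init__(self):
        super().__init__()
        self.videos = []
        self._current = None  # the <video> we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "video":
            self._current = {
                "src": attributes.get("src"),
                "autoplay": "autoplay" in attributes,
                "closed_captions": False,
            }
            self.videos.append(self._current)
        elif tag == "track" and self._current is not None:
            # Assumption: only kind="captions" counts as closed captions.
            if attributes.get("kind") == "captions":
                self._current["closed_captions"] = True

    def handle_endtag(self, tag):
        if tag == "video":
            self._current = None


def audit(html):
    """Return (src, list of issues) for every <video> in the HTML."""
    parser = VideoAudit()
    parser.feed(html)
    parser.close()
    report = []
    for video in parser.videos:
        issues = []
        if video["autoplay"]:
            issues.append("plays automatically")
        if not video["closed_captions"]:
            issues.append("no closed-caption track")
        report.append((video["src"], issues))
    return report


# Made-up sample markup.
PAGE = """
<video src="news.mp4" autoplay muted></video>
<video src="interview.mp4" controls>
  <track kind="captions" src="interview.en.vtt" srclang="en" label="English">
</video>
"""

print(audit(PAGE))
# [('news.mp4', ['plays automatically', 'no closed-caption track']), ('interview.mp4', [])]
```

Note what this can't see: open captions baked into the video frames and transcripts posted elsewhere on the page are invisible to a check like this, for the same reason a screen reader can't tell that open captions are there.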