Regex: Is there something else I should be using to achieve what I want?

Website building 770 | Updated: 2025-06-08 21:40:56

Based on many examples I found while searching, I have created a regex that I use (as a fallback) to parse direct file links from HTML source:

/((?:(?:https?%3A%2F%2F)(?:www\.)?(?:\S+)%2F|(?:https?:\/\/)(?:www\.)?(?:\S+)\/)(?:.*)?\.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg)(?=[^.]*$))/igm

My problem is that it fails on lines that contain more than one link. I know that parsing HTML with a regex, even as a fallback, is frowned upon, so what else could I use to find ALL direct file links in a page's source? (This means links hidden in inline JavaScript, video source tags and the like, not just what document.links returns.)

If there are no better suggestions, can someone help me fix the regex to achieve what I want?
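The failure described above can be reproduced with a short script. This is only a sketch: the regex is copied unchanged from the question, the input line is one of the question's own examples, and the variable names are made up for illustration. Because of the greedy `(?:.*)?` and the end-of-line lookahead `(?=[^.]*$)`, the two links on the line are swallowed as a single match.

```javascript
// The regex from the question, unchanged.
const original = /((?:(?:https?%3A%2F%2F)(?:www\.)?(?:\S+)%2F|(?:https?:\/\/)(?:www\.)?(?:\S+)\/)(?:.*)?\.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg)(?=[^.]*$))/igm;

// One line containing two links (taken from the examples in the question).
const twoLinks = "http://test.com/t est.mkv__some__random__string__http://test.com/test.mkv";

// Collect group 1 (the URL) of every match on the line.
const found = [...twoLinks.matchAll(original)].map((m) => m[1]);

console.log(found.length);          // 1: only one match instead of two
console.log(found[0] === twoLinks); // true: the match spans the whole line
```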

The regex should follow these guidelines:

- Return the URL in group one and the file extension in group two
- Find both encoded and decoded URLs
- Find specific file extensions (namely video and audio)
- Tolerate multi-level file extensions
- Tolerate spaces in the URL
- Allow any domain, both secure and non-secure, with or without "www" for the HTTP scheme
- Find all URLs regardless of their location in the HTML source
- Be compatible with JavaScript

Some examples that should be matched:

http://test.com/test.mkv
http://test.com/test/test.jpg.mkv
https://test.com/test.mkv?test=test
http%3A%2F%2Ftest.com%2Ftest.mkv%3Ftest%3Dtest
https%3A%2F%2Ftest.com%2Ftest.jpg.mkv%3Ftest%3Dtest.mkv
http://test.com/t est.mkv__some__random__string__http://test.com/test.mkv

The last example should match the two URLs, but not the __some__random__string__.

Some examples that should not be matched:

http://test.com/test.mkv.jpg
http://test.com/test.mkv/test.jpg
https://test.com/test.mkv.jpg?test=test.mkv
http%3A%2F%2Ftest.com%2Ftest.mkv.jpg
https%3A%2F%2Ftest.com%2Ftest.mkv.jpg%3Ftest%3Dtest.mkv
http://test.com/t est.mkv__some__random__string__http://test.com/test.mkv.jpg

The last example should match only the first URL, before __some__random__string__.

You can play with the regex and an example of some HTML source that partially fails at: http://regexr.com/3dbac

Best answer

Well, if we take into account only the sample you provided here, you might leverage a tempered greedy token (TGT) to negate the extensions you need to match:

/((?:https?(?:%3A%2F%2F|:\/\/))(?:www\.)?(?:\S+)(?:%2F|\/)(?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])*\.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))(?!\/|\.[a-z]{1,3})/

See the regex demo
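The pattern above is shown without flags. A minimal sketch of using it from JavaScript to collect every link in a string follows; the `g` and `i` flags and the helper name `findMediaLinks` are assumptions added for illustration and are not part of the answer. Note that `[^\/]` also matches line breaks, so it may be safer to feed the helper one line at a time.

```javascript
// The pattern from the answer, with the g and i flags added (an assumption:
// the answer shows it without flags).
const mediaLink = /((?:https?(?:%3A%2F%2F|:\/\/))(?:www\.)?(?:\S+)(?:%2F|\/)(?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])*\.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))(?!\/|\.[a-z]{1,3})/gi;

// Hypothetical helper: returns { url, ext } for every link found in `text`.
// Group 1 is the URL and group 2 is the extension, as the question requires.
function findMediaLinks(text) {
  return Array.from(text.matchAll(mediaLink), (m) => ({ url: m[1], ext: m[2] }));
}

// Two links on one line, separated by arbitrary text.
const sample = "http://test.com/t est.mkv__some__random__string__http://test.com/test.mkv";
console.log(findMediaLinks(sample).map((link) => link.url));
// [ 'http://test.com/t est.mkv', 'http://test.com/test.mkv' ]

// A wanted extension followed by another extension is rejected.
console.log(findMediaLinks("http://test.com/test.mkv.jpg").length); // 0
```

For the samples in the question, the matched URL stops at the extension, so a query string such as `?test=test` is not included in group 1.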

The pattern breakdown:

(                                                            # Group 1 matching the whole URL
  (?:https?(?:%3A%2F%2F|:\/\/))(?:www\.)?(?:\S+)(?:%2F|\/)   # Matching URL part with no spaces up to the last /
  (?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])*   # TGT matching up to the extension
  \.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg)  # The extension
)
(?!\/|\.[a-z]{1,3})                                          # Only if not followed with /, or another extension

The (?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])* TGT matches any character other than / that is not the first character of a .mp4, .mkv, etc. literal character sequence (the negative lookahead fails the match if its pattern matches the text to the right of the current location in the string).
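JavaScript regex literals have no free-spacing mode, so the commented breakdown cannot be pasted in as written. One possible way to keep the pieces readable, and to avoid typing the extension list twice, is to assemble the pattern from strings. This is only a sketch: the names `EXTENSIONS`, `parts` and `mediaLinkPattern`, and the choice of the `gi` flags, are assumptions for illustration.

```javascript
// Extension list written once and reused in both places it is needed.
const EXTENSIONS = ["mp4", "mkv", "wmv", "m4v", "mov", "avi", "flv",
                    "webm", "flac", "mka", "m4a", "aac", "ogg"];
const ext = EXTENSIONS.join("|");

// String.raw keeps backslashes as typed, so each piece reads like the
// corresponding line of the breakdown.
const parts = [
  String.raw`(`,                             // group 1: the whole URL
  String.raw`https?(?:%3A%2F%2F|:\/\/)`,     // scheme, encoded or decoded
  String.raw`(?:www\.)?\S+(?:%2F|\/)`,       // no-space part up to the last /
  String.raw`(?:(?!\.(?:${ext}))[^\/])*`,    // TGT: up to the wanted extension
  String.raw`\.(${ext})`,                    // group 2: the extension
  String.raw`)`,                             // end of group 1
  String.raw`(?!\/|\.[a-z]{1,3})`,           // not followed by / or another extension
];
const mediaLinkPattern = new RegExp(parts.join(""), "gi");

const hits = [..."https://test.com/test.mkv?test=test".matchAll(mediaLinkPattern)];
console.log(hits[0][1], hits[0][2]); // https://test.com/test.mkv mkv
```

Adding or removing an extension then only requires editing the `EXTENSIONS` array.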

Published: 2023-08-28

Article link: http://torson.com.cn/wangzhan/1693202706a696366.html
