Regex: Is there something else I should be using to achieve what I want?
Based on many examples I found while searching, I have created a regex that I use (as a fallback) to parse direct file links from HTML source:
    /((?:(?:https?%3A%2F%2F)(?:www\.)?(?:\S+)%2F|(?:https?:\/\/)(?:www\.)?(?:\S+)\/)(?:.*)?\.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg)(?=[^.]*$))/igm
My problem is that it fails on lines that contain more than one link. I know that parsing HTML with a regex, even as a fallback, is frowned upon, so what else could I use to find ALL direct file links in a page's source? (This means links hidden in inline JavaScript, video source tags and the like, not just what document.links returns.)
If there are no better suggestions, can someone help me fix the regex to achieve what I want?
The regex should follow these guidelines:
- Return the URL in group one and the file extension in group two
- Find both encoded and decoded URLs
- Find specific file extensions (namely video and audio)
- Tolerate multi-level file extensions
- Tolerate spaces in the URL
- Allow any domain, both secure and non-secure, with or without "www" for the HTTP scheme
- Find all URLs regardless of their location in the HTML source
- Be compatible with JavaScript
Some examples that should be matched:
    http://test.com/test.mkv
    http://test.com/test/test.jpg.mkv
    https://test.com/test.mkv?test=test
    http%3A%2F%2Ftest.com%2Ftest.mkv%3Ftest%3Dtest
    https%3A%2F%2Ftest.com%2Ftest.jpg.mkv%3Ftest%3Dtest.mkv
    http://test.com/t est.mkv__some__random__string__http://test.com/test.mkv
The last example should match the two URLs, but not the __some__random__string__.
Some examples that should not be matched:
    http://test.com/test.mkv.jpg
    http://test.com/test.mkv/test.jpg
    https://test.com/test.mkv.jpg?test=test.mkv
    http%3A%2F%2Ftest.com%2Ftest.mkv.jpg
    https%3A%2F%2Ftest.com%2Ftest.mkv.jpg%3Ftest%3Dtest.mkv
    http://test.com/t est.mkv__some__random__string__http://test.com/test.mkv.jpg
The last example should match only the first URL, before __some__random__string__.
You can play with the regex and an example of some HTML source that partially fails at: http://regexr.com/3dbac
Best answer
Well, if we take into account only the sample you provided here, you might leverage a tempered greedy token (TGT) to negate the extensions you need to match:
    /((?:https?(?:%3A%2F%2F|:\/\/))(?:www\.)?(?:\S+)(?:%2F|\/)(?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])*\.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))(?!\/|\.[a-z]{1,3})/
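As a rough sketch of how this pattern might be applied to a page's source, the snippet below runs it over a few of the sample strings from the question. Two things here are assumptions rather than part of the answer: the `g` flag (added so that every link is returned, not just the first) and the helper name `findMediaLinks`, which is purely illustrative.

```javascript
// Pattern copied from the answer; the trailing `g` flag is an addition
// (an assumption) so that matchAll can walk through every link.
const mediaLink = /((?:https?(?:%3A%2F%2F|:\/\/))(?:www\.)?(?:\S+)(?:%2F|\/)(?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])*\.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))(?!\/|\.[a-z]{1,3})/g;

// Hypothetical helper: returns one { url, ext } object per match,
// taken from capture group 1 (the URL) and group 2 (the extension).
function findMediaLinks(source) {
  return Array.from(source.matchAll(mediaLink), (m) => ({ url: m[1], ext: m[2] }));
}

// A few of the sample strings from the question, checked one at a time.
const samples = [
  "http://test.com/test.mkv",
  "http%3A%2F%2Ftest.com%2Ftest.mkv%3Ftest%3Dtest",
  "http://test.com/t est.mkv__some__random__string__http://test.com/test.mkv",
  "http://test.com/test.mkv.jpg",
  "http://test.com/test.mkv/test.jpg",
];

for (const sample of samples) {
  const urls = findMediaLinks(sample).map((link) => link.url);
  console.log(JSON.stringify(sample), "->", JSON.stringify(urls));
}
```

Each sample is checked separately on purpose: the `[^\/]` part of the pattern also accepts line breaks, so a match can run across lines when a whole document is passed in at once.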
The pattern breakdown:
    (                                                           # Group 1 matching the whole URL
      (?:https?(?:%3A%2F%2F|:\/\/))(?:www\.)?(?:\S+)(?:%2F|\/)  # Matching URL part with no spaces up to the last /
      (?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])*  # TGT matching up to the extension
      \.(mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg)  # The extension
    )
    (?!\/|\.[a-z]{1,3})                                         # Only if not followed with /, or another extension
The (?:(?!\.(?:mp4|mkv|wmv|m4v|mov|avi|flv|webm|flac|mka|m4a|aac|ogg))[^\/])* TGT matches any character other than / that is not the first character of a .mp4, .mkv, etc. literal character sequence, because the negative lookahead fails the match if its pattern matches the text to the right of the current location in the string.
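To see the tempered greedy token in isolation, here is a minimal sketch that reduces the extension list to a single `.mkv` and compares it with a plain `[^\/]*`. The names `tempered`, `plain` and `prefix` are made up for this illustration and do not come from the answer.

```javascript
// Tempered greedy token: consume one non-slash character at a time, but only
// while the negative lookahead does not see ".mkv" starting at that position.
const tempered = /^(?:(?!\.mkv)[^\/])*/;

// Plain greedy class for comparison: consumes every non-slash character.
const plain = /^[^\/]*/;

// Hypothetical helper: returns the text a pattern consumes from the start.
// Both patterns can match the empty string, so match() never returns null here.
function prefix(re, text) {
  return text.match(re)[0];
}

console.log(prefix(tempered, "test.jpg.mkv?x=1")); // stops right before ".mkv"
console.log(prefix(plain, "test.jpg.mkv?x=1"));    // runs to the end of the string
```

The tempered version walks over `.jpg` because that is not a listed extension, then stops in front of `.mkv`, which is what lets the following `\.(mp4|mkv|...)` part of the full pattern capture the extension.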