Python requests.get shows 404 although the URL exists

http://www.leboncoin.fr/montres_bijoux/671762293.htm

I'm trying to open the URL above with this script:

import requests

s = requests.Session()
# Pretend to be a desktop browser.
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36'
s.headers['Host'] = 'www.leboncoin.fr'

url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm'
r = s.get(url)
print(r.text)

When I run this script, it prints this error page in my terminal:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /montres_bijoux/671762293.htm was not found on this server.</p>
</body></html>

Yet I can open the same URL in my browser and see the content.

What could be the issue?

Best answer

Without even waiting for your test, I'm pretty confident I know what your bug is.

"If I put this URL manually in the function call, it works fine, but if I read it from the file and call the function directly with that URL, I get the error. I have put 3-4 checks in place while reading the file, and the URL is coming from the file perfectly. I even tried printing the URL inside the called function, and the function receives it too. I still have no clue what is happening."

Most likely you're reading the URL with something like for line in file: or file.readline or some other function that preserves newlines. So, what you actually end up with is not this:

url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm'

… but this:

url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'

The latter will be escaped by requests into something that's a perfectly good URL for a resource that doesn't exist, hence the 404 error.
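To make that concrete, here is a minimal sketch using only the standard library. It assumes the URLs come from a text file, which is simulated here with io.StringIO (the contents are made up), and it uses urllib.parse.quote to illustrate percent-encoding in general. It does not reproduce the exact code path inside requests.

```python
import io
from urllib.parse import quote

# Stand-in for a real file of URLs (hypothetical contents).
fake_file = io.StringIO("http://www.leboncoin.fr/montres_bijoux/671762293.htm\n")

# Iterating over a file yields each line with its trailing newline intact.
line = next(iter(fake_file))
has_newline = line.endswith("\n")
print(has_newline)

# Percent-encoding the path turns the newline into %0A, so the server is
# asked for a different resource than the one you meant.
path = "/montres_bijoux/671762293.htm\n"
encoded_path = quote(path)
print(encoded_path)
```

The encoded path ends in %0A, which is a valid request for a page that does not exist on the server.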

The best way to check this is to print(repr(url)) instead of print(url). This will also reveal other possible problems, like embedded nonprintable characters. It won't find everything (for example, Unicode characters that look like "." but actually aren't), but it's a good first test. If that doesn't find it, as a second test, copy and paste the output, quotes and all, into your test script.
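As a rough sketch of that check in Python 3: the helper name suspicious_chars is hypothetical, not part of any library, and the look-alike example (a full-width full stop) is just one assumed case of a character that resembles ".".

```python
url = "http://www.leboncoin.fr/montres_bijoux/671762293.htm\n"

# repr() shows escapes such as \n that print() renders invisibly.
shown = repr(url)
print(shown)


def suspicious_chars(text):
    """Return (index, character) pairs that are non-ASCII or non-printable."""
    return [(i, ch) for i, ch in enumerate(text)
            if not ch.isascii() or not ch.isprintable()]


# The trailing newline is reported with its position in the string.
print(suspicious_chars(url))

# A full-width full stop (U+FF0E) looks like "." but is a different character.
print(suspicious_chars("www\uff0eleboncoin.fr"))
```

A scan like this catches the look-alike characters that repr alone can leave looking normal in Python 3.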

If this is the problem, the fix is simple:

url = url.rstrip()
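If the URLs are read from a file, the cleanup can be done once at read time. The following is only a sketch: read_urls is a hypothetical helper, the file is simulated with io.StringIO, and it uses strip() rather than rstrip() so that leading spaces are dropped as well.

```python
import io


def read_urls(lines):
    """Yield one cleaned URL per non-empty line."""
    for line in lines:
        url = line.strip()  # drops the trailing newline and stray spaces
        if url:             # skip blank lines
            yield url


# Stand-in for an opened text file; the contents are made up.
fake_file = io.StringIO(
    "http://www.leboncoin.fr/montres_bijoux/671762293.htm\n"
    "\n"
    "  http://www.leboncoin.fr/montres_bijoux/671762293.htm  \n"
)
urls = list(read_urls(fake_file))
print(urls)
```

Each value in urls can then be passed to s.get(url) without the stray newline.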
