htmlquery
Overview
htmlquery is an XPath query package for HTML that lets you extract data from HTML documents, or evaluate expressions against them, using XPath.
htmlquery has a built-in, LRU-based query object cache that stores recently used XPath query strings. With caching enabled, an XPath expression is not recompiled on every query.
Installation
go get github.com/antchfx/htmlquery
Getting Started
Query for matching elements; returns the matched elements or an error.
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
panic(`not a valid XPath expression.`)
}
Load HTML document from URL.
doc, err := htmlquery.LoadURL("http://example.com/")
Load an HTML document from a file.
filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)
Load HTML document from string.
s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))
Find all A elements.
list := htmlquery.Find(doc, "//a")
Find all A elements that have href attribute.
list := htmlquery.Find(doc, "//a[@href]")
Find all A elements with href attribute and only return href value.
list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
fmt.Println(htmlquery.InnerText(n)) // output @href value without A element.
}
Find the third A element.
a := htmlquery.FindOne(doc, "//a[3]")
Count all IMG elements.
expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)
FAQ
Find() vs QueryAll(), which is better?
Find and QueryAll do the same thing: both search for all matching HTML nodes.
Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.
Can I save my query expression object for the next query?
Yes, you can. The QuerySelector and QuerySelectorAll methods accept a compiled query expression object.
Caching or reusing a query expression object avoids recompiling the XPath expression and improves query performance.
Disable the caching feature
htmlquery.DisableSelectorCache = true
Changelogs
2019-11-19
- Add built-in query object cache feature, avoid re-compilation for the same query string. #16
- Added LoadDoc. #18
2019-10-05
- Add new methods that return an error for an invalid XPath expression: QueryAll and Query.
- Add QuerySelector and QuerySelectorAll methods, which support reusing your query object.
2019-02-04
- #7 Removed deprecated FindEach() and FindEachWithBreak() methods.
2018-12-28
- Avoid adding duplicate elements to list for Find() method. #6
Tutorial
package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		a := htmlquery.FindOne(n, "//a")
		if a == nil {
			continue // Skip items that contain no link.
		}
		fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
	}
}
List of supported XPath query packages
| Name | Description |
|---|---|
| htmlquery | XPath query package for the HTML document |
| xmlquery | XPath query package for the XML document |
| jsonquery | XPath query package for the JSON document |
Questions
Please let me know if you have any questions.