html Extraction
#3
I have made some progress after finding the following code on the forum. However, it still doesn't do everything I need, because I'm missing the industry, city, state, and the date the contact was added, all of which I need from my extraction.

Gintaras, I would really appreciate it if you could help me figure out the last piece of the puzzle here. I tried to use .className to identify the "location" and "contact-industry" classes, but for some reason I can't get a sel/case inside the for loop to capture these values separately. The final piece is getting the data into the rows and columns of a tab-delimited CSV file. Any help you could provide with that would be great too.

Code:
str s=
<BODY>
<div class="detail-container">
        <div class="name-row">
            <a href="/contact/b070f5e9-30d7-3da5-bc39-780c3455b71e">Mitch  Acker</a>
        </div>
        <div class="search-result-subheadline">
            <span class="large-black-text">President, Sales Executive at </span>
            <span class="contact-company-name"><a href="/company/66819229-e58e-36e8-a282-c11f68eb2453" class="clickable">Martinaire Inc</a></span>
        </div>
        <div class="compact-section">
            <div class="location">Addison,
                Texas,
                United States
                <div class="contact-industry">Airlines</div>
            </div>
            <div class="compact-section">
                  <div class="small-data-label">Main:</div>
                  <div class="inline-block black-text"><span id="gc-number-20" class="gc-cs-link" title="Call with Google Voice">972-349-5700</span></div>
                <div>
                    <div class="small-data-label">Email:</div>
                    <a class="black-text" href="mailto:[email protected]">[email protected]</a>
                </div>
            </div>
            <div class="">
            </div>
        </div>
    </div>
<div class="detail-container">
        <div class="name-row">
            <a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex  Abadi</a>
        </div>
        <div class="search-result-subheadline">
            <span class="large-black-text">Chief Executive Officer at </span>
            <span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
        </div>
        <div class="compact-section">
            <div class="location">Austin,
                Texas,
                United States
                <div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
            </div>
            <div class="compact-section">
                  <div class="small-data-label">Main:</div>
                  <div class="inline-block black-text"><span id="gc-number-24" class="gc-cs-link" title="Call with Google Voice">512-623-5621</span></div>
                  <div>
                      <div class="small-data-label">Direct:</div>
                      <div class="inline-block black-text"><span id="gc-number-25" class="gc-cs-link" title="Call with Google Voice">512-623-5642</span></div>
                  </div>
                <div>
                    <div class="small-data-label">Email:</div>
                    <a class="black-text" href="mailto:[email protected]">[email protected]</a>
                </div>
            </div>
            <div class="">
            </div>
        </div>
    </div>
</BODY>

out
s.findreplace("span" "a")
HtmlDoc d.InitFromText(s)
ARRAY(MSHTML.IHTMLElement) h2 div
int i j
d.GetHtmlElements(div "div")
for i 0 div.len
    str cn=div[i].className
    if cn="detail-container"
        d.GetHtmlElements(h2 "a" "" div[i].sourceIndex)
        for j 0 h2.len
            out h2[j].innerText
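
To make the target output concrete, here is a minimal sketch of the extraction I'm after. It is written in Python rather than QM, using only the standard library, so it shows the logic and not a QM solution. The names ContactParser, extract_rows and to_tsv are my own, made up for illustration. The idea is to keep a stack of open elements and route each piece of text to a field based on the nearest enclosing class, which is how "location" and "contact-industry" end up captured separately even though one is nested in the other. The date the contact was added does not appear in the sample HTML, so it is left out.

```python
import csv
import io
from html.parser import HTMLParser

# Trimmed copy of the two sample records from the post.
SAMPLE_HTML = """
<BODY>
<div class="detail-container">
  <div class="name-row">
    <a href="/contact/b070f5e9-30d7-3da5-bc39-780c3455b71e">Mitch  Acker</a>
  </div>
  <div class="search-result-subheadline">
    <span class="large-black-text">President, Sales Executive at </span>
    <span class="contact-company-name"><a href="/company/66819229-e58e-36e8-a282-c11f68eb2453" class="clickable">Martinaire Inc</a></span>
  </div>
  <div class="compact-section">
    <div class="location">Addison,
      Texas,
      United States
      <div class="contact-industry">Airlines</div>
    </div>
  </div>
</div>
<div class="detail-container">
  <div class="name-row">
    <a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex  Abadi</a>
  </div>
  <div class="search-result-subheadline">
    <span class="large-black-text">Chief Executive Officer at </span>
    <span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
  </div>
  <div class="compact-section">
    <div class="location">Austin,
      Texas,
      United States
      <div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
    </div>
  </div>
</div>
</BODY>
"""

# Which class name feeds which field of a record.
FIELD_BY_CLASS = {
    "name-row": "name",
    "large-black-text": "title",
    "contact-company-name": "company",
    "location": "location",
    "contact-industry": "industry",
}
HEADER = ["Name", "Title", "Company", "City", "State", "Country", "Industry"]


class ContactParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, class) for every element currently open
        self.records = []  # one dict per detail-container

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if cls == "detail-container":
            self.records.append({f: "" for f in FIELD_BY_CLASS.values()})
        self.stack.append((tag, cls))

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.records:
            return
        # The nearest enclosing element with a known class decides the field.
        for _, cls in reversed(self.stack):
            field = FIELD_BY_CLASS.get(cls)
            if field:
                self.records[-1][field] += data
                break


def clean(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def extract_rows(html):
    parser = ContactParser()
    parser.feed(html)
    parser.close()
    rows = []
    for rec in parser.records:
        # The location text is "City, State, Country" spread over several lines.
        parts = [clean(p) for p in rec["location"].split(",")]
        parts += [""] * (3 - len(parts))
        title = clean(rec["title"])
        if title.endswith(" at"):
            title = title[:-3]
        rows.append([clean(rec["name"]), title, clean(rec["company"]),
                     parts[0], parts[1], parts[2], clean(rec["industry"])])
    return rows


def to_tsv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


print(to_tsv(extract_rows(SAMPLE_HTML)))
```

Each contact becomes one row with the city, state, country and industry in their own columns; writing the string returned by to_tsv to a file would give the tab-delimited output I'm looking for.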


Thanks Again,

Paul

