How to extract the location of a shorter sequence based on a longer sequence using Python?


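I have a file containing sequence IDs and binding site location information. I want to extract the location information without the a, t, c, g sequence itself. The shorter sequence printed above each longer sequence shows the location, and the number on the left of each longer sequence is the position of its first character; in the example file, the value 451 is such a position. The location of the short sequence on the longer sequence starts at 453 (the start site). To obtain the end site, take the length of the shorter sequence, 21, and add it to 453, which gives 474. Can anyone help me?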

File a.txt:

chr1:152806601-152807450
        ttcagcaccatggacagcgcc
451  ggcttcagcaccacggacagcgccccacccgcggccctccccccggcggcgcgctccagccggtgtaggcgaggc
            ttcagcaccatggacagcgcc
751  agagccccccgggactgcagagagcacctgggaggctggactgggaacgagacatactcgaaggagtaagtgaag
chr10:125364276-125364825
                       ttcagcaccatggacagcgcc
301  cagtaatgtggggttgtggtcagcaccatggacagctcccctgttgcttcatattgaggaataggaaagcgccgc
        ttcagcaccatggacagcgcc
376  tatctccggatcctggctagctccagccactgcaggtaactgtcttgaatgggcttagaaacatggtgatgtctg

Desired output:

chr1:152806601-152807450 453 474
chr1:152806601-152807450 757 778
chr10:125364276-125364825 318 339
chr10:125364276-125364825 378 399

Example code:

import re

with open("a.txt", "r") as f:
    lines = f.readlines()

label_ptrn = re.compile("")   # insert regular expression for the sequence id
line_ptrn = re.compile("")    # insert regular expression for the start site
inner_ptrn = re.compile("")   # insert regular expression for the end site

all_matches = []
for line in lines:
    m = label_ptrn.match(line)
    if m:
        label = m.groupdict().get("label")
        continue
    m = line_ptrn.match(line)
    if m:
        start = m.groupdict().get("start_value")
        sequence = m.groupdict().get("sequence")
        mi = inner_ptrn.search(sequence)
        if not mi:
            continue
        span = mi.span()
        all_matches.append((label, int(start)+span[0], int(start)+span[1]))

with open("a_ouput.bed", "w") as f:
    for m in all_matches:
        f.write('%s\t%i\t%i\n' % m)

So it looks to me like the start positions in your desired output are off by one.

        ttcagcaccatggacagcgcc
451  ggcttcagcaccacggacagcgccccacccgcggccctccccccggcggcgcgctccagccggtgtaggcgaggc

The first t in the shorter sequence sits above the fourth character of the longer sequence. If the first character of the longer sequence is at position 451, that puts the first character of the shorter sequence at position 454.
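The arithmetic can be checked directly on those two lines. The snippet below is only a small sketch: it assumes the short sequence line begins with exactly eight spaces and that two spaces separate 451 from the long sequence, as in the sample. The helper name first_letter_index is introduced here for illustration.

```python
def first_letter_index(line):
    """Return the 0-based column of the first alphabetic character in line."""
    for i, c in enumerate(line):
        if c.isalpha():
            return i
    raise ValueError("no letters in line")

# Assumed layout: 8 leading spaces on the short line, "451" + 2 spaces on the long line.
short_line = "        ttcagcaccatggacagcgcc"
long_line = "451  ggcttcagcaccacggacagcgccccacccgcggccctccccccggcggcgcgctccagccggtgtaggcgaggc"

short_index = first_letter_index(short_line)  # column where the short sequence begins
long_index = first_letter_index(long_line)    # column where the long sequence begins
offset = short_index - long_index             # how far right the short sequence sits
start = int(long_line.split()[0]) + offset    # 451 + offset
end = start + len(short_line.strip())         # start + length of the short sequence
print(start, end)  # 454 475
```

Subtracting 1 from start before computing end gives 453 and 474, the values in the desired output.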

If the file structure is constant, here is a non-regex solution.

result = []
with open('file.txt') as f:
    for line in f:
        if line.startswith('chr'):
            label = line.strip()
        elif line[0] == ' ':
            # short sequence
            length = len(line.strip())
            # find index of beginning of short sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    short_index = i
                    break
        elif line[0].isdigit():
            # long sequence
            n = line.split(' ')[0]
            # or
            # n = line[:line.index(' ')]
            # find index of beginning of long sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    long_index = i
                    break
            start = int(n) + short_index - long_index
            # start -= 1
            end = start + length
            result.append('{} {} {}'.format(label, start, end))
            offset, n, start, length = 0, 0, 0, 0

Result:

['chr1:152806601-152807450 454 475',
 'chr1:152806601-152807450 758 779',
 'chr10:125364276-125364825 319 340',
 'chr10:125364276-125364825 379 400']

If I have misinterpreted the example data, uncomment the start -= 1 line.
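For comparison with the regex skeleton in the question, here is one possible way to fill in the patterns. It is a sketch, not a tested solution for every file. The three patterns, the function name find_sites, and the shift parameter are assumptions made for illustration. It also assumes labels start with chr, short sequence lines start with whitespace, and long sequence lines start with a number.

```python
import re

# Hypothetical patterns; adjust them if the real file differs from the sample.
LABEL = re.compile(r"^(?P<label>chr\S+)")
SHORT = re.compile(r"^\s+(?P<short>[acgtACGT]+)")
LONG = re.compile(r"^(?P<start_value>\d+)\s+(?P<sequence>[acgtACGT]+)")

def find_sites(lines, shift=0):
    """Return (label, start, end) tuples; shift=1 mimics the start -= 1 option."""
    results = []
    label = short_col = short_len = None
    for line in lines:
        m = LABEL.match(line)
        if m:
            label = m.group("label")
            continue
        m = SHORT.match(line)
        if m:
            # Remember the column and length of the short sequence.
            short_col = m.start("short")
            short_len = len(m.group("short"))
            continue
        m = LONG.match(line)
        if m and short_col is not None:
            # Column difference between the two sequences gives the offset.
            offset = short_col - m.start("sequence")
            start = int(m.group("start_value")) + offset - shift
            results.append((label, start, start + short_len))
            short_col = None
    return results

sample = [
    "chr1:152806601-152807450",
    "        ttcagcaccatggacagcgcc",
    "451  ggcttcagcaccacggacagcgcc",
]
for row in find_sites(sample):
    print("%s\t%d\t%d" % row)
```

Calling find_sites(sample, shift=1) yields 453 and 474 for this record, which matches the desired output in the question. The rows are printed tab-separated, as in the question's .bed output.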

